Hello everyone,

Following some initial discussions with Jarek Potiuk and a previously opened PR, I would like to formally propose the introduction of an Apache Arrow / ADBC provider for Airflow.
Context & Motivation:

While Airflow has a rich set of database-specific providers, the data ecosystem is rapidly shifting toward ADBC (Arrow Database Connectivity). ADBC solves many of the "bottleneck" issues associated with traditional DB-API 2.0, ODBC or JDBC drivers by leveraging columnar data access and Arrow-native memory representation. We are seeing significant momentum here:

* Performance: Significant reduction in serialization overhead for bulk operations. While results vary by driver maturity and server-side native Arrow support (e.g., Flight endpoints), ADBC provides a much higher performance ceiling than standard PEP 249 drivers.
* Standardization: Systems like Snowflake, Apache DataFusion and DuckDB are increasingly treating Arrow as a first-class citizen.
* Future-proofing: Tools like dbt-fusion and various lakehouse architectures are moving toward Arrow-based execution.

The Proposal:

I propose adding an apache-airflow-providers-apache-arrow provider (or similar) that introduces an AdbcHook.

Key Technical Highlights:

* Compatibility: By extending DbApiHook, the AdbcHook will be immediately compatible with existing SQL operators.
* Efficiency: It will offer a high-performance alternative to traditional row-based drivers without requiring users to rewrite their DAG logic.
* Scope: Focus on providing a standardized interface for Arrow-native bulk reads and writes (planned as a future enhancement to AdbcHook).

Community & Maintenance:

I have already started the groundwork in a Draft PR (#52330). I believe this aligns with the project's goal of supporting high-performance data engineering patterns.

I'm looking for feedback on:

* Naming: Should this be a standalone adbc provider or part of an apache.arrow provider? I chose the latter, but this is open for discussion.
* Scope: So far I have focused only on the Hook/Connection. Because it extends DbApiHook and implements all required methods, it is already directly usable with SQL operators.
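To make the compatibility point under the technical highlights more concrete, here is a minimal sketch of the idea, not the code from the draft PR. It assumes only that the hook hands out a PEP 249 connection; the class name SketchAdbcHook and its methods are hypothetical, and the standard-library sqlite3 module stands in for an ADBC driver so the snippet runs without extra dependencies.

```python
import sqlite3
from typing import Any, Callable, List, Sequence, Tuple


class SketchAdbcHook:
    """Hypothetical stand-in for the proposed AdbcHook (illustration only)."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        # `connect` is any zero-argument factory returning a PEP 249 connection.
        # Assumption: an ADBC driver's DB-API module could be plugged in here.
        self._connect = connect

    def get_conn(self) -> Any:
        # The single driver-specific piece: everything else is generic DB-API code.
        return self._connect()

    def get_records(self, sql: str, parameters: Sequence[Any] = ()) -> List[Tuple]:
        # Generic row-based read that relies only on PEP 249 cursor methods.
        conn = self.get_conn()
        try:
            cur = conn.cursor()
            try:
                cur.execute(sql, parameters)
                return [tuple(row) for row in cur.fetchall()]
            finally:
                cur.close()
        finally:
            conn.close()


# sqlite3 is used purely as a placeholder PEP 249 driver.
hook = SketchAdbcHook(lambda: sqlite3.connect(":memory:"))
rows = hook.get_records("SELECT ? + 1 AS n", (41,))
print(rows)
```

With an ADBC driver installed, the factory could instead be something like adbc_driver_sqlite.dbapi.connect, since the ADBC Python packages expose a DB-API 2.0 layer. Their cursors additionally offer Arrow-native fetches such as fetch_arrow_table(), which is where the bulk read path mentioned under Scope would hook in.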
I'd love to gather your thoughts and gauge interest before moving to a formal voting thread.

Draft PR: https://github.com/apache/airflow/pull/52330

Best regards,
David
