Hello everyone,

Following some initial discussions with Jarek Potiuk and a previously opened 
PR, I would like to formally propose the introduction of an Apache Arrow / ADBC 
provider for Airflow.

Context & Motivation:

While Airflow has a rich set of database-specific providers, the data ecosystem 
is rapidly shifting toward ADBC (Arrow Database Connectivity).
ADBC solves many of the "bottleneck" issues associated with traditional DB-API 
2.0, ODBC or JDBC drivers by leveraging columnar data access and Arrow-native 
memory representation.

We are seeing significant momentum here:


  *   Performance: Significantly lower serialization overhead for bulk 
operations. While results vary by driver maturity and server-side native Arrow 
support (e.g., Arrow Flight endpoints), ADBC provides a much higher performance 
ceiling than standard PEP 249 (DB-API 2.0) drivers.
  *   Standardization: Systems like Snowflake, Apache DataFusion and DuckDB are 
increasingly treating Arrow as a first-class citizen.
  *   Future-proofing: Tools like dbt-fusion and various lakehouse 
architectures are moving toward Arrow-based execution.

The Proposal:

I propose adding an apache-airflow-providers-apache-arrow (or similar) that 
introduces an AdbcHook.

Key Technical Highlights:


  *   Compatibility: By extending DbApiHook, the AdbcHook will be 
immediately compatible with existing SQL operators.
  *   Efficiency: It will offer a high-performance alternative to traditional 
row-based drivers without requiring users to rewrite their DAG logic.
  *   Scope: The focus is on a standardized interface; Arrow-native bulk 
reads and writes are planned as a future enhancement to AdbcHook.
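
To make the compatibility point concrete, here is a minimal sketch of the 
pattern, not the code from the PR. MiniDbApiHook and SketchAdbcHook are 
hypothetical stand-ins (they are not Airflow classes), and the standard-library 
sqlite3 module stands in for an ADBC driver's DB-API module so the example runs 
without extra dependencies. The idea it illustrates: shared helpers are written 
against get_conn(), so a hook that supplies a PEP 249 connection inherits them.

```python
import sqlite3
from typing import Any, Callable, List, Sequence, Tuple


class MiniDbApiHook:
    """Hypothetical stand-in for a DbApiHook-style base class (NOT Airflow's).

    Shared helpers only rely on get_conn() returning a PEP 249 connection.
    """

    def get_conn(self) -> Any:
        raise NotImplementedError

    def run(self, sql: str, parameters: Sequence[Any] = ()) -> None:
        # Execute a statement and commit, using only PEP 249 calls.
        conn = self.get_conn()
        cur = conn.cursor()
        try:
            cur.execute(sql, parameters)
            conn.commit()
        finally:
            cur.close()

    def get_records(self, sql: str, parameters: Sequence[Any] = ()) -> List[Tuple]:
        # Execute a query and return all rows.
        cur = self.get_conn().cursor()
        try:
            cur.execute(sql, parameters)
            return cur.fetchall()
        finally:
            cur.close()


class SketchAdbcHook(MiniDbApiHook):
    """Hypothetical hook: it only supplies the connection."""

    def __init__(self, connect: Callable[[], Any]) -> None:
        # `connect` is an assumption: any zero-argument callable returning a
        # PEP 249 connection. sqlite3 is used below purely as a stand-in.
        self._connect = connect
        self._conn = None

    def get_conn(self) -> Any:
        # Cache the connection so every helper call shares one session.
        if self._conn is None:
            self._conn = self._connect()
        return self._conn


hook = SketchAdbcHook(lambda: sqlite3.connect(":memory:"))
hook.run("CREATE TABLE t (id INTEGER, name TEXT)")
hook.run("INSERT INTO t VALUES (?, ?)", (1, "arrow"))
rows = hook.get_records("SELECT id, name FROM t ORDER BY id")
print(rows)
```

With a real ADBC driver, the connect callable would come from the driver's 
DB-API module instead (for example adbc_driver_sqlite.dbapi.connect), whose 
cursors additionally expose Arrow-native fetches such as fetch_arrow_table(). 
How the actual AdbcHook wires this up is defined by the PR, not by this sketch.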

Community & Maintenance:

I have already started the groundwork in a Draft PR (#52330).

I believe this aligns with the project's goal of supporting high-performance 
data engineering patterns. I'm looking for feedback on:


  *   Naming: Should this be a standalone adbc provider or part of an 
apache.arrow provider? I chose the latter, but this is open for discussion.
  *   Scope: So far I have focused purely on the Hook/Connection. Because it 
extends DbApiHook and implements all required methods, it is already directly 
usable in SQL operators.

I'd love to gather your thoughts and gauge interest before moving to a formal 
voting thread.

Draft PR: https://github.com/apache/airflow/pull/52330

Best regards,
David
