lwhite1 commented on code in PR #248:
URL: https://github.com/apache/arrow-site/pull/248#discussion_r1049963986


##########
_posts/2022-12-31-arrow-adbc.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Arrow community has accepted version 1.0.0 of the [Arrow Database 
Connectivity (ADBC)][adbc] specification.

Review Comment:
   I wonder if it's not possible to lead with a succinct statement of ADBC's 
benefits.  Even though this is hosted on the Arrow site, it may attract people 
who aren't that familiar with arrow. I would only lead with the acceptance news 
if you think the main audience is people in the community who have been waiting 
for the production release?



##########
_posts/2022-12-31-arrow-adbc.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Arrow community has accepted version 1.0.0 of the [Arrow Database 
Connectivity (ADBC)][adbc] specification.
+ADBC is an API standard for Arrow-based database access.
+It defines abstract APIs in C, Go, and Java for performing common database 
tasks using Arrow data, like executing queries and getting basic metadata.
+
+Just like time-tested [JDBC][jdbc] and [ODBC][odbc], ADBC defines 
database-independent interaction APIs, and relies on drivers to implement those 
APIs for particular databases.
+It aims to provide a single API that works with Arrow-native protocols, like [Arrow Flight SQL][flight-sql]; vendor-specific APIs that offer Arrow data, such as those of ClickHouse or Google BigQuery; and non-columnar protocols and APIs like the PostgreSQL wire format or JDBC.
+
+In other words: **ADBC is a single API for getting Arrow data in and out of 
databases**.
+Underneath, ADBC driver implementations take care of bridging to the actual database system.
+Arrow-native systems can directly pass through data without conversion, while 
the driver takes care of converting data in other cases, saving the application 
from doing the work.
+
+## Examples
+
+*Note: implementations are still under development, separate from the API 
standard itself. Examples are subject to change.*
+
+We can get Arrow data out of a Postgres database in a few lines of Python:
+
+```python
+import adbc_driver_postgresql.dbapi
+
+uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
+with adbc_driver_postgresql.dbapi.connect(uri) as conn:
+    with conn.cursor() as cur:
+        cur.execute("SELECT * FROM customer")
+        table = cur.fetch_arrow_table()
+        # Process the results
+```
+
+The ADBC Python packages offer a familiar [DBAPI 2.0 (PEP 249)][pep-249]-style 
API, along with extensions to get Arrow data.
+
+In Java, we can pull Arrow data out of a JDBC connection.
+The ADBC driver takes care of converting the data for the application:
+
+```java
+final Map<String, Object> parameters = new HashMap<>();
+parameters.put(
+    AdbcDriver.PARAM_URL,
+    "jdbc:postgresql://localhost:5432/postgres?user=postgres&password=password");
+try (final AdbcDatabase database = JdbcDriver.INSTANCE.open(parameters);
+    final AdbcConnection connection = database.connect();
+    final AdbcStatement stmt = connection.createStatement()) {
+  stmt.setSqlQuery("SELECT * FROM " + tableName);
+  try (AdbcStatement.QueryResult queryResult = stmt.executeQuery()) {
+    while (queryResult.getReader().loadNextBatch()) {
+      // Process the results
+    }
+  }
+}
+```
+
+## Motivation
+
+Applications often use API standards like JDBC and ODBC to work with databases.
+This lets them use the same API regardless of the underlying database, saving 
on development time.
+Roughly speaking, when an application executes a query with these APIs:
+
+1. The application submits a SQL query via the JDBC/ODBC APIs.
+2. The query is passed on to the driver.
+3. The driver translates the query to a database-specific protocol and sends 
it to the database.
+4. The database executes the query and returns the result set in a 
database-specific format.
+5. The driver translates the result format into the JDBC/ODBC API.
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCFlow1.svg" width="90%" 
class="img-responsive" alt="A diagram showing the query execution flow.">
+  <figcaption>The query execution flow.</figcaption>
+</figure>
+
+When columnar data comes into play, however, problems arise.
+JDBC is a row-oriented API, and while ODBC can support columnar data, its type system and data representation are not a perfect match for Arrow.
+In both cases, this leads to data conversions around steps 4–5, spending 
resources without performing "useful" work.
+
+This mismatch is important for columnar database systems, such as ClickHouse, 
Dremio, DuckDB, Google BigQuery, and others.
+Clients, such as Apache Spark and pandas, would like to get columnar data 
directly from these systems.
+Meanwhile, traditional database systems aren't going away, and these clients 
still want to consume data from them.
+
+In response, we've seen a few solutions:
+
+- *Just provide JDBC/ODBC drivers*.
+  These standards are here to stay, and it makes sense to provide these 
interfaces for applications that want them.
+  But if both sides are columnar, that means converting data into rows for 
JDBC/ODBC, only for the client to convert them back into columns!
+- *Provide converters from JDBC/ODBC to Arrow*.
+  Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc].
+  This approach reduces the burden on client applications, but doesn't 
fundamentally solve the problem.
+  Unnecessary data conversions are still required.
+- *Provide special SDKs*.
+  All of the columnar systems listed above do offer ways to get Arrow data, 
such as via [Arrow Flight SQL][flight-sql].
+  But client applications need to spend time to integrate with each of them.
+  (Just look at all the 
[connectors](https://trino.io/docs/current/connector.html) that Trino 
implements.)
+  And not every system offers this option.
+
+ADBC combines the advantages of the latter two solutions under one API.
+In other words, ADBC provides a set of Arrow-based API definitions that client applications code against.
+The application then links to or loads drivers that implement those API definitions for the actual database.
+If the database is Arrow-native, the driver can just pass the data through 
without conversion.
+Otherwise, the driver converts the data to Arrow format first.
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCFlow2.svg" alt="A diagram showing the 
query execution flow with ADBC." width="90%" class="img-responsive">
+  <figcaption>The query execution flow with two different ADBC 
drivers.</figcaption>
+</figure>
+
+1. The application submits a SQL query via the ADBC APIs.
+2. The query is passed on to the ADBC driver.
+3. The driver translates the query to a database-specific protocol and sends 
the query to the database.
+4. The database executes the query and returns the result set in a 
database-specific format, which is ideally Arrow data.
+5. The driver translates the result format into Arrow data if needed.
+
+Either way, the client uses a single API and always gets Arrow data.
+
+## In More Detail
+
+As mentioned, ADBC is an API standard.
+You can read the [C/C++ header][adbc.h], [Go interfaces][adbc-go], and the 
[Java interfaces][adbc-core] for yourself.
+The project has built several libraries around this standard:
+
+- Driver implementations implement the API standard on top of a particular 
protocol or vendor.
+  See the drivers for [libpq (PostgreSQL)][adbc-libpq] and [JDBC][adbc-jdbc].
+- The driver manager library provides an easy way to use multiple drivers from a single program and to load drivers at runtime if desired.
+
+This infrastructure is reusable by other drivers/vendors and accessible to 
other languages.
+For example, the Python bindings shown above are actually generic Python 
bindings for any ADBC driver, courtesy of the driver manager.
+
+ADBC builds on several existing Arrow projects.
+
+- The [Arrow C Data Interface][c-data-interface] provides a cross-language way 
to pass Arrow data between ADBC implementations and ADBC users.
+- [Flight SQL][flight-sql] is used to implement an ADBC driver, making Flight 
SQL accessible to languages like Python.
+- [Nanoarrow][nanoarrow] is used to implement some ADBC drivers, making them 
lightweight and independent of the core Arrow libraries.
+
+Also, ADBC provides API support for the emerging [Substrait][substrait] 
specification.
+Applications can pass queries either as strings (of SQL) or binary blobs (of 
serialized Substrait plans).
+
+## What about {Flight RPC, Flight SQL, JDBC, ODBC, …}?
+
+ADBC fills a specific niche that related projects do not address. It is both:
+
+- **Arrow-native**: ADBC can pass through Arrow data with no overhead thanks 
to the [C Data Interface][c-data-interface].
+  JDBC is row-oriented, and ODBC has implementation caveats, as discussed, 
that make it hard to use with Arrow.
+- **Vendor-agnostic**: ADBC drivers can implement the API using any underlying 
protocol, while Flight SQL requires server-side support that may not be easy to 
add.
+
+Plotted along these axes, we see that ADBC is both vendor-agnostic and 
Arrow-native, while Flight SQL is vendor-specific, and JDBC/ODBC are 
row-oriented:
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCQuadrants.svg"
+       alt="ADBC is Arrow-native and database-agnostic, while Flight SQL is 
Arrow-native and database-specific. JDBC and ODBC are row-oriented and 
database-agnostic, and wire protocols like Postgres's and SQL Server's are 
row-oriented and database-specific."
+       width="90%" class="img-responsive">
+</figure>

Review Comment:
   The figure is helpful, but the details are a little confusing. Is 
Arrow-native the opposite of row-oriented? Is JDBC less row-oriented than ODBC? 
Based on the text, it seems like ODBC is actually closer to the columnar side. 



##########
_posts/2022-12-31-arrow-adbc.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Arrow community has accepted version 1.0.0 of the [Arrow Database 
Connectivity (ADBC)][adbc] specification.
+ADBC is an API standard for Arrow-based database access.
+It defines abstract APIs in C, Go, and Java for performing common database 
tasks using Arrow data, like executing queries and getting basic metadata.
+
+Just like time-tested [JDBC][jdbc] and [ODBC][odbc], ADBC defines 
database-independent interaction APIs, and relies on drivers to implement those 
APIs for particular databases.
+It aims to provide a single API that works with Arrow-native protocols, like [Arrow Flight SQL][flight-sql]; vendor-specific APIs that offer Arrow data, such as those of ClickHouse or Google BigQuery; and non-columnar protocols and APIs like the PostgreSQL wire format or JDBC.
+
+In other words: **ADBC is a single API for getting Arrow data in and out of 
databases**.
+Underneath, ADBC driver implementations take care of bridging to the actual database system.
+Arrow-native systems can directly pass through data without conversion, while 
the driver takes care of converting data in other cases, saving the application 
from doing the work.
+
+## Examples
+
+*Note: implementations are still under development, separate from the API 
standard itself. Examples are subject to change.*
+
+We can get Arrow data out of a Postgres database in a few lines of Python:
+
+```python
+import adbc_driver_postgresql.dbapi
+
+uri = "postgresql://localhost:5432/postgres?user=postgres&password=password"
+with adbc_driver_postgresql.dbapi.connect(uri) as conn:
+    with conn.cursor() as cur:
+        cur.execute("SELECT * FROM customer")
+        table = cur.fetch_arrow_table()
+        # Process the results
+```
+
+The ADBC Python packages offer a familiar [DBAPI 2.0 (PEP 249)][pep-249]-style 
API, along with extensions to get Arrow data.
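+Because the bindings follow DBAPI 2.0, the connect/cursor/execute shape is the one Python developers already know from any PEP 249 driver. As a rough, runnable analogy using the standard library's `sqlite3` (standing in for an ADBC driver here purely for illustration; only `fetch_arrow_table()` and friends are ADBC-specific):

```python
import sqlite3

# Any PEP 249 driver exposes the same connect/cursor/execute shape that
# the ADBC DBAPI bindings mirror; sqlite3 stands in for illustration.
with sqlite3.connect(":memory:") as conn:
    cur = conn.cursor()
    cur.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
    cur.execute("INSERT INTO customer VALUES (1, 'alice'), (2, 'bob')")
    cur.execute("SELECT name FROM customer ORDER BY id")
    rows = cur.fetchall()  # plain rows here; ADBC adds fetch_arrow_table()

print(rows)  # [('alice',), ('bob',)]
```

The ADBC packages layer Arrow-fetching methods on top of this familiar shape rather than inventing a new API.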
+
+In Java, we can pull Arrow data out of a JDBC connection.
+The ADBC driver takes care of converting the data for the application:
+
+```java
+final Map<String, Object> parameters = new HashMap<>();
+parameters.put(
+    AdbcDriver.PARAM_URL,
+    "jdbc:postgresql://localhost:5432/postgres?user=postgres&password=password");
+try (final AdbcDatabase database = JdbcDriver.INSTANCE.open(parameters);
+    final AdbcConnection connection = database.connect();
+    final AdbcStatement stmt = connection.createStatement()) {
+  stmt.setSqlQuery("SELECT * FROM " + tableName);
+  try (AdbcStatement.QueryResult queryResult = stmt.executeQuery()) {
+    while (queryResult.getReader().loadNextBatch()) {
+      // Process the results
+    }
+  }
+}
+```
+
+## Motivation
+
+Applications often use API standards like JDBC and ODBC to work with databases.
+This lets them use the same API regardless of the underlying database, saving 
on development time.
+Roughly speaking, when an application executes a query with these APIs:
+
+1. The application submits a SQL query via the JDBC/ODBC APIs.
+2. The query is passed on to the driver.
+3. The driver translates the query to a database-specific protocol and sends 
it to the database.
+4. The database executes the query and returns the result set in a 
database-specific format.
+5. The driver translates the result format into the JDBC/ODBC API.
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCFlow1.svg" width="90%" 
class="img-responsive" alt="A diagram showing the query execution flow.">
+  <figcaption>The query execution flow.</figcaption>
+</figure>
+
+When columnar data comes into play, however, problems arise.
+JDBC is a row-oriented API, and while ODBC can support columnar data, its type system and data representation are not a perfect match for Arrow.
+In both cases, this leads to data conversions around steps 4–5, spending 
resources without performing "useful" work.
+
+This mismatch is important for columnar database systems, such as ClickHouse, 
Dremio, DuckDB, Google BigQuery, and others.
+Clients, such as Apache Spark and pandas, would like to get columnar data 
directly from these systems.
+Meanwhile, traditional database systems aren't going away, and these clients 
still want to consume data from them.
+
+In response, we've seen a few solutions:
+
+- *Just provide JDBC/ODBC drivers*.
+  These standards are here to stay, and it makes sense to provide these 
interfaces for applications that want them.
+  But if both sides are columnar, that means converting data into rows for 
JDBC/ODBC, only for the client to convert them back into columns!
+- *Provide converters from JDBC/ODBC to Arrow*.
+  Some examples include [Turbodbc][turbodbc] and [arrow-jdbc][arrow-jdbc].
+  This approach reduces the burden on client applications, but doesn't 
fundamentally solve the problem.
+  Unnecessary data conversions are still required.
+- *Provide special SDKs*.
+  All of the columnar systems listed above do offer ways to get Arrow data, 
such as via [Arrow Flight SQL][flight-sql].
+  But client applications need to spend time to integrate with each of them.
+  (Just look at all the 
[connectors](https://trino.io/docs/current/connector.html) that Trino 
implements.)
+  And not every system offers this option.
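+The cost the first option imposes when both sides are columnar can be sketched in a few lines of plain Python (a toy model with dicts of lists; no real driver or wire protocol works this way):

```python
# Toy illustration of the wasted roundtrip when both sides are columnar:
# the server's columns are serialized as rows, then rebuilt as columns.
server_columns = {"id": [1, 2, 3], "name": ["a", "b", "c"]}

# Step 1: a row-oriented API (JDBC/ODBC-style) forces rows on the wire.
rows = list(zip(server_columns["id"], server_columns["name"]))

# Step 2: the columnar client immediately converts the rows back.
ids, names = zip(*rows)
client_columns = {"id": list(ids), "name": list(names)}

# Same data on both ends; two conversions performed for nothing.
assert client_columns == server_columns
```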
+
+ADBC combines the advantages of the latter two solutions under one API.
+In other words, ADBC provides a set of Arrow-based API definitions that client applications code against.
+The application then links to or loads drivers that implement those API definitions for the actual database.
+If the database is Arrow-native, the driver can just pass the data through 
without conversion.
+Otherwise, the driver converts the data to Arrow format first.
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCFlow2.svg" alt="A diagram showing the 
query execution flow with ADBC." width="90%" class="img-responsive">
+  <figcaption>The query execution flow with two different ADBC 
drivers.</figcaption>
+</figure>
+
+1. The application submits a SQL query via the ADBC APIs.
+2. The query is passed on to the ADBC driver.
+3. The driver translates the query to a database-specific protocol and sends 
the query to the database.
+4. The database executes the query and returns the result set in a 
database-specific format, which is ideally Arrow data.
+5. The driver translates the result format into Arrow data if needed.
+
+Either way, the client uses a single API and always gets Arrow data.
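+The two driver paths can be caricatured in plain Python (a toy model; none of these names are real ADBC APIs): the application codes against one interface, and only the non-columnar path pays a conversion cost inside the driver.

```python
# Toy model of the two driver paths; none of these names are real ADBC APIs.
def arrow_native_driver(query):
    # Arrow-native database: columnar data is passed through untouched.
    return {"id": [1, 2, 3]}

def row_based_driver(query):
    # Row-oriented database: the driver converts rows to columns once,
    # so the application never has to.
    rows = [(1,), (2,), (3,)]
    return {"id": [r[0] for r in rows]}

# The application sees one interface and columnar data regardless of
# which driver sits underneath.
for driver in (arrow_native_driver, row_based_driver):
    assert driver("SELECT id FROM t") == {"id": [1, 2, 3]}
```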
+
+## In More Detail
+
+As mentioned, ADBC is an API standard.
+You can read the [C/C++ header][adbc.h], [Go interfaces][adbc-go], and the 
[Java interfaces][adbc-core] for yourself.
+The project has built several libraries around this standard:
+
+- Driver implementations implement the API standard on top of a particular 
protocol or vendor.
+  See the drivers for [libpq (PostgreSQL)][adbc-libpq] and [JDBC][adbc-jdbc].
+- The driver manager library provides an easy way to use multiple drivers from a single program and to load drivers at runtime if desired.
+
+This infrastructure is reusable by other drivers/vendors and accessible to 
other languages.
+For example, the Python bindings shown above are actually generic Python 
bindings for any ADBC driver, courtesy of the driver manager.
+
+ADBC builds on several existing Arrow projects.
+
+- The [Arrow C Data Interface][c-data-interface] provides a cross-language way 
to pass Arrow data between ADBC implementations and ADBC users.
+- [Flight SQL][flight-sql] is used to implement an ADBC driver, making Flight 
SQL accessible to languages like Python.
+- [Nanoarrow][nanoarrow] is used to implement some ADBC drivers, making them 
lightweight and independent of the core Arrow libraries.
+
+Also, ADBC provides API support for the emerging [Substrait][substrait] 
specification.
+Applications can pass queries either as strings (of SQL) or binary blobs (of 
serialized Substrait plans).
+
+## What about {Flight RPC, Flight SQL, JDBC, ODBC, …}?
+
+ADBC fills a specific niche that related projects do not address. It is both:
+
+- **Arrow-native**: ADBC can pass through Arrow data with no overhead thanks 
to the [C Data Interface][c-data-interface].
+  JDBC is row-oriented, and ODBC has implementation caveats, as discussed, 
that make it hard to use with Arrow.
+- **Vendor-agnostic**: ADBC drivers can implement the API using any underlying 
protocol, while Flight SQL requires server-side support that may not be easy to 
add.
+
+Plotted along these axes, we see that ADBC is both vendor-agnostic and 
Arrow-native, while Flight SQL is vendor-specific, and JDBC/ODBC are 
row-oriented:
+
+<figure style="text-align: center;">
+  <img src="{{ site.baseurl }}/img/ADBCQuadrants.svg"
+       alt="ADBC is Arrow-native and database-agnostic, while Flight SQL is 
Arrow-native and database-specific. JDBC and ODBC are row-oriented and 
database-agnostic, and wire protocols like Postgres's and SQL Server's are 
row-oriented and database-specific."
+       width="90%" class="img-responsive">
+</figure>
+
+**ADBC doesn't intend to replace JDBC or ODBC in general**, but we think ADBC 
makes more sense for applications that just want bulk columnar data access.
+

Review Comment:
   This seems almost like the benefits statement: Eg. "ADBC is an alternative 
to JDBC or ODBC for applications that need efficient bulk columnar data 
access...." Or even, "...an efficient, columnar alternative to JDBC/ODBC for 
analytical (data science?) applications." 



##########
_posts/2022-12-31-arrow-adbc.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: post
+title: "Introducing ADBC: Database Access for Apache Arrow"
+date: "2022-12-31 00:00:00"
+author: pmc
+categories: [application]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Arrow community has accepted version 1.0.0 of the [Arrow Database 
Connectivity (ADBC)][adbc] specification.
+ADBC is an API standard for Arrow-based database access.
+It defines abstract APIs in C, Go, and Java for performing common database 
tasks using Arrow data, like executing queries and getting basic metadata.

Review Comment:
   Hmmm. if only there was a way to use this from Python. (I see below that 
there is, but I might have stopped reading here if that was the only language I 
care about.)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
