Re: [DISC] Improving Arrow's database support

David Li Fri, 19 Aug 2022 11:11:45 -0700

Since it's been a while, I'd like to give an update. There are also a few 
questions I have around distribution.


Currently:
- ADBC is supported in C, Java, and Python.
- For C/Python, there are basic drivers wrapping Flight SQL and SQLite, with a 
draft of a libpq (Postgres) driver (using nanoarrow).
- For Java, there are drivers wrapping JDBC and Flight SQL.
- For Python, there are low-level bindings to the C API, and a DBAPI interface 
on top of that (plus a few extension methods resembling DuckDB/Turbodbc).
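
For readers unfamiliar with the DBAPI shape mentioned above, here is a minimal
sketch of a PEP 249 (DBAPI 2.0) session. It uses Python's built-in sqlite3
module as a stand-in, not the ADBC bindings themselves, and the table name and
values are made up for illustration. The ADBC layer aims for the same
connect/cursor/execute/fetch pattern; the extension methods are not shown.

```python
import sqlite3


def run_demo():
    # Stand-in for an ADBC connection: any PEP 249 module exposes connect().
    conn = sqlite3.connect(":memory:")
    try:
        cur = conn.cursor()
        cur.execute("CREATE TABLE ints (x INTEGER)")
        # Bound parameters use the module's paramstyle ("qmark" for sqlite3).
        cur.executemany("INSERT INTO ints VALUES (?)", [(1,), (2,), (3,)])
        conn.commit()
        cur.execute("SELECT x FROM ints WHERE x >= ? ORDER BY x", (2,))
        # fetchall() returns row tuples; an Arrow-native API would instead
        # hand back columnar batches.
        return cur.fetchall()
    finally:
        conn.close()


rows = run_demo()
```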
 
There are drafts of integration with Ibis [1], DBI (R), and DuckDB. (I'd like to 
thank Hannes and Kirill for their comments, as well as Antoine, Dewey, and Matt 
here.)

I'd like to have this as part of 10.0.0 in some fashion. However, I'm not sure 
how we would like to handle packaging and distribution. In particular, there 
are several sub-components for each language (the driver manager + the 
drivers), increasing the work. Any thoughts here?
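
To make the split between the driver manager and the drivers concrete, here is
a toy sketch in Python. All names (Driver, DriverManager, register, connect,
the "sqlite" key) are hypothetical and are not taken from adbc.h or any ADBC
package. It only shows the idea that each driver hands over a table of entry
points and the manager dispatches by name, which is why each piece ends up
being a separate thing to package.

```python
import sqlite3
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Driver:
    """Hypothetical table of entry points that a driver exposes."""

    name: str
    connect: Callable[[str], Any]


class DriverManager:
    """Hypothetical manager: knows drivers only through their entry points."""

    def __init__(self) -> None:
        self._drivers: Dict[str, Driver] = {}

    def register(self, driver: Driver) -> None:
        self._drivers[driver.name] = driver

    def connect(self, driver_name: str, uri: str) -> Any:
        try:
            driver = self._drivers[driver_name]
        except KeyError:
            raise LookupError(f"no driver registered as {driver_name!r}") from None
        # Dispatch to whichever driver was registered under this name.
        return driver.connect(uri)


manager = DriverManager()
# sqlite3.connect stands in for a real driver's entry point.
manager.register(Driver(name="sqlite", connect=sqlite3.connect))

conn = manager.connect("sqlite", ":memory:")
value = conn.execute("SELECT 1 + 1").fetchone()[0]
conn.close()
```

In this sketch the application depends only on the manager, so drivers could
be shipped and versioned on their own.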

I'm also wondering how we want to handle this in terms of specification - I 
assume we'd consider the core header file/Java interfaces a spec like the C 
Data Interface/Flight RPC, and vote on them/mirror them into the format/ 
directory?

I'm hoping that longer term, most of the drivers would be maintained outside 
the community, and we would just distribute the driver managers and 'core' 
drivers (Flight SQL, probably Acero and JDBC/ODBC wrappers). There's also a lot 
of potential follow-up work, including integration into more systems (e.g. 
Arrow Dataset/Acero, pandas.read_sql, Spark DataSourceV2), more drivers (e.g. 
recycling pgeon/pg2arrow, Turbodbc, and/or the Arrow Hiveserver client; 
FreeTDS/SQL Server; BigQuery Storage; etc.); setting up benchmarks and 
integration tests, etc.

[1]: https://github.com/ibis-project/ibis/pull/4267

-David

On Wed, Jun 1, 2022, at 17:52, David Li wrote:
> I've set up the new repo and enabled issues. I still need to get things 
> building independently of Arrow, but now adbc.h is self-contained and 
> the "driver manager" being prototyped can also be built and used 
> independently of Arrow.
>
> On Wed, Jun 1, 2022, at 13:55, David Li wrote:
>> Wes: thanks! I'll move things over and update the list.
>>
>> Gavin: I mean more that ADBC won't support every little feature in 
>> JDBC/ODBC, or won't necessarily make it easy to support certain things 
>> (e.g. updating a single row in a ResultSet). But it's not that OLTP is 
>> taboo, it's just not what is being optimized for. 
>>
>> For instance it would be nice to eventually have JDBC/ODBC drivers that 
>> can wrap ADBC in much the same way that Dremio is working on a JDBC 
>> driver for Flight SQL. But especially in the near term, ADBC just won't 
>> have the feature set to make that possible.
>>
>> What sorts of use cases were you thinking about, though?
>>
>> On Wed, Jun 1, 2022, at 13:18, Gavin Ray wrote:
>>> This sounds great, but I had one question:
>>>
>>> I read the initial ADBC proposal, and it mentioned that OLTP was not a
>>> targeted use case.
>>> If this work is intended to take on the role of a sort of standard ABI/SDK,
>>> does that mean that building OLTP-oriented drivers/tooling with it is off
>>> the table?
>>>
>>> On Wed, Jun 1, 2022 at 11:11 AM Wes McKinney <[email protected]> wrote:
>>>
>>>> I went ahead and created
>>>>
>>>> https://github.com/apache/arrow-adbc
>>>>
>>>> I directed issue comments / PRs to issues@
>>>>
>>>> On Tue, May 31, 2022 at 8:49 PM Wes McKinney <[email protected]> wrote:
>>>> >
>>>> > I think spinning up a new repository while this exploratory work
>>>> > progresses is a fine idea — perhaps apache/arrow-dbc / arrow-adbc or
>>>> > similar (the name can always be changed later). That would bubble up
>>>> > discussions in a way that's easier for people to follow (watching your
>>>> > fork isn't ideal!). If it makes sense to move code later, it can
>>>> > always be moved.
>>>> >
>>>> >
>>>> > On Tue, May 31, 2022 at 1:02 PM David Li <[email protected]> wrote:
>>>> > >
>>>> > > Some updates:
>>>> > >
>>>> > > The proposal is being updated based on feedback from contributors to
>>>> > > DuckDB and DBI. We've been using GitHub issues on the fork to discuss the
>>>> > > API design and how to implement data ingestion/bound parameters:
>>>> > > https://github.com/lidavidm/arrow/issues
>>>> > >
>>>> > > If anyone has suggestions/ideas/questions, or would like to jump in as
>>>> > > well, please feel free to chime in there too.
>>>> > >
>>>> > > I have also been wondering if we might want to plan to split off a new
>>>> > > repo for this work? In particular, some components might be easiest to
>>>> > > consume if they didn't also have a hard dependency on the Arrow C++
>>>> > > libraries. And we could use the repo to manage contributed drivers (some of
>>>> > > which may individually leverage the Arrow libraries). Of course,
>>>> > > maintaining a parallel build system, setting up releases, etc. is also a
>>>> > > lot of work.
>>>> > >
>>>> > > -David
>>>> > >
>>>> > > On Tue, Apr 26, 2022, at 15:01, Wes McKinney wrote:
>>>> > > > I don't have major new things to add on this topic except that I've
>>>> > > > long had the aspiration of creating something like Python's DBAPI 2.0
>>>> > > > [1] at the C or C++ level to enable a measure of API standardization
>>>> > > > for Arrow-native read/write interfaces with database drivers. It seems
>>>> > > > like a natural complement to the wire-protocol standardization work
>>>> > > > with FlightSQL. I had previously brought in some code that I had
>>>> > > > worked on related to interfacing with the HiveServer2 wire protocol
>>>> > > > (for Hive and Impala, or other HS2-compatible query engines) with the
>>>> > > > intention of prototyping but never was able to find the time.
>>>> > > >
>>>> > > > From an external messaging standpoint, one thing that will be
>>>> > > > important is to assert that this is not intended to displace or
>>>> > > > deprecate ODBC or JDBC drivers. In fact, I would hope that the
>>>> > > > Arrow-native APIs could be added somehow to existing driver libraries
>>>> > > > where it made sense, so that if they are used in an application that
>>>> > > > uses Arrow, they can opt in to using the Arrow-based APIs for getting
>>>> > > > result sets, or doing bulk inserts, etc.
>>>> > > >
>>>> > > > [1]: https://peps.python.org/pep-0249/
>>>> > > >
>>>> > > > On Tue, Apr 26, 2022 at 12:36 PM Antoine Pitrou <[email protected]> wrote:
>>>> > > >>
>>>> > > >>
>>>> > > >> Do we want something more flexible than dlopen() and runtime symbol
>>>> > > >> lookup (a mechanism which constrains the way you can organize and
>>>> > > >> distribute drivers)?
>>>> > > >>
>>>> > > >> For example, perhaps we could expose an API struct of function pointers
>>>> > > >> that could be obtained through driver-specific means.
>>>> > > >>
>>>> > > >>
>>>> > > >> Le 26/04/2022 à 18:29, David Li a écrit :
>>>> > > >> > Hello,
>>>> > > >> >
>>>> > > >> > In light of recent efforts around Flight SQL, projects like pgeon
>>>> > > >> > [1], and long-standing tickets/discussions about database support in Arrow
>>>> > > >> > [2], it seems there's an opportunity to define standard database interfaces
>>>> > > >> > for Arrow that could unify these efforts. So we've put together a proposal
>>>> > > >> > for "ADBC", a common Arrow-based database client API:
>>>> > > >> >
>>>> > > >> > https://docs.google.com/document/d/1t7NrC76SyxL_OffATmjzZs2xcj1owdUsIF2WKL_Zw1U/edit#heading=h.r6o6j2navi4c
>>>> > > >> >
>>>> > > >> > A common API and implementations could help combine/simplify
>>>> > > >> > client-side projects like pgeon, or what DBI is considering [3], and help
>>>> > > >> > them take advantage of developments like Flight SQL and existing columnar
>>>> > > >> > APIs.
>>>> > > >> >
>>>> > > >> > We'd appreciate any feedback. (Comments should be open, please
>>>> > > >> > let me know if not.)
>>>> > > >> >
>>>> > > >> > [1]: https://github.com/0x0L/pgeon
>>>> > > >> > [2]: https://issues.apache.org/jira/browse/ARROW-11670
>>>> > > >> > [3]: https://github.com/r-dbi/dbi3/issues/48
>>>> > > >> >
>>>> > > >> > Thanks,
>>>> > > >> > David
>>>>
