Re: [DISC] Improving Arrow's database support

David Li Thu, 08 Sep 2022 09:05:43 -0700

Thanks all, I've updated the header with the proposed versioning scheme.

At this point I believe the core definitions are ready. (Note that I'm 
explicitly punting on [1][2][3] here.) Absent further comments, I'd like to do 
the following:


- Start a vote on mirroring adbc.h to arrow/format, as well as adding
docs/source/format/ADBC.rst that describes the header, the Java interface, the 
Go interface, and the versioning scheme (I will put up a PR beforehand)
- Begin work on CI/packaging, with a release hopefully coinciding with Arrow 
10.0.0
- Begin work on changes to the main repository, also hopefully in time for 
10.0.0 (moving the Flight SQL driver to be part of apache/arrow; exposing it in 
PyArrow; possibly also exposing Acero via ADBC)

[1]: https://github.com/apache/arrow-adbc/issues/46
[2]: https://github.com/apache/arrow-adbc/issues/55
[3]: https://github.com/apache/arrow-adbc/issues/59
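
As a rough illustration of the independent versioning scheme quoted below
(each component is versioned on its own, and the major version is bumped only
for backward-incompatible changes), here is a minimal Python sketch. The
Version class, its bump method, and the change labels "breaking", "feature",
and "fix" are hypothetical names chosen for illustration; none of them are
part of ADBC or its tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class Version:
    """Hypothetical major.minor.patch version, for illustration only."""

    major: int
    minor: int
    patch: int

    def bump(self, change: str) -> "Version":
        # Assumed labels: "breaking", "feature", "fix" (not from ADBC itself).
        if change == "breaking":
            return Version(self.major + 1, 0, 0)
        if change == "feature":
            return Version(self.major, self.minor + 1, 0)
        if change == "fix":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown change type: {change!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


# Every component starts at 1.0.0 and is then versioned independently.
versions = {
    "ADBC (the API standard)": Version(1, 0, 0),
    "adbc_driver_manager": Version(1, 0, 0),
    "adbc_driver_postgres": Version(1, 0, 0),
}

# The sequence of changes from the quoted example.
steps = [
    ("adbc_driver_postgres", "feature"),    # 1.0.0 -> 1.1.0
    ("adbc_driver_manager", "fix"),         # 1.0.0 -> 1.0.1
    ("adbc_driver_manager", "breaking"),    # 1.0.1 -> 2.0.0
    ("ADBC (the API standard)", "feature"), # 1.0.0 -> 1.1.0
]

for component, change in steps:
    versions[component] = versions[component].bump(change)
    print(f"Release {component} {versions[component]}")
```

Keeping one version per component means that a breaking change in the driver
manager does not force a new major version of the API standard or of the
other drivers.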

On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
> +1 from me on the strategy proposed by Kou.
>
> That would be my preference also. I agree it is preferable to be versioned
> independently.
>
> --Matt
>
> On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <[email protected]> wrote:
>
>> Hi,
>>
>> > Do we have a preference for versioning strategy? Should we
>> > proceed in lockstep with the Arrow C++ library et al. and
>> > release "ADBC 1.0.0" (the API standard) with "drivers
>> > version 10.0.0", or use an independent versioning scheme?
>> > (For example, release API standard and components at
>> > "1.0.0". Then further releases of components that do not
>> > change the spec would be "1.1", "1.2", ...; if/when we
>> > change the spec, start over with "2.0", "2.1", ...)
>>
>> I like an independent versioning scheme. I assume that ADBC
>> doesn't need backward-incompatible changes frequently. How
>> about incrementing the major version only when ADBC needs
>> a backward-incompatible change?
>>
>> e.g.:
>>
>>   1.  Release ADBC (the API standard) 1.0.0
>>   2.  Release adbc_driver_manager 1.0.0
>>   3.  Release adbc_driver_postgres 1.0.0
>>   4.  Add a new feature to adbc_driver_postgres without
>>       any backward incompatible changes
>>   5.  Release adbc_driver_postgres 1.1.0
>>   6.  Fix a bug in adbc_driver_manager without
>>       any backward incompatible changes
>>   7.  Release adbc_driver_manager 1.0.1
>>   8.  Add a backward incompatible change to adbc_driver_manager
>>   9.  Release adbc_driver_manager 2.0.0
>>   10. Add a new feature to ADBC without any
>>       backward incompatible changes
>>   11. Release ADBC (the API standard) 1.1.0
>>
>>
>> Thanks,
>> --
>> kou
>>
>> In <[email protected]>
>>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 2022
>> 16:36:43 -0400,
>>   "David Li" <[email protected]> wrote:
>>
>> > Following up here with some specific questions:
>> >
>> > Matt Topol added some Go definitions [1] (thanks!) I'd assume we want to
>> > vote on those as well?
>> >
>> > How should the process work for Java/Go? For C/C++, I assume we'd treat
>> > it like the C Data Interface and copy adbc.h to format/ after a vote, and
>> > then vote on releases of components. Or do we really only consider the C
>> > header as the 'format', with the others being language-specific affordances?
>> >
>> > What about for Java and for Go? We could vote on and tag a release for
>> > Go, and add a documentation page that links to the Java/Go definitions at a
>> > specific revision (as the equivalent 'format' definition for Java/Go)? Or
>> > would we vendor the entire Java module/Go package as the 'format'?
>> >
>> > Do we have a preference for versioning strategy? Should we proceed in
>> > lockstep with the Arrow C++ library et al. and release "ADBC 1.0.0" (the
>> > API standard) with "drivers version 10.0.0", or use an independent
>> > versioning scheme? (For example, release API standard and components at
>> > "1.0.0". Then further releases of components that do not change the spec
>> > would be "1.1", "1.2", ...; if/when we change the spec, start over with
>> > "2.0", "2.1", ...)
>> >
>> > [1]: https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go
>> >
>> > -David
>> >
>> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>> >> Hi,
>> >>
>> >> OK. I'll send pull requests for GLib and Ruby soon.
>> >>
>> >>> I'm curious if you have a particular use case in mind.
>> >>
>> >> I don't have any production-ready use case yet, but I want to
>> >> implement an Active Record adapter for ADBC. Active Record
>> >> is the O/R mapper for Ruby on Rails. Building web
>> >> applications with Ruby on Rails is one of the major Ruby use
>> >> cases, so providing an Active Record interface for ADBC should
>> >> increase the number of Apache Arrow users in the Ruby community.
>> >>
>> >> NOTE: Generally, Ruby on Rails users don't process large
>> >> data, but they sometimes need to process large (medium?) data
>> >> in a batch process. An Active Record adapter for ADBC may be
>> >> useful for such use cases.
>> >>
>> >>> There's a little bit more API cleanup to do [1]. If you
>> >>> have comments on that or anything else, I'd appreciate
>> >>> them. Otherwise, pull requests would also be appreciated.
>> >>
>> >> OK. I'll open issues/pull requests when I find
>> >> something. For now, I think that a "MODULE" library,
>> >> rather than a "SHARED" library in CMake terminology
>> >> [cmake], is better for driver modules. (I'll open an issue
>> >> for this later.)
>> >>
>> >> [cmake]: https://cmake.org/cmake/help/latest/command/add_library.html
>> >>
>> >>
>> >> Thanks,
>> >> --
>> >> kou
>> >>
>> >> In <[email protected]>
>> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 Aug 2022
>> >> 15:28:56 -0400,
>> >>   "David Li" <[email protected]> wrote:
>> >>
>> >>> I would be very happy to see GLib/Ruby bindings! I'm curious if you
>> >>> have a particular use case in mind.
>> >>>
>> >>> There's a little bit more API cleanup to do [1]. If you have comments
>> >>> on that or anything else, I'd appreciate them. Otherwise, pull requests
>> >>> would also be appreciated.
>> >>>
>> >>> [1]: https://github.com/apache/arrow-adbc/issues/79
>> >>>
>> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>> >>>> Hi,
>> >>>>
>> >>>> Thanks for sharing the current status!
>> >>>> I understand.
>> >>>>
>> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>> >>>> before we release the first version? (I want to use ADBC
>> >>>> from Ruby.) Or should I wait for the first release? If I can
>> >>>> work on it now, I'll open pull requests for it.
>> >>>>
>> >>>> Thanks,
>> >>>> --
>> >>>> kou
>> >>>>
>> >>>> In <[email protected]>
>> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 26 Aug 2022
>> >>>> 11:03:26 -0400,
>> >>>>   "David Li" <[email protected]> wrote:
>> >>>>
>> >>>>> Thank you Kou!
>> >>>>>
>> >>>>> At least initially, I don't think I'll be able to complete the
>> >>>>> Dataset integration in time. So 10.0.0 probably won't ship with a hard
>> >>>>> dependency. That said I am hoping to have PyArrow take an optional
>> >>>>> dependency (so Flight SQL can finally be available from Python).
>> >>>>>
>> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>> >>>>>> Hi,
>> >>>>>>
>> >>>>>> As a maintainer of Linux packages, I want apache/arrow-adbc
>> >>>>>> to be released before apache/arrow is released so that
>> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>> >>>>>> .deb/.rpm.
>> >>>>>>
>> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>> >>>>>>
>> >>>>>> We can add .deb/.rpm related files
>> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>> >>>>>> apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.
>> >>>>>>
>> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>> >>>>>>
>> >>>>>> * https://github.com/datafusion-contrib/datafusion-c/tree/main/package
>> >>>>>> * https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml
>> >>>>>>
>> >>>>>> I can work on it in apache/arrow-adbc.
>> >>>>>>
>> >>>>>>
>> >>>>>> Thanks,
>> >>>>>> --
>> >>>>>> kou
>> >>>>>>
>> >>>>>> In <[email protected]>
>> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022
>> >>>>>> 11:51:08 -0400,
>> >>>>>>   "David Li" <[email protected]> wrote:
>> >>>>>>
>> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry for the
>> >>>>>>> wall of text that follows…)
>> >>>>>>>
>> >>>>>>> These are the components:
>> >>>>>>>
>> >>>>>>> - Core adbc.h header
>> >>>>>>> - Driver manager for C/C++
>> >>>>>>> - Flight SQL-based driver
>> >>>>>>> - Postgres-based driver (WIP)
>> >>>>>>> - SQLite-based driver (more of a testbed for me than an actual
>> >>>>>>>   component - I don't think we'd actually distribute this)
>> >>>>>>> - Java core interfaces
>> >>>>>>> - Java driver manager
>> >>>>>>> - Java JDBC-based driver
>> >>>>>>> - Java Flight SQL-based driver
>> >>>>>>> - Python driver manager
>> >>>>>>>
>> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL
>> >>>>>>> drivers get moved to the main Arrow repo and distributed as part of
>> >>>>>>> the regular Arrow releases.
>> >>>>>>>
>> >>>>>>> For the rest of the components: they could be packaged
>> >>>>>>> individually, but versioned and released together. Also, each C/C++
>> >>>>>>> driver probably needs a corresponding Python package so Python users
>> >>>>>>> do not have to futz with shared library configurations. (See [1].)
>> >>>>>>> So for instance, installing PyArrow would also give you the Flight
>> >>>>>>> SQL driver, and
>> >>>>>>> `pip install adbc_postgres` would get you the Postgres-based driver.
>> >>>>>>>
>> >>>>>>> That would mean setting up separate CI, release, etc. (and
>> >>>>>>> eventually linking Crossbow & Conbench as well?). That does mean
>> >>>>>>> duplication of effort, but the trade-off is avoiding bloating the
>> >>>>>>> main release process even further. However, I'd like to hear from
>> >>>>>>> those closer to the release process on this subject - if it would
>> >>>>>>> make people's lives easier, we could merge everything into one
>> >>>>>>> repo/process.
>> >>>>>>>
>> >>>>>>> Integrations would be distributed as part of their respective
>> >>>>>>> packages (e.g. Arrow Dataset would optionally link to the driver
>> >>>>>>> manager). So the "part of Arrow 10.0.0" aspect means having a stable
>> >>>>>>> interface for adbc.h, and getting the Flight SQL drivers into the
>> >>>>>>> main repo.
>> >>>>>>>
>> >>>>>>> [1]: https://github.com/apache/arrow-adbc/issues/53
>> >>>>>>>
>> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>> >>>>>>>> "David Li" <[email protected]> wrote:
>> >>>>>>>>> Since it's been a while, I'd like to give an update. There are
>> >>>>>>>>> also a few questions I have around distribution.
>> >>>>>>>>>
>> >>>>>>>>> Currently:
>> >>>>>>>>> - Supported in C, Java, and Python.
>> >>>>>>>>> - For C/Python, there are basic drivers wrapping Flight SQL and
>> >>>>>>>>>   SQLite, with a draft of a libpq (Postgres) driver (using
>> >>>>>>>>>   nanoarrow).
>> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>> >>>>>>>>> - For Python, there are low-level bindings to the C API, and the
>> >>>>>>>>>   DBAPI interface on top of that (+a few extension methods
>> >>>>>>>>>   resembling DuckDB/Turbodbc).
>> >>>>>>>>>
>> >>>>>>>>> There are drafts of integration with Ibis [1], DBI (R), and
>> >>>>>>>>> DuckDB. (I'd like to thank Hannes and Kirill for their comments, as
>> >>>>>>>>> well as Antoine, Dewey, and Matt here.)
>> >>>>>>>>>
>> >>>>>>>>> I'd like to have this as part of 10.0.0 in some fashion.
>> >>>>>>>>> However, I'm not sure how we would like to handle packaging and
>> >>>>>>>>> distribution. In particular, there are several sub-components for
>> >>>>>>>>> each language (the driver manager + the drivers), increasing the
>> >>>>>>>>> work. Any thoughts here?
>> >>>>>>>>
>> >>>>>>>> Sorry, forgot to answer here. But I think your question is too
>> >>>>>>>> broadly formulated. It probably deserves a case-by-case discussion,
>> >>>>>>>> IMHO.
>> >>>>>>>>
>> >>>>>>>>> I'm also wondering how we want to handle this in terms of
>> >>>>>>>>> specification - I assume we'd consider the core header file/Java
>> >>>>>>>>> interfaces a spec like the C Data Interface/Flight RPC, and vote on
>> >>>>>>>>> them/mirror them into the format/ directory?
>> >>>>>>>>
>> >>>>>>>> That sounds like the right way to me indeed.
>> >>>>>>>>
>> >>>>>>>> Regards
>> >>>>>>>>
>> >>>>>>>> Antoine.
>>
