Re: [DISC] Improving Arrow's database support

David Li Mon, 12 Sep 2022 09:44:40 -0700

I like this idea. I would also like to set up some sort of automated ABI 
checker (the options I found were GPL/LGPL, so I need to figure out how 
to proceed).


I can put up a PR later that formalizes these guidelines in CONTRIBUTING.md. It 
looks like there's a pre-commit hook for this sort of thing too, which'll let 
us enforce it in CI!
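
To make the automation idea concrete, here is a minimal Python sketch of how a
tool could derive the semver bump from conventional commit messages. The names
(bump_for, next_version) are hypothetical, the parsing is simplified
(lowercase types only), and it is not how any particular release tool works.

```python
import re

# Conventional Commits header: type, optional (scope), optional "!", ": ", description.
# Simplified assumption: types are lowercase ASCII words.
HEADER = re.compile(
    r"^(?P<type>[a-z]+)(?:\((?P<scope>[^)]+)\))?(?P<breaking>!)?: (?P<desc>.+)$"
)

# Ordering used to pick the largest bump among several commits.
RANK = {None: 0, "patch": 1, "minor": 2, "major": 3}


def bump_for(message):
    """Return 'major', 'minor', 'patch', or None for one commit message."""
    lines = message.splitlines()
    match = HEADER.match(lines[0]) if lines else None
    if match is None:
        # Not a conventional commit; a commit-msg hook could reject it instead.
        return None
    footer_breaking = any(
        line.startswith(("BREAKING CHANGE:", "BREAKING-CHANGE:"))
        for line in lines[1:]
    )
    if match["breaking"] or footer_breaking:
        return "major"
    # feat -> minor, fix -> patch, anything else (docs, chore, ...) -> no bump.
    return {"feat": "minor", "fix": "patch"}.get(match["type"])


def next_version(current, messages):
    """Apply the largest bump implied by `messages` to a 'X.Y.Z' version string."""
    bump = max((bump_for(m) for m in messages), key=RANK.__getitem__, default=None)
    major, minor, patch = (int(part) for part in current.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    if bump == "patch":
        return f"{major}.{minor}.{patch + 1}"
    return current


if __name__ == "__main__":
    print(next_version("1.0.0", ["feat(postgres): add a new feature"]))
    print(next_version("1.0.0", ["fix(manager): fix a bug"]))
```

The same bump_for check could also back a commit-msg hook that rejects
non-conforming messages, which is roughly what enforcing this in CI would mean.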

On Mon, Sep 12, 2022, at 10:18, Matthew Topol wrote:
> Automated semver would be ideal if we can do it.
>
> There are quite a few utilities that would automatically 
> handle the versioning if we're using conventional commits.
>
> On Mon, Sep 12 2022 at 02:26:15 PM +0200, Jacob Wujciak 
> <[email protected]> wrote:
>> +1 to independent, semver versioning for ADBC.
>> I would propose we use conventional commit style [1] commit messages for
>> the PR commits (I assume squash + merge) so we can automate the
>> versioning, or at least double-check manual versioning.
>> 
>> [1]: <https://www.conventionalcommits.org/>
>> 
>> On Thu, Sep 8, 2022 at 6:05 PM David Li <[email protected]> wrote:
>> 
>>>  Thanks all, I've updated the header with the proposed versioning 
>>> scheme.
>>> 
>>>  At this point I believe the core definitions are ready. (Note that 
>>> I'm
>>>  explicitly punting on [1][2][3] here.) Absent further comments, I'd 
>>> like to
>>>  do the following:
>>> 
>>>  - Start a vote on mirroring adbc.h to arrow/format, as well as adding
>>>  docs/source/format/ADBC.rst that describes the header, the Java 
>>> interface,
>>>  the Go interface, and the versioning scheme (I will put up a PR 
>>> beforehand)
>>>  - Begin work on CI/packaging, with a release hopefully coinciding 
>>> with
>>>  Arrow 10.0.0
>>>  - Begin work on changes to the main repository, also hopefully in 
>>> time for
>>>  10.0.0 (moving the Flight SQL driver to be part of apache/arrow; 
>>> exposing
>>>  it in PyArrow; possibly also exposing Acero via ADBC)
>>> 
>>>  [1]: <https://github.com/apache/arrow-adbc/issues/46>
>>>  [2]: <https://github.com/apache/arrow-adbc/issues/55>
>>>  [3]: <https://github.com/apache/arrow-adbc/issues/59>
>>> 
>>>  On Sat, Sep 3, 2022, at 18:36, Matthew Topol wrote:
>>>  > +1 from me on the strategy proposed by Kou.
>>>  >
>>>  > That would be my preference also. I agree it is preferable to be
>>>  versioned
>>>  > independently.
>>>  >
>>>  > --Matt
>>>  >
>>>  > On Sat, Sep 3, 2022, 6:24 PM Sutou Kouhei <[email protected]> wrote:
>>>  >
>>>  >> Hi,
>>>  >>
>>>  >> > Do we have a preference for versioning strategy? Should we
>>>  >> > proceed in lockstep with the Arrow C++ library et al. and
>>>  >> > release "ADBC 1.0.0" (the API standard) with "drivers
>>>  >> > version 10.0.0", or use an independent versioning scheme?
>>>  >> > (For example, release API standard and components at
>>>  >> > "1.0.0". Then further releases of components that do not
>>>  >> > change the spec would be "1.1", "1.2", ...; if/when we
>>>  >> > change the spec, start over with "2.0", "2.1", ...)
>>>  >>
>>>  >> I like an independent versioning scheme. I assume that ADBC
>>>  >> doesn't need backward-incompatible changes frequently. How
>>>  >> about incrementing the major version only when ADBC needs
>>>  >> a backward-incompatible change?
>>>  >>
>>>  >> e.g.:
>>>  >>
>>>  >>   1.  Release ADBC (the API standard) 1.0.0
>>>  >>   2.  Release adbc_driver_manager 1.0.0
>>>  >>   3.  Release adbc_driver_postgres 1.0.0
>>>  >>   4.  Add a new feature to adbc_driver_postgres without
>>>  >>       any backward incompatible changes
>>>  >>   5.  Release adbc_driver_postgres 1.1.0
>>>  >>   6.  Fix a bug in adbc_driver_manager without
>>>  >>       any backward incompatible changes
>>>  >>   7.  Release adbc_driver_manager 1.0.1
>>>  >>   8.  Add a backward incompatible change to adbc_driver_manager
>>>  >>   9.  Release adbc_driver_manager 2.0.0
>>>  >>   10. Add a new feature to ADBC without any
>>>  >>       backward incompatible changes
>>>  >>   11. Release ADBC (the API standard) 1.1.0
>>>  >>
>>>  >>
>>>  >> Thanks,
>>>  >> --
>>>  >> kou
>>>  >>
>>>  >> In <[email protected] 
>>> <mailto:[email protected]>>
>>>  >>   "Re: [DISC] Improving Arrow's database support" on Thu, 01 Sep 
>>> 2022
>>>  >> 16:36:43 -0400,
>>>  >>   "David Li" <[email protected] <mailto:[email protected]>> 
>>> wrote:
>>>  >>
>>>  >> > Following up here with some specific questions:
>>>  >> >
>>>  >> > Matt Topol added some Go definitions [1] (thanks!) I'd assume 
>>> we want
>>>  to
>>>  >> vote on those as well?
>>>  >> >
>>>  >> > How should the process work for Java/Go? For C/C++, I assume 
>>> we'd
>>>  treat
>>>  >> it like the C Data Interface and copy adbc.h to format/ after a 
>>> vote,
>>>  and
>>>  >> then vote on releases of components. Or do we really only 
>>> consider the C
>>>  >> header as the 'format', with the others being language-specific
>>>  affordances?
>>>  >> >
>>>  >> > What about for Java and for Go? We could vote on and tag a 
>>> release for
>>>  >> Go, and add a documentation page that links to the Java/Go 
>>> definitions
>>>  at a
>>>  >> specific revision (as the equivalent 'format' definition for 
>>> Java/Go)?
>>>  Or
>>>  >> would we vendor the entire Java module/Go package as the 
>>> 'format'?
>>>  >> >
>>>  >> > Do we have a preference for versioning strategy? Should we 
>>> proceed in
>>>  >> lockstep with the Arrow C++ library et al. and release "ADBC 
>>> 1.0.0"
>>>  (the
>>>  >> API standard) with "drivers version 10.0.0", or use an 
>>> independent
>>>  >> versioning scheme? (For example, release API standard and 
>>> components at
>>>  >> "1.0.0". Then further releases of components that do not change 
>>> the spec
>>>  >> would be "1.1", "1.2", ...; if/when we change the spec, start 
>>> over with
>>>  >> "2.0", "2.1", ...)
>>>  >> >
>>>  >> > [1]: <https://github.com/apache/arrow-adbc/blob/main/go/adbc/adbc.go>
>>>  >> >
>>>  >> > -David
>>>  >> >
>>>  >> > On Sun, Aug 28, 2022, at 10:56, Sutou Kouhei wrote:
>>>  >> >> Hi,
>>>  >> >>
>>>  >> >> OK. I'll send pull requests for GLib and Ruby soon.
>>>  >> >>
>>>  >> >>> I'm curious if you have a particular use case in mind.
>>>  >> >>
>>>  >> >> I don't have a production-ready use case yet, but I want to
>>>  >> >> implement an Active Record adapter for ADBC. Active Record
>>>  >> >> is the O/R mapper for Ruby on Rails. Building web
>>>  >> >> applications with Ruby on Rails is one of the major Ruby use
>>>  >> >> cases, so providing an Active Record interface for ADBC will
>>>  >> >> increase the number of Apache Arrow users in the Ruby community.
>>>  >> >>
>>>  >> >> NOTE: Generally, Ruby on Rails users don't process large
>>>  >> >> data, but they sometimes need to process large (medium?) data
>>>  >> >> in a batch process. An Active Record adapter for ADBC may be
>>>  >> >> useful for such use cases.
>>>  >> >>
>>>  >> >>> There's a little bit more API cleanup to do [1]. If you
>>>  >> >>> have comments on that or anything else, I'd appreciate
>>>  >> >>> them. Otherwise, pull requests would also be appreciated.
>>>  >> >>
>>>  >> >> OK. I'll open issues/pull requests when I find
>>>  >> >> something. For now, I think driver modules should be built
>>>  >> >> as "MODULE" libraries rather than "SHARED" libraries, in
>>>  >> >> CMake terminology [cmake]. (I'll open an issue for this
>>>  >> >> later.)
>>>  >> >>
>>>  >> >> [cmake]: <https://cmake.org/cmake/help/latest/command/add_library.html>
>>>  >> >>
>>>  >> >>
>>>  >> >> Thanks,
>>>  >> >> --
>>>  >> >> kou
>>>  >> >>
>>>  >> >> In <[email protected] 
>>> <mailto:[email protected]>>
>>>  >> >>   "Re: [DISC] Improving Arrow's database support" on Sat, 27 
>>> Aug 2022
>>>  >> >> 15:28:56 -0400,
>>>  >> >>   "David Li" <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>  >> >>
>>>  >> >>> I would be very happy to see GLib/Ruby bindings! I'm curious 
>>> if you
>>>  >> have a particular use case in mind.
>>>  >> >>>
>>>  >> >>> There's a little bit more API cleanup to do [1]. If you have
>>>  comments
>>>  >> on that or anything else, I'd appreciate them. Otherwise, pull 
>>> requests
>>>  >> would also be appreciated.
>>>  >> >>>
>>>  >> >>> [1]: <https://github.com/apache/arrow-adbc/issues/79>
>>>  >> >>>
>>>  >> >>> On Fri, Aug 26, 2022, at 21:53, Sutou Kouhei wrote:
>>>  >> >>>> Hi,
>>>  >> >>>>
>>>  >> >>>> Thanks for sharing the current status!
>>>  >> >>>> I understand.
>>>  >> >>>>
>>>  >> >>>> BTW, can I add GLib/Ruby bindings to apache/arrow-adbc
>>>  >> >>>> before we release the first version? (I want to use ADBC
>>>  >> >>>> from Ruby.) Or should I wait for the first release? If I can
>>>  >> >>>> work on it now, I'll open pull requests for it.
>>>  >> >>>>
>>>  >> >>>> Thanks,
>>>  >> >>>> --
>>>  >> >>>> kou
>>>  >> >>>>
>>>  >> >>>> In <[email protected] 
>>> <mailto:[email protected]>>
>>>  >> >>>>   "Re: [DISC] Improving Arrow's database support" on Fri, 
>>> 26 Aug
>>>  2022
>>>  >> >>>> 11:03:26 -0400,
>>>  >> >>>>   "David Li" <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>  >> >>>>
>>>  >> >>>>> Thank you Kou!
>>>  >> >>>>>
>>>  >> >>>>> At least initially, I don't think I'll be able to complete 
>>> the
>>>  >> Dataset integration in time. So 10.0.0 probably won't ship with 
>>> a hard
>>>  >> dependency. That said I am hoping to have PyArrow take an 
>>> optional
>>>  >> dependency (so Flight SQL can finally be available from Python).
>>>  >> >>>>>
>>>  >> >>>>> On Fri, Aug 26, 2022, at 01:01, Sutou Kouhei wrote:
>>>  >> >>>>>> Hi,
>>>  >> >>>>>>
>>>  >> >>>>>> As a maintainer of Linux packages, I want 
>>> apache/arrow-adbc
>>>  >> >>>>>> to be released before apache/arrow is released so that
>>>  >> >>>>>> apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
>>>  >> >>>>>> .deb/.rpm.
>>>  >> >>>>>>
>>>  >> >>>>>> (If Apache Arrow Dataset uses apache/arrow-adbc,
>>>  >> >>>>>> apache/arrow's .deb/.rpm needs to depend on
>>>  >> >>>>>> apache/arrow-adbc's .deb/.rpm.)
>>>  >> >>>>>>
>>>  >> >>>>>> We can add .deb/.rpm related files
>>>  >> >>>>>> (dev/tasks/linux-packages/ in apache/arrow) to
>>>  >> >>>>>> apache/arrow-adbc to build .deb/.rpm for 
>>> apache/arrow-adbc.
>>>  >> >>>>>>
>>>  >> >>>>>> FYI: I did it for datafusion-contrib/datafusion-c:
>>>  >> >>>>>>
>>>  >> >>>>>> * <https://github.com/datafusion-contrib/datafusion-c/tree/main/package>
>>>  >> >>>>>> * <https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml>
>>>  >> >>>>>>
>>>  >> >>>>>> I can work on it in apache/arrow-adbc.
>>>  >> >>>>>>
>>>  >> >>>>>>
>>>  >> >>>>>> Thanks,
>>>  >> >>>>>> --
>>>  >> >>>>>> kou
>>>  >> >>>>>>
>>>  >> >>>>>> In <[email protected] 
>>> <mailto:[email protected]>>
>>>  >> >>>>>>   "Re: [DISC] Improving Arrow's database support" on Thu, 
>>> 25 Aug
>>>  >> 2022
>>>  >> >>>>>> 11:51:08 -0400,
>>>  >> >>>>>>   "David Li" <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>  >> >>>>>>
>>>  >> >>>>>>> Fair enough, thank you. I'll try to expand a bit. (Sorry 
>>> for the
>>>  >> wall of text that follows…)
>>>  >> >>>>>>>
>>>  >> >>>>>>> These are the components:
>>>  >> >>>>>>>
>>>  >> >>>>>>> - Core adbc.h header
>>>  >> >>>>>>> - Driver manager for C/C++
>>>  >> >>>>>>> - Flight SQL-based driver
>>>  >> >>>>>>> - Postgres-based driver (WIP)
>>>  >> >>>>>>> - SQLite-based driver (more of a testbed for me than an 
>>> actual
>>>  >> component - I don't think we'd actually distribute this)
>>>  >> >>>>>>> - Java core interfaces
>>>  >> >>>>>>> - Java driver manager
>>>  >> >>>>>>> - Java JDBC-based driver
>>>  >> >>>>>>> - Java Flight SQL-based driver
>>>  >> >>>>>>> - Python driver manager
>>>  >> >>>>>>>
>>>  >> >>>>>>> I think: adbc.h gets mirrored into the Arrow repo. The 
>>> Flight
>>>  SQL
>>>  >> drivers get moved to the main Arrow repo and distributed as part 
>>> of the
>>>  >> regular Arrow releases.
>>>  >> >>>>>>>
>>>  >> >>>>>>> For the rest of the components: they could be packaged
>>>  >> individually, but versioned and released together. Also, each 
>>> C/C++
>>>  driver
>>>  >> probably needs a corresponding Python package so Python users do 
>>> not
>>>  have
>>>  >> to futz with shared library configurations. (See [1].) So for 
>>> instance,
>>>  >> installing PyArrow would also give you the Flight SQL driver, 
>>> and `pip
>>>  >> install adbc_postgres` would get you the Postgres-based driver.
>>>  >> >>>>>>>
>>>  >> >>>>>>> That would mean setting up separate CI, release, etc. 
>>> (and
>>>  >> eventually linking Crossbow & Conbench as well?). That does mean
>>>  >> duplication of effort, but the trade off is avoiding bloating 
>>> the main
>>>  >> release process even further. However, I'd like to hear from 
>>> those
>>>  closer
>>>  >> to the release process on this subject - if it would make 
>>> people's lives
>>>  >> easier, we could merge everything into one repo/process.
>>>  >> >>>>>>>
>>>  >> >>>>>>> Integrations would be distributed as part of their 
>>> respective
>>>  >> packages (e.g. Arrow Dataset would optionally link to the driver
>>>  manager).
>>>  >> So the "part of Arrow 10.0.0" aspect means having a stable 
>>> interface for
>>>  >> adbc.h, and getting the Flight SQL drivers into the main repo.
>>>  >> >>>>>>>
>>>  >> >>>>>>> [1]: <https://github.com/apache/arrow-adbc/issues/53>
>>>  >> >>>>>>>
>>>  >> >>>>>>> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>>>  >> >>>>>>>> On Fri, 19 Aug 2022 14:09:44 -0400
>>>  >> >>>>>>>> "David Li" <[email protected] 
>>> <mailto:[email protected]>> wrote:
>>>  >> >>>>>>>>> Since it's been a while, I'd like to give an update. 
>>> There are
>>>  >> also a few questions I have around distribution.
>>>  >> >>>>>>>>>
>>>  >> >>>>>>>>> Currently:
>>>  >> >>>>>>>>> - Supported in C, Java, and Python.
>>>  >> >>>>>>>>> - For C/Python, there are basic drivers wrapping 
>>> Flight SQL
>>>  and
>>>  >> SQLite, with a draft of a libpq (Postgres) driver (using 
>>> nanoarrow).
>>>  >> >>>>>>>>> - For Java, there are drivers wrapping JDBC and Flight 
>>> SQL.
>>>  >> >>>>>>>>> - For Python, there are low-level bindings to the C API, 
>>> and the
>>>  >> DBAPI interface on top of that (+a few extension methods 
>>> resembling
>>>  >> DuckDB/Turbodbc).
>>>  >> >>>>>>>>>
>>>  >> >>>>>>>>> There are drafts of integrations with Ibis [1], DBI (R), 
>>> and
>>>  >> DuckDB. (I'd like to thank Hannes and Kirill for their comments, 
>>> as
>>>  well as
>>>  >> Antoine, Dewey, and Matt here.)
>>>  >> >>>>>>>>>
>>>  >> >>>>>>>>> I'd like to have this as part of 10.0.0 in some 
>>> fashion.
>>>  >> However, I'm not sure how we would like to handle packaging and
>>>  >> distribution. In particular, there are several sub-components 
>>> for each
>>>  >> language (the driver manager + the drivers), increasing the 
>>> work. Any
>>>  >> thoughts here?
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> Sorry, forgot to answer here. But I think your question 
>>> is too
>>>  >> broadly
>>>  >> >>>>>>>> formulated. It probably deserves a case-by-case 
>>> discussion,
>>>  IMHO.
>>>  >> >>>>>>>>
>>>  >> >>>>>>>>> I'm also wondering how we want to handle this in terms 
>>> of
>>>  >> specification - I assume we'd consider the core header file/Java
>>>  interfaces
>>>  >> a spec like the C Data Interface/Flight RPC, and vote on 
>>> them/mirror
>>>  them
>>>  >> into the format/ directory?
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> That sounds like the right way to me indeed.
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> Regards
>>>  >> >>>>>>>>
>>>  >> >>>>>>>> Antoine.
>>>  >>
>>>
