Re: [DISC] Improving Arrow's database support

Sutou Kouhei Thu, 25 Aug 2022 22:01:44 -0700

Hi,

As a maintainer of Linux packages, I want apache/arrow-adbc
to be released before apache/arrow is released so that
apache/arrow's .deb/.rpm can depend on apache/arrow-adbc's
.deb/.rpm.


(If Apache Arrow Dataset uses apache/arrow-adbc,
apache/arrow's .deb/.rpm needs to depend on
apache/arrow-adbc's .deb/.rpm.)

We can add .deb/.rpm-related files
(dev/tasks/linux-packages/ in apache/arrow) to
apache/arrow-adbc to build .deb/.rpm for apache/arrow-adbc.

FYI: I did it for datafusion-contrib/datafusion-c:

* https://github.com/datafusion-contrib/datafusion-c/tree/main/package
* https://github.com/datafusion-contrib/datafusion-c/blob/main/.github/workflows/package.yaml

I can work on it in apache/arrow-adbc.


Thanks,
-- 
kou

In <5cbf2923-4fb4-4c5e-b11d-007209fdd...@www.fastmail.com>
  "Re: [DISC] Improving Arrow's database support" on Thu, 25 Aug 2022 11:51:08 -0400,
  "David Li" <lidav...@apache.org> wrote:

> Fair enough, thank you. I'll try to expand a bit. (Sorry for the wall of text 
> that follows…)
> 
> These are the components:
> 
> - Core adbc.h header
> - Driver manager for C/C++
> - Flight SQL-based driver
> - Postgres-based driver (WIP)
> - SQLite-based driver (more of a testbed for me than an actual component - I 
> don't think we'd actually distribute this)
> - Java core interfaces
> - Java driver manager
> - Java JDBC-based driver
> - Java Flight SQL-based driver
> - Python driver manager
> 
> I think: adbc.h gets mirrored into the Arrow repo. The Flight SQL drivers get 
> moved to the main Arrow repo and distributed as part of the regular Arrow 
> releases.
> 
> For the rest of the components: they could be packaged individually, but 
> versioned and released together. Also, each C/C++ driver probably needs a 
> corresponding Python package so Python users do not have to futz with shared 
> library configurations. (See [1].) So for instance, installing PyArrow would 
> also give you the Flight SQL driver, and `pip install adbc_postgres` would 
> get you the Postgres-based driver.
> 
> That would mean setting up separate CI, release, etc. (and eventually linking 
> Crossbow & Conbench as well?). That does mean duplication of effort, but the 
> trade-off is that it avoids bloating the main release process even further. 
> However, I'd like to hear from those closer to the release process on this 
> subject - if it would make people's lives easier, we could merge everything 
> into one repo/process.
> 
> Integrations would be distributed as part of their respective packages (e.g. 
> Arrow Dataset would optionally link to the driver manager). So the "part of 
> Arrow 10.0.0" aspect means having a stable interface for adbc.h, and getting 
> the Flight SQL drivers into the main repo.
> 
> [1]: https://github.com/apache/arrow-adbc/issues/53
> 
> On Thu, Aug 25, 2022, at 11:34, Antoine Pitrou wrote:
>> On Fri, 19 Aug 2022 14:09:44 -0400
>> "David Li" <lidav...@apache.org> wrote:
>>> Since it's been a while, I'd like to give an update. There are also a few 
>>> questions I have around distribution.
>>> 
>>> Currently:
>>> - Supported in C, Java, and Python.
>>> - For C/Python, there are basic drivers wrapping Flight SQL and SQLite, 
>>> with a draft of a libpq (Postgres) driver (using nanoarrow).
>>> - For Java, there are drivers wrapping JDBC and Flight SQL.
>>> - For Python, there are low-level bindings to the C API, and the DBAPI 
>>> interface on top of that (+a few extension methods resembling 
>>> DuckDB/Turbodbc).
>>>  
>>> There are draft integrations with Ibis [1], DBI (R), and DuckDB. (I'd like 
>>> to thank Hannes and Kirill for their comments, as well as Antoine, Dewey, 
>>> and Matt here.)
>>> 
>>> I'd like to have this as part of 10.0.0 in some fashion. However, I'm not 
>>> sure how we would like to handle packaging and distribution. In particular, 
>>> there are several sub-components for each language (the driver manager + 
>>> the drivers), increasing the work. Any thoughts here?
>>
>> Sorry, forgot to answer here. But I think your question is too broadly
>> formulated. It probably deserves a case-by-case discussion, IMHO.
>>
>>> I'm also wondering how we want to handle this in terms of specification - I 
>>> assume we'd consider the core header file/Java interfaces a spec like the C 
>>> Data Interface/Flight RPC, and vote on them/mirror them into the format/ 
>>> directory?
>>
>> That sounds like the right way to me indeed.
>>
>> Regards
>>
>> Antoine.
