Re: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-31 Thread Dirk Eddelbuettel


Julian,

Arrow is a large and complicated package. We use it at work (where there is a
fair amount of Python, also via Conda etc) and do have issues with more
complex builds, especially because it is 'data infrastructure' and can come in
from different parts. I would recommend against packaging an old one -- we
have also seen issues with different (py)arrow versions biting.

Have you seen https://github.com/apache/arrow-nanoarrow ?

It works via the C API to Arrow, which interchanges data via two void* to the
two structs for the Arrow array and schema -- and so avoids linkage issues. (In
user space the pyarrow or R arrow packages can still be used, also interfacing
via these.)  I have been using it for R package bindings for some time and we
plan to expand that (again, at work) -- as do others. It is already used by
duckdb, by the Arrow 'ADBC' interfaces (which are generic in the ODBC/JDBC
sense, but for Arrow), and also by a Python interface to Snowflake.
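
To make the "two pointers to two structs" idea concrete, here is a rough
sketch in plain Python ctypes, with no Arrow library linked at all. The
struct layouts follow the published Arrow C data interface; everything else
(the helper names export_int32 and import_int32, the _keepalive list, and
passing the addresses around as plain integers) is made up for illustration
and is not how nanoarrow or any binding actually does it.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    pass


class ArrowArray(ctypes.Structure):
    pass


# Release callbacks take a pointer to the struct they belong to.
SchemaRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowSchema))
ArrayRelease = ctypes.CFUNCTYPE(None, ctypes.POINTER(ArrowArray))

# Field order and types mirror the Arrow C data interface structs.
ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    ("release", SchemaRelease),
    ("private_data", ctypes.c_void_p),
]
ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ArrayRelease),
    ("private_data", ctypes.c_void_p),
]

# Hypothetical: keeps producer-owned memory alive for this demo only.
_keepalive = []


@SchemaRelease
def _release_schema(ptr):
    # A NULL release pointer marks the struct as released.
    ptr.contents.release = SchemaRelease()


@ArrayRelease
def _release_array(ptr):
    ptr.contents.release = ArrayRelease()


def export_int32(values):
    """Producer side: fill both structs, hand back their two addresses."""
    n = len(values)
    data = (ctypes.c_int32 * n)(*values)
    # Buffer 0 is the validity bitmap (NULL here: no nulls), buffer 1 the data.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    schema = ArrowSchema(format=b"i", name=b"", flags=0, n_children=0,
                         release=_release_schema)  # "i" means int32
    array = ArrowArray(length=n, null_count=0, offset=0, n_buffers=2,
                       n_children=0,
                       buffers=ctypes.cast(buffers,
                                           ctypes.POINTER(ctypes.c_void_p)),
                       release=_release_array)
    _keepalive.extend([data, buffers, schema, array])
    return ctypes.addressof(schema), ctypes.addressof(array)


def import_int32(schema_addr, array_addr):
    """Consumer side: sees only two opaque addresses, then releases both."""
    schema = ArrowSchema.from_address(schema_addr)
    array = ArrowArray.from_address(array_addr)
    if not schema.release or not array.release:
        raise ValueError("structure has already been released")
    if schema.format != b"i":
        raise TypeError("this sketch only handles int32")
    data = ctypes.cast(array.buffers[1], ctypes.POINTER(ctypes.c_int32))
    values = [data[array.offset + i] for i in range(array.length)]
    array.release(ctypes.pointer(array))
    schema.release(ctypes.pointer(schema))
    return values
```

The point is that the consumer needs only the two struct definitions, never
the Arrow C++ library, which is why mismatched library versions on either
side stop mattering.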

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



Re: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-31 Thread Julian Gilbey

Hi Diane,

On Sat, Mar 30, 2024 at 08:59:39PM -0700, Diane Trout wrote:
> Hi Julian,
> 
> On Sat, 2024-03-30 at 20:22 +, Julian Gilbey wrote:
> > Lovely to hear from you, and oh wow, that's amazing, thank you!
> > 
> > I can't speak for anyone else, but I suggest that pushing your
> > updates
> > to the science-team package would be very sensible; it would be silly
> > for someone else to have to redo your work.
> > 
> > What more is needed for it to be ready for unstable?
> 
> 
> The things I think are kind of broken are:
> 
> We've got 7.0.0 and upstream's current version is 15.0.2.

Yes, that does seem a little less than ideal!

> The pyarrow 7.0.0 tests fail because they depend on a Python test
> library that breaks with pytest 8.0. Either I need to disable the
> Python tests or upgrade to a newer version.

It may well be that newer versions would work with pytest 8.x.  I
don't think it's worth spending time trying to patch such a relatively
old version.

> My upgrade didn't go smoothly because uscan also found upstream's debian
> watch file, which is too loose and matches some other tarballs on their
> distribution site.
> 
> (Though I don't know why uscan keeps looking for watch files after
> finding one in debian/watch)

Oh dear.  uscan(1) does say:

   Unless --watchfile is given, uscan looks recursively for valid source
   trees starting from the current directory (see the below section
   "Directory name checking" for details).

and then:

   For each valid source tree found, typically the following happens:
   [...]

so yes, it will look at more than one location.
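
As an aside on the "too loose" part: the effect is easy to reproduce with
ordinary regular expressions. The sketch below is not uscan itself (its
matching has more to it), and the file names and both patterns are invented
for illustration rather than taken from either watch file.

```python
import re

# Hypothetical directory listing; these names are made up.
listing = [
    "apache-arrow-15.0.2.tar.gz",
    "some-other-component-0.9.1.tar.gz",
    "apache-arrow-15.0.2.tar.gz.asc",
]

# Loose: any name ending in -<version>.tar.gz is accepted.
loose = re.compile(r".*-(\d[\d.]*)\.tar\.gz")
# Tight: the package name is spelled out in full.
tight = re.compile(r"apache-arrow-(\d+(?:\.\d+)+)\.tar\.gz")


def versions(pattern, names):
    """Return the version captured from every name the pattern fully matches."""
    found = []
    for name in names:
        match = pattern.fullmatch(name)
        if match:
            found.append(match.group(1))
    return found


loose_hits = versions(loose, listing)
tight_hits = versions(tight, listing)
```

Here loose_hits picks up the unrelated tarball's version as well, while
tight_hits contains only the apache-arrow one.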

> And you were probably right in that arrow needs to be a team, because I
> have no idea how to get the other language interfaces packaged.

Without anyone else volunteering to do those other language
interfaces (perhaps it's not a pressing need for people working with
language X), I wonder whether it's worth just packaging the Python
(and presumably C++) interfaces for now.  Then if others want to join
the effort to support language X later on, a new version of the
Debian package can be uploaded with a new binary package for
language X.  It does mean more trips through the NEW queue if and when
that happens, but given that no-one's shown interest in language X for
the last several years, this is unlikely to be much of an issue.

Version 7.0 provided support (it seems) for: GLib (seems that a draft
framework for building this is already in the Debian package, and it
can then be used in lots of languages), C++ (this is the core
libraries), C# (not of interest to us), Go, Java, JavaScript, Julia,
Matlab (not of interest to us), Python, R, Ruby.

> Oh, and I probably need to get pyarrow installed somewhere; since it
> was stopping at the tests, I hadn't run into dh_missing errors yet.

Oh.  Would pybuild do that automatically (perhaps specifying
PYBUILD_NAME)?

Best wishes,

   Julian