Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-04-09 Thread Dirk Eddelbuettel


On 9 April 2024 at 18:45, Jose Manuel Abuin Mosquera wrote:
| If possible, I would like to contribute. At work we use the Go and 
| Python implementations; in the short term we will also start using the 
| Rust one.

Similar for us, and we have seen plenty of build headaches across pypi or
conda ...

(Hence my earlier hint about nanoarrow. No linking, uses the C API of two
void pointers.)

| Just to point out, the Rust version has its own native implementation, 
| here: https://github.com/apache/arrow-rs .

And IIRC there is an independent Arrow implementation (in Rust) used by polars,
which makes for two possible ITPs: vanilla Arrow from Apache and Arrow from polars.

Dirk 

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-04-09 Thread Jose Manuel Abuin Mosquera




On 25/03/24 at 19:17, Julian Gilbey wrote:

> Hi all,

Hi :)

> [NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to
> set to d-science and the RFP bug only]
>
> An update on Apache Arrow, and in particular the Python library
> PyArrow.  For those who don't know:
>
>    Apache Arrow is a development platform for in-memory analytics. It
>    contains a set of technologies that enable big data systems to
>    process and move data fast. It specifies a standardized
>    language-independent columnar memory format for flat and
>    hierarchical data, organized for efficient analytic operations on
>    modern hardware.
>
>    The project is developing a multi-language collection of libraries
>    for solving systems problems related to in-memory analytical data
>    processing. This includes such topics as:
>
>    * Zero-copy shared memory and RPC-based data movement
>
>    * Reading and writing file formats (like CSV, Apache ORC, and Apache
>      Parquet)
>
>    * In-memory analytics and query processing
>
>    (from: https://arrow.apache.org/docs/index.html)
>
> Pandas has announced that Pandas 3.x will depend on PyArrow
> in a critical way (it will back the "string" datatype), and it is due
> to be released imminently.
>
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!  There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages.  I don't
> know.)  Unfortunately I don't have the capacity to devote any time to
> it myself.
>
> Thanks in advance for anyone who can step forward for this!
>
> Best wishes,
>
> Julian



If possible, I would like to contribute. At work we use the Go and 
Python implementations; in the short term we will also start using the 
Rust one.


Just to point out, the Rust version has its own native implementation, 
here: https://github.com/apache/arrow-rs .


Cheers,

Jose

--
José Manuel Abuín Mosquera
PhD. | Scientific Software Developer | Researcher

http://jmabuin.github.io



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-04-04 Thread Thomas Goirand

On 3/25/24 19:17, Julian Gilbey wrote:

> Hi all,
>
> [NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to
> set to d-science and the RFP bug only]
>
> An update on Apache Arrow, and in particular the Python library
> PyArrow.  For those who don't know:
>
>    Apache Arrow is a development platform for in-memory analytics. It
>    contains a set of technologies that enable big data systems to
>    process and move data fast. It specifies a standardized
>    language-independent columnar memory format for flat and
>    hierarchical data, organized for efficient analytic operations on
>    modern hardware.
>
>    The project is developing a multi-language collection of libraries
>    for solving systems problems related to in-memory analytical data
>    processing. This includes such topics as:
>
>    * Zero-copy shared memory and RPC-based data movement
>
>    * Reading and writing file formats (like CSV, Apache ORC, and Apache
>      Parquet)
>
>    * In-memory analytics and query processing
>
>    (from: https://arrow.apache.org/docs/index.html)
>
> Pandas has announced that Pandas 3.x will depend on PyArrow
> in a critical way (it will back the "string" datatype), and it is due
> to be released imminently.
>
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!  There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages.  I don't
> know.)  Unfortunately I don't have the capacity to devote any time to
> it myself.
>
> Thanks in advance for anyone who can step forward for this!
>
> Best wishes,
>
> Julian


Hi,

I may not have much available time to help, though I'd love to have 
Arrow in Debian, as Ceph uses it and currently uses an embedded version.


Cheers,

Thomas Goirand (zigo)



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-04-04 Thread Richard Duivenvoorde

On 3/25/24 7:17 PM, Julian Gilbey wrote:

> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!  There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages.  I don't
> know.)  Unfortunately I don't have the capacity to devote any time to
> it myself.
>
> Thanks in advance for anyone who can step forward for this!


As someone from the Debian-GIS community, I would also be very interested in 
this!

The Apache Arrow C++ library is one of the dependencies GDAL/OGR needs to 
read/write (geo)parquet files, a data format with a lot of traction in the geo 
community [0]. Packaging it would make it possible for QGIS to handle those 
files (on Debian).

[0] 
https://cloudnativegeo.org/blog/2023/09/duckdb-the-indispensable-geospatial-tool-you-didnt-know-you-were-missing/

Regards,

Richard Duivenvoorde



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-31 Thread Dirk Eddelbuettel


Julian,

Arrow is a complicated and large package. We use it at work (where there is a
fair amount of Python, also via Conda etc) and do have issues with more
complex builds, especially because it is 'data infrastructure' and can come in
from different parts. I would recommend against packaging an old one -- we
have also seen issues with different (py)arrow versions biting.

Have you seen https://github.com/apache/arrow-nanoarrow ?

It works via the C API to Arrow, which interchanges data via two void* to the
two structs for the Arrow array and schema -- and avoids linkage issues. (In
user space the pyarrow or R arrow packages can still be used, also interfacing
via these.)  I have been using it for R package bindings for some time and we
plan to expand that (again, at work) -- as do others. It is already used by
duckdb, by the Arrow 'ADBC' interfaces (which are generic in the ODBC/JDBC
sense but for Arrow), and also by a Python interface to Snowflake.
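
To make the "two void*" idea concrete, here is a rough, stdlib-only Python
sketch. The struct field layouts follow the Arrow C Data Interface; everything
else (the helper names export_int32 and import_int32, the int32-only
restriction, and leaving the release callback NULL) is an assumption made for
illustration and is not how nanoarrow or pyarrow implement it.

```python
import ctypes


class ArrowSchema(ctypes.Structure):
    """Field layout of `struct ArrowSchema` (Arrow C Data Interface)."""


ArrowSchema._fields_ = [
    ("format", ctypes.c_char_p),
    ("name", ctypes.c_char_p),
    ("metadata", ctypes.c_char_p),
    ("flags", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowSchema))),
    ("dictionary", ctypes.POINTER(ArrowSchema)),
    # Simplification: a real producer must point this at a release callback;
    # it stays NULL in this sketch, so a real consumer would reject it.
    ("release", ctypes.c_void_p),
    ("private_data", ctypes.c_void_p),
]


class ArrowArray(ctypes.Structure):
    """Field layout of `struct ArrowArray` (Arrow C Data Interface)."""


ArrowArray._fields_ = [
    ("length", ctypes.c_int64),
    ("null_count", ctypes.c_int64),
    ("offset", ctypes.c_int64),
    ("n_buffers", ctypes.c_int64),
    ("n_children", ctypes.c_int64),
    ("buffers", ctypes.POINTER(ctypes.c_void_p)),
    ("children", ctypes.POINTER(ctypes.POINTER(ArrowArray))),
    ("dictionary", ctypes.POINTER(ArrowArray)),
    ("release", ctypes.c_void_p),  # same simplification as above
    ("private_data", ctypes.c_void_p),
]


def export_int32(values):
    """Hypothetical producer: expose a list of ints as an int32 Arrow array.

    Returns the two raw addresses plus the objects that own the memory.
    """
    n = len(values)
    data = (ctypes.c_int32 * n)(*values)
    # Buffer 0 is the validity bitmap (NULL: no nulls), buffer 1 the values.
    buffers = (ctypes.c_void_p * 2)(None, ctypes.addressof(data))
    schema = ArrowSchema(format=b"i", name=b"", flags=0, n_children=0)
    array = ArrowArray(
        length=n, null_count=0, offset=0, n_buffers=2, n_children=0,
        buffers=ctypes.cast(buffers, ctypes.POINTER(ctypes.c_void_p)),
    )
    keepalive = (data, buffers, schema, array)
    return ctypes.addressof(array), ctypes.addressof(schema), keepalive


def import_int32(array_addr, schema_addr):
    """Hypothetical consumer: rebuild the values from the two addresses only."""
    schema = ArrowSchema.from_address(schema_addr)
    array = ArrowArray.from_address(array_addr)
    if schema.format != b"i":
        raise TypeError("this sketch only handles int32 (format 'i')")
    values = ctypes.cast(array.buffers[1], ctypes.POINTER(ctypes.c_int32))
    return [values[i] for i in range(array.offset, array.offset + array.length)]


array_addr, schema_addr, keepalive = export_int32([10, 20, 30])
print(import_int32(array_addr, schema_addr))
```

The consumer sees nothing but two integers and the agreed struct layout, which
is why neither side needs to link against the other's Arrow library.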

Dirk

-- 
dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-31 Thread Julian Gilbey
Hi Diane,

On Sat, Mar 30, 2024 at 08:59:39PM -0700, Diane Trout wrote:
> Hi Julian,
> 
> On Sat, 2024-03-30 at 20:22 +, Julian Gilbey wrote:
> > Lovely to hear from you, and oh wow, that's amazing, thank you!
> > 
> > I can't speak for anyone else, but I suggest that pushing your
> > updates
> > to the science-team package would be very sensible; it would be silly
> > for someone else to have to redo your work.
> > 
> > What more is needed for it to be ready for unstable?
> 
> 
> The things I think are kind of broken are:
> 
> We've got 7.0.0 and upstream's current version is 15.0.2.

Yes, that does seem a little less than ideal!

> The pyarrow 7.0.0 tests fail because they depend on a Python test
> library that breaks with pytest 8.0. Either I need to disable the
> Python tests or upgrade to a newer version.

It may well be that newer versions would work with pytest 8.x.  I
don't think it's worth spending time trying to patch such a relatively
old version.

> My upgrade didn't go smoothly because uscan also found upstream's debian
> watch file, which is too loose and matches some other tarballs on their
> distribution site.
> 
> (Though I don't know why uscan keeps looking for watch files after
> finding one in debian/watch.)

Oh dear.  uscan(1) does say:

   Unless --watchfile is given, uscan looks recursively for valid source
   trees starting from the current directory (see the below section
   "Directory name checking" for details).

and then:

   For each valid source tree found, typically the following happens:
   [...]

so yes, it will look at more than one location.
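
As a rough illustration of what "too loose" can mean for a watch pattern, here
is a small Python sketch. uscan itself uses Perl regular expressions in
debian/watch; the two patterns and the file names below are made up for the
example and are not taken from either watch file.

```python
import re

# Hypothetical tarball names on a shared distribution site (assumption).
CANDIDATES = [
    "apache-arrow-15.0.2.tar.gz",
    "apache-arrow-adbc-0.10.0.tar.gz",
    "apache-arrow-nanoarrow-0.4.0.tar.gz",
]

# A loose pattern: any name ending in -<version>.tar.gz.
LOOSE = re.compile(r".*-(\d[\d.]*)\.tar\.gz")

# A tighter pattern: the version must directly follow "apache-arrow-".
TIGHT = re.compile(r"apache-arrow-(\d+(?:\.\d+)+)\.tar\.gz")


def matched_versions(pattern, names):
    """Return {file name: version} for every name the pattern fully matches."""
    result = {}
    for name in names:
        m = pattern.fullmatch(name)
        if m:
            result[name] = m.group(1)
    return result


print(matched_versions(LOOSE, CANDIDATES))  # picks up all three tarballs
print(matched_versions(TIGHT, CANDIDATES))  # only the main Arrow tarball
```

With the loose pattern, versions of unrelated tarballs get mixed into the
comparison; anchoring the version right after the package name avoids that.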

> And you were probably right in that arrow needs to be a team, because I
> have no idea how to get the other language interfaces packaged.

Without anyone else volunteering to do those other
language interfaces (perhaps it's not a pressing need for people
working with language X), I wonder whether it's worth just packaging
the Python (and presumably C++) interfaces for now, and then if others
want to join the effort to support language X later on, a new version
of the Debian package can be uploaded with a new binary package for
language X.  It does mean more trips through the NEW queue if and when
that happens, but given that no-one's shown interest in language X for
the last several years, this is unlikely to be much of an issue.

Version 7.0 provided support (it seems) for: GLib (seems that a draft
framework for building this is already in the Debian package, and it
can then be used in lots of languages), C++ (this is the core
libraries), C# (not of interest to us), Go, Java, JavaScript, Julia,
Matlab (not of interest to us), Python, R, Ruby.

> Oh, and I probably need to get pyarrow installed somewhere; since it
> was stopping at the tests, I hadn't run into dh_missing errors yet.

Oh.  Would pybuild do that automatically (perhaps specifying
PYBUILD_PACKAGE)?

Best wishes,

   Julian



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-30 Thread Diane Trout
Hi Julian,

On Sat, 2024-03-30 at 20:22 +, Julian Gilbey wrote:
> Lovely to hear from you, and oh wow, that's amazing, thank you!
> 
> I can't speak for anyone else, but I suggest that pushing your
> updates
> to the science-team package would be very sensible; it would be silly
> for someone else to have to redo your work.
> 
> What more is needed for it to be ready for unstable?


The things I think are kind of broken are:

We've got 7.0.0 and upstream's current version is 15.0.2.

The pyarrow 7.0.0 tests fail because they depend on a Python test
library that breaks with pytest 8.0. Either I need to disable the
Python tests or upgrade to a newer version.

My upgrade didn't go smoothly because uscan also found upstream's debian
watch file, which is too loose and matches some other tarballs on their
distribution site.

(Though I don't know why uscan keeps looking for watch files after
finding one in debian/watch.)

And you were probably right in that arrow needs to be a team, because I
have no idea how to get the other language interfaces packaged.

Oh, and I probably need to get pyarrow installed somewhere; since it
was stopping at the tests, I hadn't run into dh_missing errors yet.

Diane



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-30 Thread Julian Gilbey
Hi Diane,

On Fri, Mar 29, 2024 at 11:49:07AM -0700, Diane Trout wrote:
> On Mon, 2024-03-25 at 18:17 +, Julian Gilbey wrote:
> > 
> > 
> > So this is a plea for anyone looking for something really helpful to
> > do: it would be great to have a group of developers finally package
> > this!  There was some initial work done (see the RFP bug report for
> > details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> > but that is fairly old now.  As Apache Arrow supports numerous
> > languages, it may well benefit from having a group of developers with
> > different areas of expertise to build it.  (Or perhaps it would make
> > more sense to split the upstream source into a collection of
> > different
> > Debian source packages for the different supported languages.  I
> > don't
> > know.)  Unfortunately I don't have the capacity to devote any time to
> > it myself.
> > 
> > Thanks in advance for anyone who can step forward for this!
> 
> I've been maintaining dask and anndata and saw that Apache Arrow was
> getting increasingly popular.
> 
> I took the current science-team preliminary 7.0.0 packaging
> and managed to get it to build through a combination of patches and
> turning off features.
> 
> I even mostly managed to get pyarrow to build. (Though some tests fail
> due to pytest-lazy-fixture being abandoned.)
> 
> I pushed my current work in progress to:
> 
> https://salsa.debian.org/diane/arrow.git
> 
> Was anyone else planning on working on it or should I push my updates
> to the science-team package?

Lovely to hear from you, and oh wow, that's amazing, thank you!

I can't speak for anyone else, but I suggest that pushing your updates
to the science-team package would be very sensible; it would be silly
for someone else to have to redo your work.

What more is needed for it to be ready for unstable?

Best wishes,

   Julian



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-29 Thread Rene Engelhard

Hi,

On 25.03.24 at 19:17, Julian Gilbey wrote:

>    * Reading and writing file formats (like CSV, Apache ORC, and Apache
>      Parquet)

liborcus supports this (Apache Parquet) if built with Apache Arrow, and 
thus makes LibreOffice able to handle it.

I didn't invest any time in Apache Arrow since I am already too low on 
time anyway, and I deemed it too much of a "low popularity" thing.

> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!

Indeed.

> There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of different
> Debian source packages for the different supported languages.  I don't
> know.)

Would definitely make transitions easier.

> Unfortunately I don't have the capacity to devote any time to
> it myself.

Ditto.


Regards,


Rene



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-29 Thread Diane Trout
On Mon, 2024-03-25 at 18:17 +, Julian Gilbey wrote:
> 
> 
> So this is a plea for anyone looking for something really helpful to
> do: it would be great to have a group of developers finally package
> this!  There was some initial work done (see the RFP bug report for
> details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
> but that is fairly old now.  As Apache Arrow supports numerous
> languages, it may well benefit from having a group of developers with
> different areas of expertise to build it.  (Or perhaps it would make
> more sense to split the upstream source into a collection of
> different
> Debian source packages for the different supported languages.  I
> don't
> know.)  Unfortunately I don't have the capacity to devote any time to
> it myself.
> 
> Thanks in advance for anyone who can step forward for this!

I've been maintaining dask and anndata and saw that Apache Arrow was
getting increasingly popular.

I took the current science-team preliminary 7.0.0 packaging
and managed to get it to build through a combination of patches and
turning off features.

I even mostly managed to get pyarrow to build. (Though some tests fail
due to pytest-lazy-fixture being abandoned.)

I pushed my current work in progress to:

https://salsa.debian.org/diane/arrow.git

Was anyone else planning on working on it or should I push my updates
to the science-team package?

Diane



Bug#970021: Seeking a small group to package Apache Arrow (was: Bug#970021: RFP: apache-arrow -- cross-language development platform for in-memory analytics)

2024-03-25 Thread Julian Gilbey
Hi all,

[NB: sent to d-science, d-python, d-devel and the RFP bug; reply-to
set to d-science and the RFP bug only]

An update on Apache Arrow, and in particular the Python library
PyArrow.  For those who don't know:

  Apache Arrow is a development platform for in-memory analytics. It
  contains a set of technologies that enable big data systems to
  process and move data fast. It specifies a standardized
  language-independent columnar memory format for flat and
  hierarchical data, organized for efficient analytic operations on
  modern hardware.

  The project is developing a multi-language collection of libraries
  for solving systems problems related to in-memory analytical data
  processing. This includes such topics as:

  * Zero-copy shared memory and RPC-based data movement

  * Reading and writing file formats (like CSV, Apache ORC, and Apache
    Parquet)

  * In-memory analytics and query processing

  (from: https://arrow.apache.org/docs/index.html)
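
For readers unfamiliar with the term, a "columnar memory format" can be
sketched in a few lines of plain Python. This is only an illustration of the
idea for one nullable int32 column (a validity bitmap plus a contiguous values
buffer); the function names are made up, and real Arrow buffers additionally
have alignment and padding requirements that are ignored here.

```python
import struct


def to_columnar_int32(values):
    """Encode a list of ints/None as (validity bitmap, values buffer).

    Bit i of the bitmap (least-significant bit first) is 1 when slot i holds
    a value; the values buffer stores little-endian int32 slots back to back.
    """
    bitmap = bytearray((len(values) + 7) // 8)
    data = bytearray()
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)
        # Null slots still occupy space; their content is arbitrary (0 here).
        data += struct.pack("<i", 0 if v is None else v)
    return bytes(bitmap), bytes(data)


def from_columnar_int32(bitmap, data, length):
    """Decode the two buffers back into a list of ints/None."""
    out = []
    for i in range(length):
        valid = (bitmap[i // 8] >> (i % 8)) & 1
        out.append(struct.unpack_from("<i", data, 4 * i)[0] if valid else None)
    return out


column = [1, None, 3]
bitmap, data = to_columnar_int32(column)
print(bitmap, len(data))
print(from_columnar_int32(bitmap, data, len(column)))
```

Because every value of a column sits in one contiguous buffer with a fixed
slot width, any language that knows the layout can read it without copying or
re-parsing, which is what the description above is getting at.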

Pandas has announced that Pandas 3.x will depend on PyArrow
in a critical way (it will back the "string" datatype), and it is due
to be released imminently.

So this is a plea for anyone looking for something really helpful to
do: it would be great to have a group of developers finally package
this!  There was some initial work done (see the RFP bug report for
details: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021),
but that is fairly old now.  As Apache Arrow supports numerous
languages, it may well benefit from having a group of developers with
different areas of expertise to build it.  (Or perhaps it would make
more sense to split the upstream source into a collection of different
Debian source packages for the different supported languages.  I don't
know.)  Unfortunately I don't have the capacity to devote any time to
it myself.

Thanks in advance for anyone who can step forward for this!

Best wishes,

   Julian