Hi Sutou, cool, thanks for your comments! Let's see if I can elaborate a bit more on my ideas.
> I'm the original author of the Debian packages for Debian.
> I'm positive that Apache Arrow package exists in the
> official Debian repository.

I checked but could not find any: searching in the Debian package index
lists nothing related to Apache Arrow [0]. Also, I had filed an RFP
(request for packaging) a long time ago [1], and if there had been such
a package, I am sure the maintainer would have closed the RFP and
directed me towards the existing package ;)

>> I do have a working package based on the JFrog packaging groundwork [0]
>> but had to make various changes mostly to avoid downloading dependencies
>> from the Internet (which is not allowed during the Debian build
>> process). So, mostly setting -DARROW_DEPENDENCY_SOURCE=SYSTEM and tuning
>> enabled/disabled features based on what we have and what we don't.
>> Result is at [1].
>
> Could you create a "diff -ru" output between [0] and [1]?

Sure, attached. I think some changed lines can also be reverted to your
original state, I may have oversimplified some of them during hacking.
dh_auto_test is currently disabled because the rules target downloads
stuff. Maybe we want to package that as well to avoid 'git clone' there.

>> The only exception here are ORC and S3 support, which are
>> missing because the ORC library [2] and the AWS C++ SDK
>> [3] are not packaged yet.
>
> Do you have a plan to package them? If they exist in the
> official Debian repository, we can use them.

I don't really have such a plan, sorry. They have in turn numerous
dependencies which would also need to be packaged separately, and I do
not need them myself. It would be easy to enable support for these
features as soon as _somebody_ packages them eventually. Which I am
pretty confident will happen, as I guess AWS is not going to go away
soon ;)

>> 1.) Would somebody from the upstream team be interested in collaborating
>> to keep Arrow maintained in Debian? I would be able to review updates
>> and sponsor uploads.
>
> I'm interested in it. How about the following way?
>
> 1. You open pull requests for each your improvement
>    to https://github.com/apache/arrow/ .
>
> 2. We mention you on GitHub when we open a pull request
>    that is related to Debian packages such as
>    https://github.com/apache/arrow/pull/10514 .
>
> 3. You upload our Debian package to the official Debian
>    repository when we release a new version.
>    You can notice a new release on this mailing list.

Interesting -- that's not how it usually works. Debian packaging code is
not expected to live within the upstream code repository but within a
dedicated packaging repository (see [2] as an example) which contains
the upstream code (version-tracked in a separate branch), the debian
directory and an additional pristine-tar branch to produce byte-correct
replicates of the original upstream tarball. Most currently popular and
reliable Debian development tooling (such as git-buildpackage)
implicitly expects and requires this layout. The packaging repo is
typically also supposed (but not required) to live on salsa.debian.org,
the official Debian development GitLab. But usually most upstream
projects do not want to have these Debian-specific branches cluttering
their repo space.

Also, when you say I would upload "your" Debian package, do you mean the
.debs? Because for something to get accepted into Debian, we need to
upload only _source_ packages, not binary packages. Each package must be
built on Debian servers from source.

So... No offense, but I don't think merging my packaging code into yours
is the best idea. What do you think about the following, more
established approach:

0) You clone the salsa repository [2] locally and keep it in sync with
   the version on salsa.
1) You release a new version via GitHub. That means there will be a new
   release tarball to download via uscan.
2) You import the new tarball into your local packaging repo with
   'gbp import-orig', update debian/changelog to reflect the new
   version, update debian/copyright if there are new files, refresh
   patches, etc.
3) You build a new package with git-buildpackage in a local chroot
   (e.g. with sbuild or cowbuilder, ...) to make sure that everything
   builds correctly.
4) You push your changeset to the salsa repo, tag a Debian version and
   ping me to review the packaging. I will then build a source package,
   sign it and upload it to be built on Debian's build farm for all
   platforms.

That is the workflow for releasing a new version; it would of course be
similar for other updates (bugfixes in the packaging, etc). I would make
sure you get all the necessary permissions to work on the salsa
repository.

What do you think? I know that this would mean moving the Debian
packaging workflow outside of your Arrow repository, but I think it
would make life easier in the long run.

Another option would be that you just send me a source package for each
version you'd like to see uploaded (*.orig.tar.gz, *.dsc and
*.debian.tar.xz) and I would use that for review and upload. But then
any change that I might want to make would eventually need to be fed
back into your upstream repository, and I think we can do without the
extra round-trip if we keep everything Debian-related in one place.

[...]

>> Is the LICENSE.txt in the Arrow source root directory complete and lists
>> _all_ third-party licenses and copyright holders in the release tarball?
>
> No. Most of them are covered but some of them only exists in
> source code such as
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/mman.h .

Okay, looks like I'd have to actually look through everything, gathering
and documenting licenses.
Might take a while :D

Thanks
Sascha

[0] https://packages.debian.org/search?suite=default&section=all&arch=any&searchon=names&keywords=arrow
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
[2] https://salsa.debian.org/satta/arrow
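
P.S. A first pass over the license survey could probably be scripted.
What follows is only a minimal sketch in Python, not an existing Debian
or Arrow tool: the function name scan_tree, the 40-line header window
and the two regular expressions are made up for illustration. It groups
files by the copyright lines and SPDX tags found near the top of each
file. Notices without a year (e.g. a bare "Copyright <URL>" line) would
be missed, so the result would still need a manual review.

```python
import re
from collections import defaultdict
from itertools import islice
from pathlib import Path

# Hypothetical patterns: "Copyright (c) 2012 Jane Doe", "Copyright 2016 Foo",
# and "SPDX-License-Identifier: MIT". Year-less notices are NOT matched.
COPYRIGHT_RE = re.compile(r"copyright\s+(?:\(c\)\s*|©\s*)?(\d{4}.*)", re.IGNORECASE)
SPDX_RE = re.compile(r"SPDX-License-Identifier:\s*(\S+)")


def scan_tree(root, max_lines=40):
    """Map each notice found in a file header to the files that carry it.

    Only the first `max_lines` lines of every regular file below `root`
    are inspected. Paths in the result are relative to `root`.
    """
    root = Path(root)
    notices = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="replace" keeps binary files from raising decode errors.
            with path.open(encoding="utf-8", errors="replace") as fh:
                head = list(islice(fh, max_lines))
        except OSError:
            continue  # unreadable file: skip it, a human has to check anyway
        rel = path.relative_to(root).as_posix()
        for line in head:
            spdx = SPDX_RE.search(line)
            if spdx:
                notices[f"SPDX {spdx.group(1)}"].append(rel)
            holder = COPYRIGHT_RE.search(line)
            if holder:
                notices[f"Copyright {holder.group(1).strip()}"].append(rel)
    return dict(notices)
```

Calling scan_tree() on the unpacked release tarball would give a
dictionary from notice text to file lists, which could serve as raw
material when writing debian/copyright by hand.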
diff -ru ./control /home/satta/pkg/dcso/arrow/arrow/debian/control
--- ./control 2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/control 2021-05-18 22:44:16.705459282 +0200
@@ -4,7 +4,7 @@
 Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
 Build-Depends:
   cmake,
-  debhelper (>= 12),
+  debhelper (>= 13),
   git,
   gobject-introspection,
   gtk-doc-tools,
@@ -26,11 +26,21 @@
   ninja-build,
   nvidia-cuda-toolkit [!arm64],
   pkg-config,
+  libprotobuf-dev,
+  libprotoc-dev,
   protobuf-compiler-grpc,
   python3-dev,
   python3-numpy,
+  rapidjson-dev,
+  libjemalloc-dev,
+  libthrift-dev,
   tzdata,
-  zlib1g-dev
+  zlib1g-dev,
+  clang,
+  clang-tidy,
+  llvm,
+  llvm-dev,
+  meson
 Build-Depends-Indep: libglib2.0-doc
 Standards-Version: 3.9.8
 Homepage: https://arrow.apache.org/
@@ -53,9 +63,9 @@
 Multi-Arch: same
 Pre-Depends: ${misc:Pre-Depends}
 Depends:
-  ${misc:Depends},
-  ${shlibs:Depends},
-  libarrow400 (= ${binary:Version})
+  ${misc:Depends},
+  ${shlibs:Depends},
+  libarrow400 (= ${binary:Version})
 Description: Apache Arrow is a data processing library for analysis
  .
  This package provides C++ library files for CUDA support.
@@ -133,6 +143,9 @@
   libutf8proc-dev,
   libzstd-dev,
   protobuf-compiler-grpc,
+  libprotobuf-dev,
+  libprotoc-dev,
+  libthrift-dev,
   zlib1g-dev
 Description: Apache Arrow is a data processing library for analysis
  .
@@ -143,9 +156,9 @@
 Architecture: i386 amd64
 Multi-Arch: same
 Depends:
-  ${misc:Depends},
-  libarrow-dev (= ${binary:Version}),
-  libarrow-cuda400 (= ${binary:Version})
+  ${misc:Depends},
+  libarrow-dev (= ${binary:Version}),
+  libarrow-cuda400 (= ${binary:Version})
 Description: Apache Arrow is a data processing library for analysis
  .
  This package provides C++ header files for CUDA support.
@@ -206,11 +219,10 @@
 Multi-Arch: same
 Pre-Depends: ${misc:Pre-Depends}
 Depends:
-  ${misc:Depends},
-  ${shlibs:Depends},
-  libarrow400 (= ${binary:Version})
-Description: Gandiva is a toolset for compiling and evaluating expressions
-  on Arrow Data.
+  ${misc:Depends},
+  ${shlibs:Depends},
+  libarrow400 (= ${binary:Version})
+Description: Gandiva is a toolset for compiling and evaluating expressions on Arrow Data.
  .
  This package provides C++ library files.
@@ -222,8 +234,7 @@
   ${misc:Depends},
   libarrow-dev (= ${binary:Version}),
   libgandiva400 (= ${binary:Version})
-Description: Gandiva is a toolset for compiling and evaluating expressions
-  on Arrow Data.
+Description: Gandiva is a toolset for compiling and evaluating expressions on Arrow Data.
  .
  This package provides C++ header files.
diff -ru ./libarrow-dev.install /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-dev.install
--- ./libarrow-dev.install 2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-dev.install 2021-05-18 22:01:42.307671397 +0200
@@ -7,6 +7,7 @@
 usr/lib/*/cmake/arrow/FindBrotli.cmake
 usr/lib/*/cmake/arrow/FindLz4.cmake
 usr/lib/*/cmake/arrow/FindSnappy.cmake
+usr/lib/*/cmake/arrow/FindThrift.cmake
 usr/lib/*/cmake/arrow/Findutf8proc.cmake
 usr/lib/*/cmake/arrow/Findzstd.cmake
 usr/lib/*/cmake/arrow/arrow-config.cmake
@@ -17,5 +18,5 @@
 usr/lib/*/pkgconfig/arrow-csv.pc
 usr/lib/*/pkgconfig/arrow-filesystem.pc
 usr/lib/*/pkgconfig/arrow-json.pc
-usr/lib/*/pkgconfig/arrow-orc.pc
+#usr/lib/*/pkgconfig/arrow-orc.pc
 usr/lib/*/pkgconfig/arrow.pc
diff -ru ./libarrow-glib-dev.install /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-glib-dev.install
--- ./libarrow-glib-dev.install 2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-glib-dev.install 2021-05-18 21:36:48.805671734 +0200
@@ -1,6 +1,6 @@
 usr/include/arrow-glib/
 usr/lib/*/libarrow-glib.so
 usr/lib/*/pkgconfig/arrow-glib.pc
-usr/lib/*/pkgconfig/arrow-orc-glib.pc
+#usr/lib/*/pkgconfig/arrow-orc-glib.pc
 usr/share/arrow-glib/example/
 usr/share/gir-1.0/Arrow-1.0.gir
Only in /home/satta/pkg/dcso/arrow/arrow/debian/patches: fix-llvm-requirement.patch
diff -ru ./patches/series /home/satta/pkg/dcso/arrow/arrow/debian/patches/series
--- ./patches/series 2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/patches/series 2021-05-16 15:28:28.184586150 +0200
@@ -0,0 +1 @@
+fix-llvm-requirement.patch
diff -ru ./rules /home/satta/pkg/dcso/arrow/arrow/debian/rules
--- ./rules 2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/rules 2021-05-18 21:20:26.778902537 +0200
@@ -14,28 +14,28 @@
     dh $@ --with gir

 override_dh_auto_configure:
-    if dpkg -l nvidia-cuda-toolkit > /dev/null 2>&1; then \
-      ARROW_CUDA=ON; \
-      ARROW_PLASMA=ON; \
-    else \
-      ARROW_CUDA=OFF; \
-      ARROW_PLASMA=OFF; \
-    fi; \
-    dh_auto_configure \
+    dh_auto_configure \
       --sourcedirectory=cpp \
-      --builddirectory=cpp_build \
+      --builddirectory=cpp_build \
       --buildsystem=cmake+ninja \
-      -- \
-      -DARROW_CUDA=$${ARROW_CUDA} \
-      -DARROW_FLIGHT=ON \
+      -- \
+      -DARROW_CUDA=ON \
+      -DARROW_CSV=ON \
+      -DARROW_COMPUTE=ON \
+      -DARROW_DATASET=ON \
+      -DARROW_DEPENDENCY_SOURCE=SYSTEM \
+      -DARROW_JSON=ON \
+      -DARROW_FILESYSTEM=ON \
+      -DARROW_FLIGHT=ON \
       -DARROW_GANDIVA=ON \
       -DARROW_GANDIVA_JAVA=OFF \
-      -DARROW_MIMALLOC=ON \
-      -DARROW_ORC=ON \
+      -DARROW_JEMALLOC=ON \
+      -DARROW_MIMALLOC=OFF \
+      -DARROW_ORC=OFF \
       -DARROW_PARQUET=ON \
-      -DARROW_PLASMA=$${ARROW_PLASMA} \
-      -DARROW_PYTHON=ON \
-      -DARROW_S3=ON \
+      -DARROW_PLASMA=ON \
+      -DARROW_PYTHON=ON \
+      -DARROW_S3=OFF \
       -DARROW_USE_CCACHE=OFF \
       -DARROW_WITH_BROTLI=ON \
       -DARROW_WITH_BZ2=ON \
@@ -43,11 +43,7 @@
       -DARROW_WITH_SNAPPY=ON \
       -DARROW_WITH_ZLIB=ON \
       -DARROW_WITH_ZSTD=ON \
-      -DCMAKE_BUILD_TYPE=$(BUILD_TYPE) \
-      -DCMAKE_UNITY_BUILD=ON \
-      -DPARQUET_REQUIRE_ENCRYPTION=ON \
-      -DPythonInterp_FIND_VERSION=ON \
-      -DPythonInterp_FIND_VERSION_MAJOR=3
+      -DCMAKE_BUILD_TYPE=$(BUILD_TYPE)

 override_dh_auto_build:
     dh_auto_build \
@@ -88,15 +84,7 @@
       --builddirectory=cpp_build

 override_dh_auto_test:
-    # TODO: We need Boost 1.64 or later to build tests for
-    # Apache Arrow Flight.
-    # git clone --depth 1 https://github.com/apache/arrow-testing.git
-    # git clone --depth 1 https://github.com/apache/parquet-testing.git
-    # cd cpp_build && \
-    #   env \
-    #     ARROW_TEST_DATA=$(CURDIR)/arrow-testing/data \
-    #     PARQUET_TEST_DATA=$(CURDIR)/parquet-testing/data \
-    #     ctest --exclude-regex 'arrow-cuda-test|plasma-client_tests'
+    # pass

 # skip file failing with "Unknown DWARF DW_OP_255" (see bug#949296)
 override_dh_dwz: