Hi Sutou,

cool, thanks for your comments! Let's see if I can elaborate a bit more
on my ideas.

> I'm the original author of the Debian packages for Debian.
> I'm positive that Apache Arrow package exists in the
> official Debian repository.

I checked but could not find any: searching in the Debian package index
lists nothing related to Apache Arrow [0].

Also, I had filed an RFP (request for packaging) a long time ago [1],
and if there had been such a package, I am sure the maintainer would
have closed the RFP and directed me towards the existing package ;)

>> I do have a working package based on the JFrog packaging groundwork [0]
>> but had to make various changes mostly to avoid downloading dependencies
>> from the Internet (which is not allowed during the Debian build
>> process). So, mostly setting -DARROW_DEPENDENCY_SOURCE=SYSTEM and tuning
>> enabled/disabled features based on what we have and what we don't.
>> Result is at [1].
> 
> Could you create a "diff -ru" output between [0] and [1]?

Sure, attached. I think some of the changed lines can be reverted to
your original version; I may have oversimplified some of them while
hacking.

dh_auto_test is currently disabled because the rules target would need
to download test data. Maybe we want to package that data as well to
avoid the 'git clone' there.

>> The only exception here are ORC and S3 support, which are
>> missing because the ORC library [2] and the AWS C++ SDK
>> [3] are not packaged yet.
> 
> Do you have a plan to package them? If they exist in the
> official Debian repository, we can use them.

I don't really have such a plan, sorry. They in turn have numerous
dependencies which would also need to be packaged separately, and I do
not need them myself. It would be easy to enable support for these
features as soon as _somebody_ packages them, which I am pretty
confident will happen eventually, as I guess AWS is not going to go away
soon ;)

>> 1.) Would somebody from the upstream team be interested in collaborating
>> to keep Arrow maintained in Debian? I would be able to review updates
>> and sponsor uploads.
> 
> I'm interested in it. How about the following way?
> 
>   1. You open pull requests for each your improvement
>      to https://github.com/apache/arrow/ .
> 
>   2. We mention you on GitHub when we open a pull request
>      that is related to Debian packages such as
>      https://github.com/apache/arrow/pull/10514 .
> 
>   3. You upload our Debian package to the official Debian
>      repository when we release a new version.
>      You can notice a new release on this mailing list.

Interesting -- that's not how it usually works. Debian packaging code is
not expected to live within the upstream code repository but within a
dedicated packaging repository (see [2] as an example). That repository
contains the upstream code (version-tracked in a separate branch), the
debian directory and an additional pristine-tar branch used to reproduce
byte-identical copies of the original upstream tarball. Most of the
currently popular and reliable Debian development tooling (such as
git-buildpackage) implicitly expects and requires this layout. The
packaging repo is also typically expected (but not required) to live on
salsa.debian.org, the official Debian development GitLab. Besides, most
upstream projects do not want to have these Debian-specific branches
cluttering their repo space.

Also, when you say I upload "your" Debian package, do you mean the
.debs? Because for something to get accepted into Debian, we must upload
only _source_ packages, not binary packages. Each package must be built
on Debian servers from source.

So... No offense, but I don't think merging my packaging code into yours
is the best idea. What do you think about the following, more
established approach?

0) You clone the salsa repository [2] locally and keep it in sync with
the version on salsa.

1) You release a new version via GitHub. That means there will be a new
release tarball to download via uscan.

2) You import the new tarball into your local packaging repo with 'gbp
import-orig', update debian/changelog to reflect the new version, update
debian/copyright if there are new files, refresh patches, etc.

3) You build a new package with git-buildpackage in a local chroot (e.g.
with sbuild or cowbuilder, ...) to make sure that everything builds
correctly.

4) You push your changeset to the salsa repo, tag a Debian version and
ping me to review the packaging. I will then build a source package,
sign it and upload it to be built on Debian's build farm for all platforms.

That is the workflow for releasing a new version; it would of course be
similar for other updates (bug fixes in the packaging, etc.). I would
make sure you get all the necessary permissions to work on the salsa
repository.
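
As a rough sketch, steps 0) and 2) to 4) could look something like this
on the command line. The function name, the version argument, the commit
message and the use of cowbuilder via --git-pbuilder are only
assumptions for illustration, not a fixed recipe. By default the script
just prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch only: run from inside a local clone of the packaging repository.
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# release_workflow is a hypothetical helper; $1 is the new upstream version.
release_workflow() {
    version="$1"

    # 0) keep the local clone in sync with salsa
    run gbp pull || return 1

    # 2) fetch the new tarball via uscan, import it, update the changelog
    #    (debian/copyright and the patches still need a manual look)
    run gbp import-orig --uscan --pristine-tar || return 1
    run dch -v "${version}-1" "New upstream release." || return 1
    run git commit -m "Update changelog for ${version}-1" debian/changelog || return 1

    # 3) test build in a clean chroot (assumes a cowbuilder setup)
    run gbp buildpackage --git-pbuilder || return 1

    # 4) tag the Debian version and push branches and tags to salsa
    run gbp buildpackage --git-tag-only || return 1
    run git push --all origin || return 1
    run git push --tags origin || return 1
}
```

Calling e.g. 'release_workflow 4.0.1' (a made-up version number) prints
the commands in order, so you can check them before setting DRY_RUN=0.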

What do you think? I know that this would mean moving the Debian
packaging workflow outside of your Arrow repository, but I think it
would make life easier in the long run.

Another option would be that you just send me a source package for each
version you'd like to see uploaded (*.orig.tar.gz, *.dsc and
*.debian.tar.xz) and I would use that for review and upload. But then
any change that I might want to make would eventually need to be fed
back into your upstream repository, and I think we can do without the
extra round-trip if we keep everything Debian-related in one place.

[...]
>> Is the LICENSE.txt in the Arrow source root directory complete and lists
>> _all_ third-party licenses and copyright holders in the release tarball?
> 
> No. Most of them are covered but some of them only exists in
> source code such as
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/mman.h .

Okay, looks like I'd have to actually look through everything, gathering
and documenting licenses. Might take a while :D
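
For a first pass I would probably just list every file that mentions a
copyright or a license and then go through them by hand. A minimal
sketch, assuming GNU grep; the function name is made up, and the pattern
is only a heuristic that will both miss things and produce false
positives:

```shell
# list_license_mentions is a hypothetical helper: print every text file
# below directory $1 that mentions a copyright or a license.
list_license_mentions() {
    # -r recurse, -I skip binary files, -l print file names only,
    # -i ignore case, -E extended regular expressions
    grep -rIliE 'copyright|licen[cs]e' "$1" | sort
}
```

Running it on the unpacked release tarball would give me a list of
candidates to compare against LICENSE.txt.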

Thanks
Sascha


[0]
https://packages.debian.org/search?suite=default&section=all&arch=any&searchon=names&keywords=arrow
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=970021
[2] https://salsa.debian.org/satta/arrow
diff -ru ./control /home/satta/pkg/dcso/arrow/arrow/debian/control
--- ./control	2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/control	2021-05-18 22:44:16.705459282 +0200
@@ -4,7 +4,7 @@
 Maintainer: Apache Arrow Developers <dev@arrow.apache.org>
 Build-Depends:
   cmake,
-  debhelper (>= 12),
+  debhelper (>= 13),
   git,
   gobject-introspection,
   gtk-doc-tools,
@@ -26,11 +26,21 @@
   ninja-build,
   nvidia-cuda-toolkit [!arm64],
   pkg-config,
+  libprotobuf-dev,
+  libprotoc-dev,
   protobuf-compiler-grpc,
   python3-dev,
   python3-numpy,
+  rapidjson-dev,
+  libjemalloc-dev,
+  libthrift-dev,
   tzdata,
-  zlib1g-dev
+  zlib1g-dev,
+  clang,
+  clang-tidy,
+  llvm,
+  llvm-dev,
+  meson
 Build-Depends-Indep: libglib2.0-doc
 Standards-Version: 3.9.8
 Homepage: https://arrow.apache.org/
@@ -53,9 +63,9 @@
 Multi-Arch: same
 Pre-Depends: ${misc:Pre-Depends}
 Depends:
-  ${misc:Depends},
-  ${shlibs:Depends},
-  libarrow400 (= ${binary:Version})
+ ${misc:Depends},
+ ${shlibs:Depends},
+ libarrow400 (= ${binary:Version})
 Description: Apache Arrow is a data processing library for analysis
  .
  This package provides C++ library files for CUDA support.
@@ -133,6 +143,9 @@
   libutf8proc-dev,
   libzstd-dev,
   protobuf-compiler-grpc,
+  libprotobuf-dev,
+  libprotoc-dev,
+  libthrift-dev,
   zlib1g-dev
 Description: Apache Arrow is a data processing library for analysis
  .
@@ -143,9 +156,9 @@
 Architecture: i386 amd64
 Multi-Arch: same
 Depends:
-  ${misc:Depends},
-  libarrow-dev (= ${binary:Version}),
-  libarrow-cuda400 (= ${binary:Version})
+ ${misc:Depends},
+ libarrow-dev (= ${binary:Version}),
+ libarrow-cuda400 (= ${binary:Version})
 Description: Apache Arrow is a data processing library for analysis
  .
  This package provides C++ header files for CUDA support.
@@ -206,11 +219,10 @@
 Multi-Arch: same
 Pre-Depends: ${misc:Pre-Depends}
 Depends:
-  ${misc:Depends},
-  ${shlibs:Depends},
-  libarrow400 (= ${binary:Version})
-Description: Gandiva is a toolset for compiling and evaluating expressions
- on Arrow Data.
+ ${misc:Depends},
+ ${shlibs:Depends},
+ libarrow400 (= ${binary:Version})
+Description: Gandiva is a toolset for compiling and evaluating expressions on Arrow Data.
  .
  This package provides C++ library files.
 
@@ -222,8 +234,7 @@
   ${misc:Depends},
   libarrow-dev (= ${binary:Version}),
   libgandiva400 (= ${binary:Version})
-Description: Gandiva is a toolset for compiling and evaluating expressions
- on Arrow Data.
+Description: Gandiva is a toolset for compiling and evaluating expressions on Arrow Data.
  .
  This package provides C++ header files.
 
diff -ru ./libarrow-dev.install /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-dev.install
--- ./libarrow-dev.install	2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-dev.install	2021-05-18 22:01:42.307671397 +0200
@@ -7,6 +7,7 @@
 usr/lib/*/cmake/arrow/FindBrotli.cmake
 usr/lib/*/cmake/arrow/FindLz4.cmake
 usr/lib/*/cmake/arrow/FindSnappy.cmake
+usr/lib/*/cmake/arrow/FindThrift.cmake
 usr/lib/*/cmake/arrow/Findutf8proc.cmake
 usr/lib/*/cmake/arrow/Findzstd.cmake
 usr/lib/*/cmake/arrow/arrow-config.cmake
@@ -17,5 +18,5 @@
 usr/lib/*/pkgconfig/arrow-csv.pc
 usr/lib/*/pkgconfig/arrow-filesystem.pc
 usr/lib/*/pkgconfig/arrow-json.pc
-usr/lib/*/pkgconfig/arrow-orc.pc
+#usr/lib/*/pkgconfig/arrow-orc.pc
 usr/lib/*/pkgconfig/arrow.pc
diff -ru ./libarrow-glib-dev.install /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-glib-dev.install
--- ./libarrow-glib-dev.install	2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/libarrow-glib-dev.install	2021-05-18 21:36:48.805671734 +0200
@@ -1,6 +1,6 @@
 usr/include/arrow-glib/
 usr/lib/*/libarrow-glib.so
 usr/lib/*/pkgconfig/arrow-glib.pc
-usr/lib/*/pkgconfig/arrow-orc-glib.pc
+#usr/lib/*/pkgconfig/arrow-orc-glib.pc
 usr/share/arrow-glib/example/
 usr/share/gir-1.0/Arrow-1.0.gir
Only in /home/satta/pkg/dcso/arrow/arrow/debian/patches: fix-llvm-requirement.patch
diff -ru ./patches/series /home/satta/pkg/dcso/arrow/arrow/debian/patches/series
--- ./patches/series	2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/patches/series	2021-05-16 15:28:28.184586150 +0200
@@ -0,0 +1 @@
+fix-llvm-requirement.patch
diff -ru ./rules /home/satta/pkg/dcso/arrow/arrow/debian/rules
--- ./rules	2021-04-21 18:11:38.000000000 +0200
+++ /home/satta/pkg/dcso/arrow/arrow/debian/rules	2021-05-18 21:20:26.778902537 +0200
@@ -14,28 +14,28 @@
 	dh $@ --with gir
 
 override_dh_auto_configure:
-	if dpkg -l nvidia-cuda-toolkit > /dev/null 2>&1; then	\
-	  ARROW_CUDA=ON;					\
-	  ARROW_PLASMA=ON;					\
-	else							\
-	  ARROW_CUDA=OFF;					\
-	  ARROW_PLASMA=OFF;					\
-	fi;							\
-	dh_auto_configure					\
+	dh_auto_configure						\
 	  --sourcedirectory=cpp					\
-	  --builddirectory=cpp_build				\
+	  --builddirectory=cpp_build			\
 	  --buildsystem=cmake+ninja				\
-	  --							\
-	  -DARROW_CUDA=$${ARROW_CUDA}				\
-	  -DARROW_FLIGHT=ON					\
+	  --									\
+	  -DARROW_CUDA=ON						\
+	  -DARROW_CSV=ON 						\
+	  -DARROW_COMPUTE=ON 					\
+	  -DARROW_DATASET=ON 					\
+	  -DARROW_DEPENDENCY_SOURCE=SYSTEM		\
+	  -DARROW_JSON=ON 						\
+	  -DARROW_FILESYSTEM=ON					\
+	  -DARROW_FLIGHT=ON						\
 	  -DARROW_GANDIVA=ON					\
 	  -DARROW_GANDIVA_JAVA=OFF				\
-	  -DARROW_MIMALLOC=ON					\
-	  -DARROW_ORC=ON					\
+	  -DARROW_JEMALLOC=ON					\
+	  -DARROW_MIMALLOC=OFF					\
+	  -DARROW_ORC=OFF						\
 	  -DARROW_PARQUET=ON					\
-	  -DARROW_PLASMA=$${ARROW_PLASMA}			\
-	  -DARROW_PYTHON=ON					\
-	  -DARROW_S3=ON						\
+	  -DARROW_PLASMA=ON						\
+	  -DARROW_PYTHON=ON						\
+	  -DARROW_S3=OFF						\
 	  -DARROW_USE_CCACHE=OFF				\
 	  -DARROW_WITH_BROTLI=ON				\
 	  -DARROW_WITH_BZ2=ON					\
@@ -43,11 +43,7 @@
 	  -DARROW_WITH_SNAPPY=ON				\
 	  -DARROW_WITH_ZLIB=ON					\
 	  -DARROW_WITH_ZSTD=ON					\
-	  -DCMAKE_BUILD_TYPE=$(BUILD_TYPE)			\
-	  -DCMAKE_UNITY_BUILD=ON				\
-	  -DPARQUET_REQUIRE_ENCRYPTION=ON			\
-	  -DPythonInterp_FIND_VERSION=ON			\
-	  -DPythonInterp_FIND_VERSION_MAJOR=3
+	  -DCMAKE_BUILD_TYPE=$(BUILD_TYPE)
 
 override_dh_auto_build:
 	dh_auto_build				\
@@ -88,15 +84,7 @@
 	  --builddirectory=cpp_build
 
 override_dh_auto_test:
-	# TODO: We need Boost 1.64 or later to build tests for
-	# Apache Arrow Flight.
-	# git clone --depth 1 https://github.com/apache/arrow-testing.git
-	# git clone --depth 1 https://github.com/apache/parquet-testing.git
-	# cd cpp_build &&								\
-	#   env									\
-	#     ARROW_TEST_DATA=$(CURDIR)/arrow-testing/data			\
-	#     PARQUET_TEST_DATA=$(CURDIR)/parquet-testing/data			\
-	#       ctest --exclude-regex 'arrow-cuda-test|plasma-client_tests'
+	# pass
 
 # skip file failing with "Unknown DWARF DW_OP_255" (see bug#949296)
 override_dh_dwz:
