Dear Apache Kudu and Apache Impala (incubating) communities,

(I'm not sure of the best way to have a cross-list discussion, so I
apologize if this does not work well.)

On the recent Apache Parquet sync call, we discussed C++ code sharing
between the codebases in Apache Arrow and Apache Parquet, and
opportunities for more code sharing with Kudu and Impala as well.

As context:

* We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
first C++ release within Apache Parquet. I got involved with this
project a little over a year ago and was faced with the unpleasant
decision to copy and paste a significant amount of code out of
Impala's codebase to bootstrap the project.

* In parallel, we began the Apache Arrow project, which is designed to
be a complementary library for file formats (like Parquet), storage
engines (like Kudu), and compute engines (like Impala and pandas).

* As Arrow and parquet-cpp matured, an increasing amount of code
overlap crept in around buffer memory management and the IO
interfaces (see the sketch after these context points). We recently
decided in PARQUET-818
(https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
to remove some of the obvious code overlap in Parquet and make
libarrow.a/so a hard compile- and link-time dependency for
libparquet.a/so.

* There is still quite a bit of code in parquet-cpp that would be a
better fit in Arrow: SIMD hash utilities, RLE encoding, dictionary
encoding, compression, bit utilities, and so forth. Much of this code
originated from Impala.
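
To make the buffer and IO overlap mentioned above concrete, here is a
rough sketch of the kind of shared abstraction involved: a lightweight
view of contiguous memory plus a random-access read interface that a
file-format library can consume. The class names and signatures below
are illustrative stand-ins, not the exact Arrow API.

// Illustrative only -- the rough shape of the shared abstractions,
// not the exact Arrow API. The idea is that a format library like
// parquet-cpp consumes buffers and IO interfaces defined once in
// Arrow instead of maintaining its own copies.
#include <cstdint>
#include <memory>

// An immutable view of a contiguous memory region (in the spirit of
// arrow::Buffer).
class Buffer {
 public:
  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}
  const uint8_t* data() const { return data_; }
  int64_t size() const { return size_; }

 private:
  const uint8_t* data_;
  int64_t size_;
};

// A minimal random-access reader interface (in the spirit of the
// arrow::io interfaces), which a Parquet file reader could consume.
class RandomAccessReader {
 public:
  virtual ~RandomAccessReader() = default;
  virtual int64_t Size() const = 0;
  virtual std::shared_ptr<Buffer> ReadAt(int64_t position,
                                         int64_t nbytes) = 0;
};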

This brings me to my next set of points:

* parquet-cpp contains quite a bit of code that was extracted from
Impala. This is mostly self-contained in
https://github.com/apache/parquet-cpp/tree/master/src/parquet/util

* My understanding is that Kudu extracted certain computational
utilities from Impala in its early days, but these tools have likely
diverged as the needs of the projects have evolved.

Since all of these projects are quite different in their end goals
(runtime systems vs. libraries), touching code that is tightly coupled
to Kudu's or Impala's runtime is probably not worth discussing.
However, I think there is a strong basis for collaboration on
computational utilities and vectorized array processing. Some obvious
areas that come to mind:

* SIMD utilities (for hashing or processing of preallocated contiguous memory)
* Array encoding utilities: RLE / Dictionary, etc. (a small sketch
follows this list)
* Bit manipulation (packing and unpacking; e.g., Daniel Lemire
contributed a patch to parquet-cpp in this area)
* Date and time utilities
* Compression utilities
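
To make the encoding point above concrete, here is a deliberately
simplified sketch of run-length encoding. The Impala-derived encoder
in parquet-cpp combines RLE with bit-packed literal runs, so treat
this only as an illustration of the kind of self-contained utility
that could live in one shared place:

#include <cstdint>
#include <utility>
#include <vector>

// Encode values as (value, run length) pairs.
std::vector<std::pair<int32_t, uint32_t>> RleEncode(
    const std::vector<int32_t>& values) {
  std::vector<std::pair<int32_t, uint32_t>> runs;
  for (int32_t v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;
    } else {
      runs.emplace_back(v, 1);
    }
  }
  return runs;
}

// Expand (value, run length) pairs back into a flat vector.
std::vector<int32_t> RleDecode(
    const std::vector<std::pair<int32_t, uint32_t>>& runs) {
  std::vector<int32_t> values;
  for (const auto& run : runs) {
    values.insert(values.end(), run.second, run.first);
  }
  return values;
}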

I hope the benefits are obvious: consolidating efforts on unit
testing, benchmarking, performance optimizations, continuous
integration, and platform compatibility.

Logistically speaking, one possible avenue might be to use Apache
Arrow as the place to assemble this code. Its third-party toolchain is
small, and it builds and installs quickly. It is intended as a library
whose headers are included by, and which is linked against, other
applications. (As an aside, I'm very interested in building optional
support for Arrow columnar messages into the Kudu client.)

The downside of code sharing, which may have prevented it so far, is
the logistics of coordinating ASF release cycles and keeping build
toolchains in sync. It's taken us the past year to stabilize the
design of Arrow for its intended use cases, so at this point, if we
went down this road, I would be OK with helping the community commit
to a regular release cadence faster than Impala's, Kudu's, and
Parquet's respective release cadences. Since members of the Kudu and
Impala PMCs are also on the Arrow PMC, I trust we would be able to
collaborate to our mutual benefit and success.

Note that Arrow does not throw C++ exceptions and follows the Google
C++ style guide to much the same extent as Kudu and Impala do.
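
For anyone unfamiliar with that convention, here is a rough sketch of
what the no-exceptions idiom looks like in practice: functions report
failure through a returned Status object rather than by throwing. The
Status class and DecodeInt32s function below are illustrative
stand-ins, not the actual Arrow or Kudu types:

#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

// Stand-in Status type: either OK or an error carrying a message.
class Status {
 public:
  static Status OK() { return Status(); }
  static Status Invalid(std::string msg) { return Status(std::move(msg)); }
  bool ok() const { return ok_; }
  const std::string& message() const { return msg_; }

 private:
  Status() : ok_(true) {}
  explicit Status(std::string msg) : ok_(false), msg_(std::move(msg)) {}
  bool ok_;
  std::string msg_;
};

// A utility in this style reports failure through its return value
// and writes its result through an out-parameter, never by throwing.
Status DecodeInt32s(const std::vector<uint8_t>& input,
                    std::vector<int32_t>* out) {
  if (input.size() % sizeof(int32_t) != 0) {
    return Status::Invalid("input length is not a multiple of 4 bytes");
  }
  out->resize(input.size() / sizeof(int32_t));
  if (!input.empty()) {
    std::memcpy(out->data(), input.data(), input.size());
  }
  return Status::OK();
}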

If this is something that either the Kudu or Impala communities would
like to pursue in earnest, I would be happy to work with you on next
steps. I would suggest that we start with something small so that we
can address the necessary build toolchain changes and develop a
workflow for moving code and tests between codebases, a protocol for
code reviews (e.g. Gerrit), and a process for coordinating ASF
releases.

Let me know what you think.

best
Wes
