Dear Apache Kudu and Apache Impala (incubating) communities,

(I'm not sure of the best way to have a cross-list discussion, so I apologize if this does not work well.)
On the recent Apache Parquet sync call, we discussed C++ code sharing between the codebases in Apache Arrow and Apache Parquet, and opportunities for more code sharing with Kudu and Impala as well.

As context:

* We have an RC out for the 1.0.0 release of apache-parquet-cpp, the first C++ release within Apache Parquet. I got involved with this project a little over a year ago and was faced with the unpleasant decision to copy and paste a significant amount of code out of Impala's codebase to bootstrap the project.

* In parallel, we began the Apache Arrow project, which is designed to be a complementary library for file formats (like Parquet), storage engines (like Kudu), and compute engines (like Impala and pandas).

* As Arrow and parquet-cpp matured, an increasing amount of code overlap crept up surrounding buffer memory management and IO interfaces. We recently decided in PARQUET-818 (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02) to remove some of the obvious code overlap in Parquet and make libarrow.a/.so a hard compile- and link-time dependency for libparquet.a/.so.

* There is still quite a bit of code in parquet-cpp that would be a better fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding, compression, bit utilities, and so forth. Much of this code originated from Impala.

This brings me to my next set of points:

* parquet-cpp contains quite a bit of code that was extracted from Impala. This is mostly self-contained in https://github.com/apache/parquet-cpp/tree/master/src/parquet/util

* My understanding is that Kudu extracted certain computational utilities from Impala in its early days, but these tools have likely diverged as the needs of the projects have evolved.

Since all of these projects are quite different in their end goals (runtime systems vs. libraries), touching code that is tightly coupled to either Kudu's or Impala's runtime is probably not worth discussing. However, I think there is a strong basis for collaboration on computational utilities and vectorized array processing. Some obvious areas that come to mind:

* SIMD utilities (for hashing or processing of preallocated contiguous memory)
* Array encoding utilities: RLE, dictionary, etc. (a small illustrative sketch of the kind of helper I have in mind is in the postscript below)
* Bit manipulation (packing and unpacking; e.g., Daniel Lemire contributed a patch to parquet-cpp around this)
* Date and time utilities
* Compression utilities

I hope the benefits are obvious: consolidating efforts on unit testing, benchmarking, performance optimizations, continuous integration, and platform compatibility.

Logistically speaking, one possible avenue might be to use Apache Arrow as the place to assemble this code. Its thirdparty toolchain is small, and it builds and installs quickly. It is intended to be a library whose headers and compiled libraries other applications can include and link against. (As an aside, I'm very interested in building optional support for Arrow columnar messages into the Kudu client.)

The downside of code sharing, which may have prevented it so far, is the logistics of coordinating ASF release cycles and keeping build toolchains in sync. It's taken us the past year to stabilize the design of Arrow for its intended use cases, so at this point, if we went down this road, I would be OK with helping the community commit to a regular release cadence that would be faster than Impala's, Kudu's, and Parquet's respective release cadences.

Since members of the Kudu and Impala PMCs are also on the Arrow PMC, I trust we would be able to collaborate to each other's mutual benefit and success.
Note that Arrow does not throw C++ exceptions and follows the Google C++ style guide to a similar extent as Kudu and Impala.

If this is something that either the Kudu or Impala communities would like to pursue in earnest, I would be happy to work with you on next steps. I would suggest that we start with something small, so that we can address the necessary build toolchain changes, develop a workflow for moving code and tests between codebases, agree on a protocol for code reviews (e.g. Gerrit), and coordinate ASF releases.

Let me know what you think.

best
Wes
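
P.S. To make the "array encoding utilities" item above a little more concrete, here is a minimal, purely illustrative sketch of the sort of self-contained helper I am imagining living in a shared library. The name and signature are hypothetical and deliberately simplified; the actual code in Impala and parquet-cpp is a bit-packed RLE/dictionary hybrid and considerably more involved.

    // rle_sketch.cc -- illustrative only; the names and signatures here are
    // hypothetical and are NOT the actual Impala / parquet-cpp RLE code,
    // which is a bit-packed RLE/dictionary hybrid.
    #include <cstdint>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Collapse a sequence of values into (value, run length) pairs.
    std::vector<std::pair<int32_t, uint32_t>> RleEncode(
        const std::vector<int32_t>& values) {
      std::vector<std::pair<int32_t, uint32_t>> runs;
      for (int32_t v : values) {
        if (!runs.empty() && runs.back().first == v) {
          ++runs.back().second;   // extend the current run
        } else {
          runs.emplace_back(v, 1);  // start a new run
        }
      }
      return runs;
    }

    int main() {
      // {7, 7, 7, 3, 3, 9} -> (7 x 3), (3 x 2), (9 x 1)
      const std::vector<int32_t> values = {7, 7, 7, 3, 3, 9};
      for (const auto& run : RleEncode(values)) {
        std::cout << run.first << " x " << run.second << "\n";
      }
      return 0;
    }

The point is less the algorithm itself than the shape of the code: small, dependency-free utilities like this are cheap to unit test, benchmark, and keep portable, which is what makes them good candidates for sharing across projects.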