Thanks for bringing this up, Wes.

On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> Dear Apache Kudu and Apache Impala (incubating) communities,
>
> (I'm not sure the best way to have a cross-list discussion, so I apologize if this does not work well.)
>
> On the recent Apache Parquet sync call, we discussed C++ code sharing between the codebases in Apache Arrow and Apache Parquet, and opportunities for more code sharing with Kudu and Impala as well.
>
> As context:
>
> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the first C++ release within Apache Parquet. I got involved with this project a little over a year ago and was faced with the unpleasant decision to copy and paste a significant amount of code out of Impala's codebase to bootstrap the project.
>
> * In parallel, we began the Apache Arrow project, which is designed to be a complementary library for file formats (like Parquet), storage engines (like Kudu), and compute engines (like Impala and pandas).
>
> * As Arrow and parquet-cpp matured, an increasing amount of code overlap crept up around buffer memory management and IO interfaces. We recently decided in PARQUET-818 (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02) to remove some of the obvious code overlap in Parquet and make libarrow.a/so a hard compile- and link-time dependency for libparquet.a/so.
>
> * There is still quite a bit of code in parquet-cpp that would fit better in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding, compression, bit utilities, and so forth. Much of this code originated from Impala.
>
> This brings me to my next set of points:
>
> * parquet-cpp contains quite a bit of code that was extracted from Impala. This is mostly self-contained in https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>
> * My understanding is that Kudu extracted certain computational utilities from Impala in its early days, but these tools have likely diverged as the needs of the projects have evolved.
>
> Since all of these projects are quite different in their end goals (runtime systems vs. libraries), touching code that is tightly coupled to either Kudu's or Impala's runtime is probably not worth discussing. However, I think there is a strong basis for collaboration on computational utilities and vectorized array processing. Some obvious areas that come to mind:
>
> * SIMD utilities (for hashing or processing of preallocated contiguous memory)
> * Array encoding utilities: RLE / dictionary, etc.
> * Bit manipulation (packing and unpacking; e.g. Daniel Lemire contributed a patch to parquet-cpp around this)
> * Date and time utilities
> * Compression utilities

Between Kudu and Impala (at least) there are many more opportunities for sharing. Threads, logging, metrics, concurrency primitives - the list is quite long.

> I hope the benefits are obvious: consolidating efforts on unit testing, benchmarking, performance optimizations, continuous integration, and platform compatibility.
>
> Logistically speaking, one possible avenue might be to use Apache Arrow as the place to assemble this code. Its third-party toolchain is small, and it builds and installs fast. It is intended as a library whose headers are included and linked against by other applications. (As an aside, I'm very interested in building optional support for Arrow columnar messages into the kudu client.)

In principle I'm in favour of code sharing, and it seems very much in keeping with the Apache way.
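Most of the utilities in that list are small, self-contained routines, which is exactly what makes sharing them plausible. As a rough illustration only - this is a hedged sketch, not code lifted from Impala, Kudu, or parquet-cpp, and the names are made up - a run-length encoding helper of the kind being discussed is little more than:

    // Sketch only: encode a sequence as (value, run length) pairs. The actual
    // RLE code in these projects is bit-width-aware and works on packed
    // buffers, but it has the same self-contained shape.
    #include <cstdint>
    #include <utility>
    #include <vector>

    template <typename T>
    std::vector<std::pair<T, uint32_t>> RleEncode(const std::vector<T>& values) {
      std::vector<std::pair<T, uint32_t>> runs;
      for (const T& v : values) {
        if (!runs.empty() && runs.back().first == v) {
          ++runs.back().second;     // extend the current run
        } else {
          runs.emplace_back(v, 1);  // start a new run
        }
      }
      return runs;
    }

Code like that has no dependence on either project's runtime, so the obstacles to sharing it are organisational rather than technical.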
However, practically speaking I'm of the opinion that it only makes sense to house shared support code in a separate, dedicated project. Embedding the shared libraries in, e.g., Arrow naturally limits the scope of sharing to utilities that Arrow is interested in. It would make no sense to add a threading library to Arrow if it were never used natively. Muddying the waters of the project's charter seems likely to lead to confusion for both users and developers. Similarly, we should not necessarily couple Arrow's design goals to those it inherits from Kudu's and Impala's source code. I think I'd rather see a new Apache project than re-use a current one for two independent purposes.

> The downside of code sharing, which may have prevented it so far, is the logistics of coordinating ASF release cycles and keeping build toolchains in sync. It's taken us the past year to stabilize the design of Arrow for its intended use cases, so at this point, if we went down this road, I would be OK with helping the community commit to a regular release cadence that would be faster than Impala, Kudu, and Parquet's respective release cadences. Since members of the Kudu and Impala PMCs are also on the Arrow PMC, I trust we would be able to collaborate to each other's mutual benefit and success.
>
> Note that Arrow does not throw C++ exceptions and, similarly, follows the Google C++ style guide to the same extent as Kudu and Impala.
>
> If this is something that either the Kudu or Impala communities would like to pursue in earnest, I would be happy to work with you on next steps. I would suggest that we start with something small so that we could address the necessary build toolchain changes, develop a workflow for moving code and tests around, a protocol for code reviews (e.g. Gerrit), and a process for coordinating ASF releases.

I think, if I'm reading this correctly, that you're assuming integration with the 'downstream' projects (e.g. Impala and Kudu) would be done via their toolchains. For something as fast-moving as utility code - and as critical, where you want the latency between adding a fix and including it in your build to be ~0 - that's a non-starter to me, at least with how the toolchains are currently realised. I'd rather have the source code directly imported into Impala's tree - whether by git submodule or other mechanism. That way the coupling is looser, and we can move more quickly. I think that's important to other projects as well.

Henry

> Let me know what you think.
>
> best
> Wes
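A brief illustration of the exceptions point above, for readers outside these codebases: errors in Arrow, Kudu, and Impala are carried in Status-style return values rather than thrown. The sketch below shows only the shape of that convention - the names are made up, and this is not the actual arrow::Status or kudu::Status API.

    // Sketch of Status-style error handling (made-up names, not the real API).
    // Functions return a Status; callers check ok() instead of catching.
    #include <string>
    #include <utility>

    class Status {
     public:
      static Status OK() { return Status(); }
      static Status IOError(std::string msg) { return Status(std::move(msg)); }
      bool ok() const { return msg_.empty(); }
      const std::string& message() const { return msg_; }

     private:
      Status() = default;
      explicit Status(std::string msg) : msg_(std::move(msg)) {}
      std::string msg_;
    };

    Status ReadBlock(const std::string& path) {
      if (path.empty()) return Status::IOError("empty path");
      // ... read the block ...
      return Status::OK();
    }

Any shared utility code would presumably need to follow the same convention so that all three projects could consume it unchanged.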