hi Miki,

No, I don't think so. APR is a portable C library. The code we are talking about would be intended for use in C++11/14 projects like Impala and Kudu (and Arrow and Parquet).
Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka <miki.teb...@gmail.com> wrote:
> Can't some (most) of it be added to APR <https://apr.apache.org/>?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet. I got involved with this
>> >> project a little over a year ago and was faced with the unpleasant
>> >> decision to copy and paste a significant amount of code out of
>> >> Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we began the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept up surrounding buffer memory management and IO
>> >> interface. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala.
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing. Some obvious
>> >> areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> >> small, and it builds and installs fast. It is intended as a library to
>> >> have its headers used and linked against other applications. (As an
>> >> aside, I'm very interested in building optional support for Arrow
>> >> columnar messages into the Kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no sense
>> > to add a threading library to Arrow if it was never used natively. Muddying
>> > the waters of the project's charter seems likely to lead to user, and
>> > developer, confusion. Similarly, we should not necessarily couple Arrow's
>> > design goals to those it inherits from Kudu and Impala's source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, is
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions and similarly follows
>> >> the Google C++ style guide to the same extent as Kudu and Impala.
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including it
>> > in your build to be ~0 - that's a non-starter to me, at least with how the
>> > toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism. That way the coupling is
>> > looser, and we can move more quickly. I think that's important to other
>> > projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>
>>