Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Wes McKinney Sun, 26 Feb 2017 11:25:55 -0800

hi Miki,

No, I don't think so. APR is a portable C library. The code we are
talking about would be intended for use in C++11/14 projects like
Impala and Kudu (and Arrow and Parquet).


Wes

On Sun, Feb 26, 2017 at 1:58 PM, Miki Tebeka <miki.teb...@gmail.com> wrote:
> Can't some (most) of it be added to APR <https://apr.apache.org/>?
>
> On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Henry,
>>
>> Thank you for these comments.
>>
>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>> ideal (though perhaps initially more labor intensive) solution.
>> There's code in Arrow that I would move into this project if it
>> existed. I am happy to help make this happen if there is interest from
>> the Kudu and Impala communities. I am not sure logistically what would
>> be the most expedient way to establish the project, whether as an ASF
>> Incubator project or possibly as a new TLP that could be created by
>> spinning IP out of Apache Kudu.
>>
>> I'm interested to hear the opinions of others, and possible next steps.
>>
>> Thanks
>> Wes
>>
>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
>> > Thanks for bringing this up, Wes.
>> >
>> > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
>> >
>> >> Dear Apache Kudu and Apache Impala (incubating) communities,
>> >>
>> >> (I'm not sure the best way to have a cross-list discussion, so I
>> >> apologize if this does not work well)
>> >>
>> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
>> >> between the codebases in Apache Arrow and Apache Parquet, and
>> >> opportunities for more code sharing with Kudu and Impala as well.
>> >>
>> >> As context:
>> >>
>> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>> >> first C++ release within Apache Parquet. I got involved with this
>> >> project a little over a year ago and was faced with the unpleasant
>> >> decision to copy and paste a significant amount of code out of
>> >> Impala's codebase to bootstrap the project.
>> >>
>> >> * In parallel, we began the Apache Arrow project, which is designed to
>> >> be a complementary library for file formats (like Parquet), storage
>> >> engines (like Kudu), and compute engines (like Impala and pandas).
>> >>
>> >> * As Arrow and parquet-cpp matured, an increasing amount of code
>> >> overlap crept in around buffer memory management and IO
>> >> interfaces. We recently decided in PARQUET-818
>> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>> >> to remove some of the obvious code overlap in Parquet and make
>> >> libarrow.a/so a hard compile and link-time dependency for
>> >> libparquet.a/so.
>> >>
>> >> * There is still quite a bit of code in parquet-cpp that would better
>> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>> >> compression, bit utilities, and so forth. Much of this code originated
>> >> from Impala.
>> >>
>> >> This brings me to a next set of points:
>> >>
>> >> * parquet-cpp contains quite a bit of code that was extracted from
>> >> Impala. This is mostly self-contained in
>> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>> >>
>> >> * My understanding is that Kudu extracted certain computational
>> >> utilities from Impala in its early days, but these tools have likely
>> >> diverged as the needs of the projects have evolved.
>> >>
>> >> Since all of these projects are quite different in their end goals
>> >> (runtime systems vs. libraries), touching code that is tightly coupled
>> >> to either Kudu or Impala's runtimes is probably not worth discussing.
>> >> However, I think there is a strong basis for collaboration on
>> >> computational utilities and vectorized array processing. Some obvious
>> >> areas that come to mind:
>> >>
>> >> * SIMD utilities (for hashing or processing of preallocated contiguous
>> >> memory)
>> >> * Array encoding utilities: RLE / Dictionary, etc.
>> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
>> >> contributed a patch to parquet-cpp around this)
>> >> * Date and time utilities
>> >> * Compression utilities
>> >>
>> >
>> > Between Kudu and Impala (at least) there are many more opportunities for
>> > sharing. Threads, logging, metrics, concurrent primitives - the list is
>> > quite long.
>> >
>> >
>> >>
>> >> I hope the benefits are obvious: consolidating efforts on unit
>> >> testing, benchmarking, performance optimizations, continuous
>> >> integration, and platform compatibility.
>> >>
>> >> Logistically speaking, one possible avenue might be to use Apache
>> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
>> >> small, and it builds and installs fast. It is intended to be used as
>> >> a library, with its headers included and its binaries linked by
>> >> other applications. (As an aside, I'm very interested in building
>> >> optional support for Arrow columnar messages into the Kudu client).
>> >>
>> >
>> > In principle I'm in favour of code sharing, and it seems very much in
>> > keeping with the Apache way. However, practically speaking I'm of the
>> > opinion that it only makes sense to house shared support code in a
>> > separate, dedicated project.
>> >
>> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
>> > of sharing to utilities that Arrow is interested in. It would make no
>> > sense to add a threading library to Arrow if it was never used natively.
>> > Muddying the waters of the project's charter seems likely to lead to
>> > user, and developer, confusion. Similarly, we should not necessarily
>> > couple Arrow's design goals to those it inherits from Kudu and Impala's
>> > source code.
>> >
>> > I think I'd rather see a new Apache project than re-use a current one for
>> > two independent purposes.
>> >
>> >
>> >>
>> >> The downside of code sharing, which may have prevented it so far, is
>> >> the logistics of coordinating ASF release cycles and keeping build
>> >> toolchains in sync. It's taken us the past year to stabilize the
>> >> design of Arrow for its intended use cases, so at this point if we
>> >> went down this road I would be OK with helping the community commit to
>> >> a regular release cadence that would be faster than Impala, Kudu, and
>> >> Parquet's respective release cadences. Since members of the Kudu and
>> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
>> >> collaborate to each other's mutual benefit and success.
>> >>
>> >> Note that Arrow does not throw C++ exceptions and similarly follows
>> >> the Google C++ style guide to the same extent as Kudu and Impala.
>> >>
>> >> If this is something that either the Kudu or Impala communities would
>> >> like to pursue in earnest, I would be happy to work with you on next
>> >> steps. I would suggest that we start with something small so that we
>> >> could address the necessary build toolchain changes, and develop a
>> >> workflow for moving around code and tests, a protocol for code reviews
>> >> (e.g. Gerrit), and coordinating ASF releases.
>> >>
>> >
>> > I think, if I'm reading this correctly, that you're assuming integration
>> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
>> > their toolchains. For something as fast moving as utility code - and
>> > critical, where you want the latency between adding a fix and including
>> > it in your build to be ~0 - that's a non-starter to me, at least with
>> > how the toolchains are currently realised.
>> >
>> > I'd rather have the source code directly imported into Impala's tree -
>> > whether by git submodule or other mechanism. That way the coupling is
>> > looser, and we can move more quickly. I think that's important to other
>> > projects as well.
>> >
>> > Henry
>> >
>> >
>> >
>> >>
>> >> Let me know what you think.
>> >>
>> >> best
>> >> Wes
>> >>
>>
