Re: Naming the new ValueVector Initiative

Wes McKinney Tue, 12 Jan 2016 19:00:07 -0800

On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <[email protected]> wrote:
>
>
> On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <[email protected]> wrote:
>>
>>
>> >
>> > As far as the existing work is concerned, I'm not sure everyone is aware of
>> > the C++ code inside of Drill that can represent at least the scalar types in
>> > Drill's existing Value Vectors [1]. This is currently used by the native
>> > client written to hook up an ODBC driver.
>> >
>>
>> I have read this code. From my perspective, it would be less work to
>> collaborate on a self-contained implementation that closely models the
>> Arrow / VV spec that includes builder classes and its own memory
>> management without coupling to Drill details. I started prototyping
>> something here (warning: only a few actual days of coding here):
>>
>> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>
>> For example, the tests here construct an Array<Int32> or String
>> (== Array<UInt8>) column:
>>
>> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>
>> I've been planning to use this as the basis of a C++ Parquet
>> reader-writer and the associated Python pandas-like layer which
>> includes in-memory analytics on Arrow data structures.
>>
>> > Parth, who is included here, has been the primary owner of this C++ code
>> > throughout its life in Drill. Parth, what do you think is the best strategy
>> > for managing the C++ code right now? As the C++ build is not tied into the
>> > Java one, as I understand it we just run it manually when updates are made
>> > there and we need to update ODBC. Would it be disruptive to move the code to
>> > the arrow repo? If so, we could include Drill as a submodule in the new
>> > repo, or put Wes's work so far in the Drill repo.
>>
>> If we can enumerate the non-Drill-client related parts (i.e. the array
>> accessors and data structures-oriented code) that would make sense in
>> a standalone Arrow library it would be great to start a side
>> discussion about the design of the C++ reference implementation
>> (metadata / schemas, IPC, array builders and accessors, etc.). Since
>> this is quite urgent for me (intending to deliver a minimally viable
>> pandas-like Arrow + Parquet in Python stack in the next ~3 months) it
>> would be great to do this sooner rather than later.
>>
>
> Most of the code for Drill C++ Value Vectors is independent of Drill -
> mostly the code up to line 787 in this file -
> https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>
> My thought was to leave the Drill implementation alone and borrow copiously
> from it when convenient for Arrow. Seems like we can still do that building
> on Wes' work.
>


Makes sense. Speaking of code, would you all like me to set up a
temporary repo for the specification itself? I already have a few
questions like how and where to track array null counts.
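
To make the null count question concrete, here is a rough Python sketch of
the two options: keep a running count as values are appended, or recompute it
from the validity bitmap on demand. Everything in it (the Int32Builder class,
its fields, the least-significant-bit-first bitmap layout, and
count_nulls_from_bitmap) is hypothetical and only for illustration; it is not
the arrow-cpp API or anything the spec has settled.

```python
class Int32Builder:
    """Hypothetical builder: values, a validity bitmap, and a running null count."""

    def __init__(self):
        self.values = []             # a placeholder 0 is stored in null slots
        self.validity = bytearray()  # one bit per slot, least-significant bit first (assumed layout)
        self.length = 0
        self.null_count = 0          # option 1: tracked eagerly at append time

    def append(self, value):
        # Grow the bitmap by one byte every eight slots.
        if self.length % 8 == 0:
            self.validity.append(0)
        if value is None:
            self.null_count += 1
            self.values.append(0)
        else:
            if not -2**31 <= value < 2**31:
                raise ValueError("value does not fit in int32")
            # Set the validity bit for this slot.
            self.validity[self.length // 8] |= 1 << (self.length % 8)
            self.values.append(value)
        self.length += 1

    def is_null(self, i):
        if not 0 <= i < self.length:
            raise IndexError(i)
        return ((self.validity[i // 8] >> (i % 8)) & 1) == 0


def count_nulls_from_bitmap(builder):
    # Option 2: derive the count lazily by scanning the validity bitmap.
    return sum(1 for i in range(builder.length) if builder.is_null(i))
```

Tracking the count eagerly costs one integer per array and makes "does this
array have any nulls?" an O(1) check; deriving it lazily keeps the metadata
smaller but requires a pass over the bitmap.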

> Wes, let me know if you want to have a quick hangout on this.
>

Sure, I'll follow up separately to get something on the calendar.
Looking forward to connecting!

> Parth
>
>
