Re: Naming the new ValueVector Initiative

Reynold Xin Wed, 20 Jan 2016 15:30:24 -0800

Thanks and great job driving this, Jacques!


On Wed, Jan 20, 2016 at 3:28 PM, Jacques Nadeau <[email protected]> wrote:

> Hey Everyone,
>
> Good news! The Apache board has approved Apache Arrow as a new TLP.
> I've asked the Apache INFRA team to set up required resources so we can
> start moving forward (ML, Git, Website, etc).
>
> I've started working on a press release to announce the Apache Arrow
> project and will circulate a draft shortly. Once the project mailing lists
> are established, we can move this thread over there to continue
> discussions. They had us make one change to the proposal during the board
> call, which was to remove the initial committers (separate from the initial
> PMC). Once we establish the PMC list, we can immediately add the additional
> committers as our first PMC action.
>
> Thanks to everyone!
> Jacques
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <[email protected]> wrote:
>
>> +1 on a repo for the spec.
>> I do have questions as well.
>> In particular for the metadata.
>>
>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <[email protected]> wrote:
>>
>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <[email protected]>
>>> wrote:
>>> >
>>> >
>>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <[email protected]> wrote:
>>> >>
>>> >>
>>> >> >
>>> >> > As far as the existing work is concerned, I'm not sure everyone is
>>> >> > aware of the C++ code inside of Drill that can represent at least the
>>> >> > scalar types in Drill's existing Value Vectors [1]. This is currently
>>> >> > used by the native client written to hook up an ODBC driver.
>>> >> >
>>> >>
>>> >> I have read this code. From my perspective, it would be less work to
>>> >> collaborate on a self-contained implementation that closely models the
>>> >> Arrow / VV spec that includes builder classes and its own memory
>>> >> management without coupling to Drill details. I started prototyping
>>> >> something here (warning: only a few actual days of coding here):
>>> >>
>>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>> >>
>>> >> For example, you can see a test constructing an Array<Int32> or
>>> >> String (== Array<UInt8>) column here:
>>> >>
>>> >>
>>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>> >>
>>> >> I've been planning to use this as the basis of a C++ Parquet
>>> >> reader-writer and the associated Python pandas-like layer which
>>> >> includes in-memory analytics on Arrow data structures.
>>> >>
>>> >> > Parth, who is included here, has been the primary owner of this C++
>>> >> > code throughout its life in Drill. Parth, what do you think is the
>>> >> > best strategy for managing the C++ code right now? As the C++ build
>>> >> > is not tied into the Java one, as I understand it we just run it
>>> >> > manually when updates are made there and we need to update ODBC.
>>> >> > Would it be disruptive to move the code to the arrow repo? If so, we
>>> >> > could include Drill as a submodule in the new repo, or put Wes's work
>>> >> > so far in the Drill repo.
>>> >>
>>> >> If we can enumerate the non-Drill-client related parts (i.e. the array
>>> >> accessors and data structures-oriented code) that would make sense in
>>> >> a standalone Arrow library it would be great to start a side
>>> >> discussion about the design of the C++ reference implementation
>>> >> (metadata / schemas, IPC, array builders and accessors, etc.). Since
>>> >> this is quite urgent for me (intending to deliver a minimally viable
>>> >> pandas-like Arrow + Parquet in Python stack in the next ~3 months) it
>>> >> would be great to do this sooner rather than later.
>>> >>
>>> >
>>> > Most of the code for Drill C++ Value Vectors is independent of Drill,
>>> > mostly the code up to line 787 in this file:
>>> >
>>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>> >
>>> > My thought was to leave the Drill implementation alone and borrow
>>> > copiously from it when convenient for Arrow. Seems like we can still do
>>> > that building on Wes' work.
>>> >
>>>
>>> Makes sense. Speaking of code, would you all like me to set up a
>>> temporary repo for the specification itself? I already have a few
>>> questions like how and where to track array null counts.
>>>
>>> > Wes, let me know if you want to have a quick hangout on this.
>>> >
>>>
>>> Sure, I'll follow up separately to get something on the calendar.
>>> Looking forward to connecting!
>>>
>>> > Parth
>>> >
>>> >
>>>
>>
>>
>>
>> --
>> Julien
>>
>
>
