Dear all — I converted our Google discussion document about the Arrow memory layout (prior to the ASF proposal) to Markdown and placed it here:

https://github.com/arrow-data/arrow-format

Thanks,
Wes

On Wed, Jan 20, 2016 at 8:14 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> Hey guys, one other note on PR. To make Arrow known and have a good
> launch, it is best not to announce anything publicly until the press
> release comes out. I will send out a press release draft shortly and then
> we can work with Apache PR to set an announcement date.
>
> On Jan 20, 2016 7:15 PM, "Jacques Nadeau" <jacq...@dremio.com> wrote:
>
>> Yep, straight to TLP.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Wed, Jan 20, 2016 at 5:21 PM, Jake Luciani <j...@apache.org> wrote:
>>
>>> That's great! So it's going straight to TLP?
>>>
>>> Hey Everyone,
>>>
>>> Good news! The Apache board has approved Apache Arrow as a new TLP.
>>> I've asked the Apache INFRA team to set up the required resources so we
>>> can start moving forward (mailing lists, Git, website, etc.).
>>>
>>> I've started working on a press release to announce the Apache Arrow
>>> project and will circulate a draft shortly. Once the project mailing
>>> lists are established, we can move this thread over there to continue
>>> discussions. The board had us make one change to the proposal during the
>>> call, which was to remove the initial committers (separate from the
>>> initial PMC). Once we establish the PMC list, we can immediately add the
>>> additional committers as our first PMC action.
>>>
>>> Thanks to everyone!
>>> Jacques
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <jul...@dremio.com>
>>> wrote:
>>>
>>>> +1 on a repo for the spec.
>>>> I do have questions as well, in particular about the metadata.
>>>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <w...@cloudera.com>
>>>> wrote:
>>>>
>>>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <par...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <w...@cloudera.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> > As far as the existing work is concerned, I'm not sure everyone
>>>>> >> > is aware of the C++ code inside of Drill that can represent at
>>>>> >> > least the scalar types in Drill's existing Value Vectors [1].
>>>>> >> > This is currently used by the native client written to hook up
>>>>> >> > an ODBC driver.
>>>>> >>
>>>>> >> I have read this code. From my perspective, it would be less work
>>>>> >> to collaborate on a self-contained implementation that closely
>>>>> >> models the Arrow / VV spec, with builder classes and its own memory
>>>>> >> management, without coupling to Drill details. I started
>>>>> >> prototyping something here (warning: only a few actual days of
>>>>> >> coding so far):
>>>>> >>
>>>>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>>>> >>
>>>>> >> For example, you can see an Array<Int32> or String (== Array<UInt8>)
>>>>> >> column being constructed in the tests here:
>>>>> >>
>>>>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>>>> >>
>>>>> >> I've been planning to use this as the basis of a C++ Parquet
>>>>> >> reader/writer and the associated Python pandas-like layer, which
>>>>> >> includes in-memory analytics on Arrow data structures.
>>>>> >>
>>>>> >> > Parth, who is included here, has been the primary owner of this
>>>>> >> > C++ code throughout its life in Drill. Parth, what do you think
>>>>> >> > is the best strategy for managing the C++ code right now?
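[For readers following along: the builder pattern under discussion can be sketched roughly as below. This is only a minimal illustration under assumed semantics — values land in a contiguous buffer while nulls are tracked in a separate one-bit-per-slot validity bitmap — and every class and method name here is hypothetical, not the actual arrow-cpp API; see the linked repo and builder-test.cc for the real code.]

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an Arrow-style Int32 builder (not the real API).
// Values go into a contiguous buffer; validity is a bitmap with one bit
// per slot (1 = valid, 0 = null), and the null count is kept incrementally.
class Int32Builder {
 public:
  void Append(int32_t v) {
    values_.push_back(v);
    SetBit(length_, true);
    ++length_;
  }

  void AppendNull() {
    values_.push_back(0);  // slot is allocated but its value is undefined
    SetBit(length_, false);
    ++length_;
    ++null_count_;
  }

  int64_t length() const { return length_; }
  int64_t null_count() const { return null_count_; }

  bool IsValid(int64_t i) const {
    return (validity_[static_cast<std::size_t>(i / 8)] >> (i % 8)) & 1;
  }

  int32_t Value(int64_t i) const {
    return values_[static_cast<std::size_t>(i)];
  }

 private:
  void SetBit(int64_t i, bool valid) {
    // Grow the bitmap one byte at a time as slots are appended.
    if (static_cast<std::size_t>(i / 8) >= validity_.size())
      validity_.push_back(0);
    if (valid)
      validity_[static_cast<std::size_t>(i / 8)] |=
          static_cast<uint8_t>(1u << (i % 8));
  }

  std::vector<int32_t> values_;
  std::vector<uint8_t> validity_;
  int64_t length_ = 0;
  int64_t null_count_ = 0;
};
```

A builder like this would be used by appending values and nulls in order, then handing the finished buffers off as an immutable array.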
>>>>> >> > As the C++ build is not tied into the Java one, as I understand
>>>>> >> > it we just run it manually when updates are made there and we
>>>>> >> > need to update ODBC. Would it be disruptive to move the code to
>>>>> >> > the arrow repo? If so, we could include Drill as a submodule in
>>>>> >> > the new repo, or put Wes's work so far in the Drill repo.
>>>>> >>
>>>>> >> If we can enumerate the non-Drill-client-related parts (i.e. the
>>>>> >> array accessors and data-structures-oriented code) that would make
>>>>> >> sense in a standalone Arrow library, it would be great to start a
>>>>> >> side discussion about the design of the C++ reference
>>>>> >> implementation (metadata / schemas, IPC, array builders and
>>>>> >> accessors, etc.). Since this is quite urgent for me (I intend to
>>>>> >> deliver a minimally viable pandas-like Arrow + Parquet Python
>>>>> >> stack in the next ~3 months), it would be great to do this sooner
>>>>> >> rather than later.
>>>>> >>
>>>>> >
>>>>> > Most of the code for the Drill C++ Value Vectors is independent of
>>>>> > Drill, mostly the code up to line 787 in this file:
>>>>> >
>>>>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>>>> >
>>>>> > My thought was to leave the Drill implementation alone and borrow
>>>>> > copiously from it when convenient for Arrow. It seems we can still
>>>>> > do that building on Wes's work.
>>>>> >
>>>>>
>>>>> Makes sense. Speaking of code, would you all like me to set up a
>>>>> temporary repo for the specification itself? I already have a few
>>>>> questions, like how and where to track array null counts.
>>>>>
>>>>> > Wes, let me know if you want to have a quick hangout on this.
>>>>>
>>>>> Sure, I'll follow up separately to get something on the calendar.
>>>>> Looking forward to connecting!
>>>>> > Parth
>>>>
>>>> --
>>>> Julien
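[A note on the null-count question raised above: two natural options are to cache the count in the array's metadata, or to recompute it on demand from the validity bitmap. A minimal sketch of the recompute-on-demand side of that trade-off, under the assumed bitmap convention of one bit per slot with 1 = valid; the function name is hypothetical:]

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: derive the null count of an array by scanning its
// validity bitmap (1 bit per slot, bit set = valid). An alternative design
// caches this count in array metadata and updates it as slots are written.
int64_t CountNulls(const std::vector<uint8_t>& validity, int64_t length) {
  int64_t valid = 0;
  for (int64_t i = 0; i < length; ++i)
    valid += (validity[static_cast<std::size_t>(i / 8)] >> (i % 8)) & 1;
  return length - valid;
}
```

Caching trades a little bookkeeping on write for O(1) reads; recomputing keeps the metadata simpler but costs a scan (or a popcount pass) per query.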