Hey guys, one other note on PR. To make Arrow known and have a good launch, it is best not to announce anything publicly until the press release comes out. I'll send out a press release draft shortly, and then we can work with Apache PR to set an announcement date.

On Jan 20, 2016 7:15 PM, "Jacques Nadeau" <jacq...@dremio.com> wrote:
> Yep, straight to TLP.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Jan 20, 2016 at 5:21 PM, Jake Luciani <j...@apache.org> wrote:
>
>> That's great! So it's going straight to TLP?
>>
>> Hey Everyone,
>>
>> Good news! The Apache board has approved Apache Arrow as a new TLP.
>> I've asked the Apache INFRA team to set up the required resources so
>> we can start moving forward (ML, Git, website, etc.).
>>
>> I've started working on a press release to announce the Apache Arrow
>> project and will circulate a draft shortly. Once the project mailing
>> lists are established, we can move this thread over there to continue
>> discussions. The board had us make one change to the proposal during
>> the call, which was to remove the initial committers (separate from
>> the initial PMC). Once we establish the PMC list, we can immediately
>> add the additional committers as our first PMC action.
>>
>> Thanks to everyone!
>> Jacques
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <jul...@dremio.com> wrote:
>>
>>> +1 on a repo for the spec.
>>> I do have questions as well, in particular about the metadata.
>>>
>>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <w...@cloudera.com> wrote:
>>>
>>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <par...@apache.org> wrote:
>>>> >
>>>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>> >>
>>>> >> > As far as the existing work is concerned, I'm not sure everyone
>>>> >> > is aware of the C++ code inside of Drill that can represent at
>>>> >> > least the scalar types in Drill's existing Value Vectors [1].
>>>> >> > This is currently used by the native client written to hook up
>>>> >> > an ODBC driver.
>>>> >>
>>>> >> I have read this code. From my perspective, it would be less work
>>>> >> to collaborate on a self-contained implementation that closely
>>>> >> models the Arrow / VV spec, with builder classes and its own
>>>> >> memory management, without coupling to Drill details. I started
>>>> >> prototyping something here (warning: only a few actual days of
>>>> >> coding so far):
>>>> >>
>>>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>>> >>
>>>> >> For example, you can see an Array<Int32> or String (== Array<UInt8>)
>>>> >> column being constructed in the tests here:
>>>> >>
>>>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>>> >>
>>>> >> I've been planning to use this as the basis of a C++ Parquet
>>>> >> reader-writer and the associated Python pandas-like layer, which
>>>> >> includes in-memory analytics on Arrow data structures.
>>>> >>
>>>> >> > Parth, who is included here, has been the primary owner of this
>>>> >> > C++ code throughout its life in Drill. Parth, what do you think
>>>> >> > is the best strategy for managing the C++ code right now? As the
>>>> >> > C++ build is not tied into the Java one, as I understand it we
>>>> >> > just run it manually when updates are made there and we need to
>>>> >> > update ODBC. Would it be disruptive to move the code to the
>>>> >> > arrow repo? If so, we could include Drill as a submodule in the
>>>> >> > new repo, or put Wes's work so far in the Drill repo.
>>>> >>
>>>> >> If we can enumerate the non-Drill-client-related parts (i.e., the
>>>> >> array accessors and data-structures-oriented code) that would make
>>>> >> sense in a standalone Arrow library, it would be great to start a
>>>> >> side discussion about the design of the C++ reference
>>>> >> implementation (metadata / schemas, IPC, array builders and
>>>> >> accessors, etc.). Since this is quite urgent for me (I intend to
>>>> >> deliver a minimally viable pandas-like Arrow + Parquet in Python
>>>> >> stack in the next ~3 months), it would be great to do this sooner
>>>> >> rather than later.
>>>> >
>>>> > Most of the code for Drill's C++ Value Vectors is independent of
>>>> > Drill - mostly the code up to line 787 in this file:
>>>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>>> >
>>>> > My thought was to leave the Drill implementation alone and borrow
>>>> > copiously from it when convenient for Arrow. Seems like we can
>>>> > still do that, building on Wes' work.
>>>>
>>>> Makes sense. Speaking of code, would you all like me to set up a
>>>> temporary repo for the specification itself? I already have a few
>>>> questions, like how and where to track array null counts.
>>>>
>>>> > Wes, let me know if you want to have a quick hangout on this.
>>>>
>>>> Sure, I'll follow up separately to get something on the calendar.
>>>> Looking forward to connecting!
>>>>
>>>> > Parth
>>>
>>> --
>>> Julien
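
To illustrate the builder-style construction Wes describes (an Array<Int32> built through a builder class, with null tracking), here is a minimal standalone C++ sketch. The class names Int32Builder and Int32Array are hypothetical stand-ins, not the actual arrow-cpp prototype API, and keeping the null count on the finished array is only one possible answer to Wes's open question about where to track it.

    // Minimal illustrative sketch of a builder-style API for an Int32 array
    // with a per-slot validity flag. Names are hypothetical; this is not the
    // arrow-cpp prototype's actual API.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Immutable view over the built data: values plus a validity flag per slot.
    class Int32Array {
     public:
      Int32Array(std::vector<int32_t> values, std::vector<bool> valid,
                 int64_t null_count)
          : values_(std::move(values)), valid_(std::move(valid)),
            null_count_(null_count) {}

      int64_t length() const { return static_cast<int64_t>(values_.size()); }
      int64_t null_count() const { return null_count_; }
      bool IsNull(int64_t i) const { return !valid_[i]; }
      int32_t Value(int64_t i) const { return values_[i]; }

     private:
      std::vector<int32_t> values_;
      std::vector<bool> valid_;
      int64_t null_count_;
    };

    // Mutable builder: append values or nulls, then Finish() into an array.
    class Int32Builder {
     public:
      void Append(int32_t v) {
        values_.push_back(v);
        valid_.push_back(true);
      }
      void AppendNull() {
        values_.push_back(0);  // placeholder slot for the null entry
        valid_.push_back(false);
        ++null_count_;
      }
      Int32Array Finish() {
        return Int32Array(std::move(values_), std::move(valid_), null_count_);
      }

     private:
      std::vector<int32_t> values_;
      std::vector<bool> valid_;
      int64_t null_count_ = 0;
    };

    int main() {
      Int32Builder builder;
      builder.Append(1);
      builder.AppendNull();
      builder.Append(3);

      Int32Array arr = builder.Finish();
      std::cout << "length=" << arr.length()
                << " null_count=" << arr.null_count() << "\n";
      for (int64_t i = 0; i < arr.length(); ++i) {
        if (arr.IsNull(i)) {
          std::cout << "null\n";
        } else {
          std::cout << arr.Value(i) << "\n";
        }
      }
      return 0;
    }

Compiled with any C++11 compiler (e.g. g++ -std=c++11 sketch.cc), this prints length=3 null_count=1 followed by 1, null, 3; a real implementation would use packed validity bitmaps and explicit memory management rather than std::vector<bool>.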