Thanks and great job driving this, Jacques!
On Wed, Jan 20, 2016 at 3:28 PM, Jacques Nadeau <jacq...@dremio.com> wrote:

> Hey Everyone,
>
> Good news! The Apache board has approved Apache Arrow as a new TLP.
> I've asked the Apache INFRA team to set up required resources so we can
> start moving forward (ML, Git, Website, etc).
>
> I've started working on a press release to announce the Apache Arrow
> project and will circulate a draft shortly. Once the project mailing lists
> are established, we can move this thread over there to continue
> discussions. They had us make one change to the proposal during the board
> call, which was to remove the initial committers (separate from the initial
> PMC). Once we establish the PMC list, we can immediately add the additional
> committers as our first PMC action.
>
> Thanks to everyone!
> Jacques
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <jul...@dremio.com> wrote:
>
>> +1 on a repo for the spec.
>> I do have questions as well.
>> In particular for the metadata.
>>
>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <w...@cloudera.com> wrote:
>>
>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <par...@apache.org> wrote:
>>> >
>>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <w...@cloudera.com> wrote:
>>> >>
>>> >> > As far as the existing work is concerned, I'm not sure everyone is
>>> >> > aware of the C++ code inside of Drill that can represent at least
>>> >> > the scalar types in Drill's existing Value Vectors [1]. This is
>>> >> > currently used by the native client written to hook up an ODBC driver.
>>> >> >
>>> >>
>>> >> I have read this code. From my perspective, it would be less work to
>>> >> collaborate on a self-contained implementation that closely models the
>>> >> Arrow / VV spec that includes builder classes and its own memory
>>> >> management without coupling to Drill details. I started prototyping
>>> >> something here (warning: only a few actual days of coding here):
>>> >>
>>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>> >>
>>> >> For example, you can see an example constructing an Array<Int32> or
>>> >> String (== Array<UInt8>) column in the tests here
>>> >>
>>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>> >>
>>> >> I've been planning to use this as the basis of a C++ Parquet
>>> >> reader-writer and the associated Python pandas-like layer which
>>> >> includes in-memory analytics on Arrow data structures.
>>> >>
>>> >> > Parth who is included here has been the primary owner of this C++ code
>>> >> > throughout its life in Drill. Parth, what do you think is the best
>>> >> > strategy for managing the C++ code right now? As the C++ build is not
>>> >> > tied into the Java one, as I understand it we just run it manually when
>>> >> > updates are made there and we need to update ODBC. Would it be
>>> >> > disruptive to move the code to the arrow repo? If so, we could include
>>> >> > Drill as a submodule in the new repo, or put Wes's work so far in the
>>> >> > Drill repo.
>>> >>
>>> >> If we can enumerate the non-Drill-client related parts (i.e. the array
>>> >> accessors and data structures-oriented code) that would make sense in
>>> >> a standalone Arrow library it would be great to start a side
>>> >> discussion about the design of the C++ reference implementation
>>> >> (metadata / schemas, IPC, array builders and accessors, etc.). Since
>>> >> this is quite urgent for me (intending to deliver a minimally viable
>>> >> pandas-like Arrow + Parquet in Python stack in the next ~3 months) it
>>> >> would be great to do this sooner rather than later.
>>> >>
>>> >
>>> > Most of the code for Drill C++ Value Vectors is independent of Drill -
>>> > mostly the code up to line 787 in this file -
>>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>> >
>>> > My thought was to leave the Drill implementation alone and borrow copiously
>>> > from it when convenient for Arrow. Seems like we can still do that building
>>> > on Wes' work.
>>> >
>>>
>>> Makes sense. Speaking of code, would you all like me to set up a
>>> temporary repo for the specification itself? I already have a few
>>> questions like how and where to track array null counts.
>>>
>>> > Wes, let me know if you want to have a quick hangout on this.
>>> >
>>>
>>> Sure, I'll follow up separately to get something on the calendar.
>>> Looking forward to connecting!
>>>
>>> > Parth
>>> >
>>
>> --
>> Julien
>