Hey guys, one other note on PR. To make Arrow known and have a good launch, it is best not to announce anything publicly until the press release comes out. I'll send out a press release draft shortly, and then we can work with Apache PR to set an announcement date.

On Jan 20, 2016 7:15 PM, "Jacques Nadeau" <jacq...@dremio.com> wrote:
> Yep, straight to TLP.
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Jan 20, 2016 at 5:21 PM, Jake Luciani <j...@apache.org> wrote:
>
>> That's great! So it's going straight to TLP?
>>
>> Hey Everyone,
>>
>> Good news! The Apache board has approved Apache Arrow as a new TLP.
>> I've asked the Apache INFRA team to set up the required resources so
>> we can start moving forward (ML, Git, website, etc.).
>>
>> I've started working on a press release to announce the Apache Arrow
>> project and will circulate a draft shortly. Once the project mailing
>> lists are established, we can move this thread over there to continue
>> discussions. The board had us make one change to the proposal during
>> the call, which was to remove the initial committers (separate from
>> the initial PMC). Once we establish the PMC list, we can immediately
>> add the additional committers as our first PMC action.
>>
>> Thanks to everyone!
>> Jacques
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <jul...@dremio.com> wrote:
>>
>>> +1 on a repo for the spec.
>>> I do have questions as well, in particular about the metadata.
>>>
>>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <w...@cloudera.com> wrote:
>>>
>>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <par...@apache.org> wrote:
>>>> >
>>>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <w...@cloudera.com> wrote:
>>>> >>
>>>> >> > As far as the existing work is concerned, I'm not sure everyone
>>>> >> > is aware of the C++ code inside of Drill that can represent at
>>>> >> > least the scalar types in Drill's existing Value Vectors [1].
>>>> >> > This is currently used by the native client written to hook up
>>>> >> > an ODBC driver.
>>>> >>
>>>> >> I have read this code. From my perspective, it would be less work
>>>> >> to collaborate on a self-contained implementation that closely
>>>> >> models the Arrow / VV spec, with builder classes and its own
>>>> >> memory management, without coupling to Drill details. I started
>>>> >> prototyping something here (warning: only a few actual days of
>>>> >> coding so far):
>>>> >>
>>>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>>> >>
>>>> >> For example, you can see an Array<Int32> or String (== Array<UInt8>)
>>>> >> column being constructed in the tests here:
>>>> >>
>>>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>>> >>
>>>> >> I've been planning to use this as the basis of a C++ Parquet
>>>> >> reader-writer and the associated Python pandas-like layer, which
>>>> >> includes in-memory analytics on Arrow data structures.
>>>> >>
>>>> >> > Parth, who is included here, has been the primary owner of this
>>>> >> > C++ code throughout its life in Drill. Parth, what do you think
>>>> >> > is the best strategy for managing the C++ code right now? As the
>>>> >> > C++ build is not tied into the Java one, as I understand it we
>>>> >> > just run it manually when updates are made there and we need to
>>>> >> > update ODBC. Would it be disruptive to move the code to the
>>>> >> > arrow repo? If so, we could include Drill as a submodule in the
>>>> >> > new repo, or put Wes's work so far in the Drill repo.
>>>> >>
>>>> >> If we can enumerate the non-Drill-client-related parts (i.e., the
>>>> >> array accessors and data-structures-oriented code) that would make
>>>> >> sense in a standalone Arrow library, it would be great to start a
>>>> >> side discussion about the design of the C++ reference
>>>> >> implementation (metadata / schemas, IPC, array builders and
>>>> >> accessors, etc.). Since this is quite urgent for me (I intend to
>>>> >> deliver a minimally viable pandas-like Arrow + Parquet in Python
>>>> >> stack in the next ~3 months), it would be great to do this sooner
>>>> >> rather than later.
>>>> >
>>>> > Most of the code for Drill's C++ Value Vectors is independent of
>>>> > Drill - mostly the code up to line 787 in this file:
>>>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>>> >
>>>> > My thought was to leave the Drill implementation alone and borrow
>>>> > copiously from it when convenient for Arrow. Seems like we can
>>>> > still do that, building on Wes' work.
>>>>
>>>> Makes sense. Speaking of code, would you all like me to set up a
>>>> temporary repo for the specification itself? I already have a few
>>>> questions, like how and where to track array null counts.
>>>>
>>>> > Wes, let me know if you want to have a quick hangout on this.
>>>>
>>>> Sure, I'll follow up separately to get something on the calendar.
>>>> Looking forward to connecting!
>>>>
>>>> > Parth
>>>
>>> --
>>> Julien
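
To illustrate the builder-style construction Wes describes (an Array<Int32> built through a builder class, with null tracking), here is a minimal standalone C++ sketch. The class names Int32Builder and Int32Array are hypothetical stand-ins, not the actual arrow-cpp prototype API, and keeping the null count on the finished array is only one possible answer to Wes's open question about where to track it.

    // Minimal illustrative sketch of a builder-style API for an Int32 array
    // with a per-slot validity flag. Names are hypothetical; this is not the
    // arrow-cpp prototype's actual API.
    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Immutable view over the built data: values plus a validity flag per slot.
    class Int32Array {
     public:
      Int32Array(std::vector<int32_t> values, std::vector<bool> valid,
                 int64_t null_count)
          : values_(std::move(values)), valid_(std::move(valid)),
            null_count_(null_count) {}

      int64_t length() const { return static_cast<int64_t>(values_.size()); }
      int64_t null_count() const { return null_count_; }
      bool IsNull(int64_t i) const { return !valid_[i]; }
      int32_t Value(int64_t i) const { return values_[i]; }

     private:
      std::vector<int32_t> values_;
      std::vector<bool> valid_;
      int64_t null_count_;
    };

    // Mutable builder: append values or nulls, then Finish() into an array.
    class Int32Builder {
     public:
      void Append(int32_t v) {
        values_.push_back(v);
        valid_.push_back(true);
      }
      void AppendNull() {
        values_.push_back(0);  // placeholder slot for the null entry
        valid_.push_back(false);
        ++null_count_;
      }
      Int32Array Finish() {
        return Int32Array(std::move(values_), std::move(valid_), null_count_);
      }

     private:
      std::vector<int32_t> values_;
      std::vector<bool> valid_;
      int64_t null_count_ = 0;
    };

    int main() {
      Int32Builder builder;
      builder.Append(1);
      builder.AppendNull();
      builder.Append(3);

      Int32Array arr = builder.Finish();
      std::cout << "length=" << arr.length()
                << " null_count=" << arr.null_count() << "\n";
      for (int64_t i = 0; i < arr.length(); ++i) {
        if (arr.IsNull(i)) {
          std::cout << "null\n";
        } else {
          std::cout << arr.Value(i) << "\n";
        }
      }
      return 0;
    }

Compiled with any C++11 compiler (e.g. g++ -std=c++11 sketch.cc), this prints length=3 null_count=1 followed by 1, null, 3; a real implementation would use packed validity bitmaps and explicit memory management rather than std::vector<bool>.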