Dear all — I converted our Google discussion document about the Arrow memory layout (prior to the ASF proposal) to Markdown and placed it here:

https://github.com/arrow-data/arrow-format

Thanks,
Wes

On Wed, Jan 20, 2016 at 8:14 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
> Hey guys, one other note on PR. To make Arrow known and have a good
> launch, it is best not to announce anything publicly until the press
> release comes out. I will send out a press release draft shortly and then
> we can work with Apache PR to set an announcement date.
>
> On Jan 20, 2016 7:15 PM, "Jacques Nadeau" <jacq...@dremio.com> wrote:
>
>> Yep, straight to TLP.
>>
>> --
>> Jacques Nadeau
>> CTO and Co-Founder, Dremio
>>
>> On Wed, Jan 20, 2016 at 5:21 PM, Jake Luciani <j...@apache.org> wrote:
>>
>>> That's great! So it's going straight to TLP?
>>>
>>> Hey Everyone,
>>>
>>> Good news! The Apache board has approved Apache Arrow as a new TLP.
>>> I've asked the Apache INFRA team to set up the required resources so we
>>> can start moving forward (mailing lists, Git, website, etc.).
>>>
>>> I've started working on a press release to announce the Apache Arrow
>>> project and will circulate a draft shortly. Once the project mailing
>>> lists are established, we can move this thread over there to continue
>>> discussions. The board had us make one change to the proposal during the
>>> call, which was to remove the initial committers (separate from the
>>> initial PMC). Once we establish the PMC list, we can immediately add the
>>> additional committers as our first PMC action.
>>>
>>> Thanks to everyone!
>>> Jacques
>>>
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>>
>>> On Tue, Jan 12, 2016 at 11:03 PM, Julien Le Dem <jul...@dremio.com>
>>> wrote:
>>>
>>>> +1 on a repo for the spec.
>>>> I do have questions as well, in particular about the metadata.
>>>> On Tue, Jan 12, 2016 at 6:59 PM, Wes McKinney <w...@cloudera.com>
>>>> wrote:
>>>>
>>>>> On Tue, Jan 12, 2016 at 6:21 PM, Parth Chandra <par...@apache.org>
>>>>> wrote:
>>>>> >
>>>>> > On Tue, Jan 12, 2016 at 9:57 AM, Wes McKinney <w...@cloudera.com>
>>>>> > wrote:
>>>>> >>
>>>>> >> > As far as the existing work is concerned, I'm not sure everyone
>>>>> >> > is aware of the C++ code inside of Drill that can represent at
>>>>> >> > least the scalar types in Drill's existing Value Vectors [1].
>>>>> >> > This is currently used by the native client written to hook up
>>>>> >> > an ODBC driver.
>>>>> >>
>>>>> >> I have read this code. From my perspective, it would be less work
>>>>> >> to collaborate on a self-contained implementation that closely
>>>>> >> models the Arrow / VV spec, with builder classes and its own memory
>>>>> >> management, without coupling to Drill details. I started
>>>>> >> prototyping something here (warning: only a few actual days of
>>>>> >> coding so far):
>>>>> >>
>>>>> >> https://github.com/arrow-data/arrow-cpp/tree/master/src/arrow
>>>>> >>
>>>>> >> For example, you can see an Array<Int32> or String (== Array<UInt8>)
>>>>> >> column being constructed in the tests here:
>>>>> >>
>>>>> >> https://github.com/arrow-data/arrow-cpp/blob/master/src/arrow/builder-test.cc#L328
>>>>> >>
>>>>> >> I've been planning to use this as the basis of a C++ Parquet
>>>>> >> reader/writer and the associated Python pandas-like layer, which
>>>>> >> includes in-memory analytics on Arrow data structures.
>>>>> >>
>>>>> >> > Parth, who is included here, has been the primary owner of this
>>>>> >> > C++ code throughout its life in Drill. Parth, what do you think
>>>>> >> > is the best strategy for managing the C++ code right now?
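[For readers following along: the builder pattern under discussion can be sketched roughly as below. This is only a minimal illustration under assumed semantics — values land in a contiguous buffer while nulls are tracked in a separate one-bit-per-slot validity bitmap — and every class and method name here is hypothetical, not the actual arrow-cpp API; see the linked repo and builder-test.cc for the real code.]

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch of an Arrow-style Int32 builder (not the real API).
// Values go into a contiguous buffer; validity is a bitmap with one bit
// per slot (1 = valid, 0 = null), and the null count is kept incrementally.
class Int32Builder {
 public:
  void Append(int32_t v) {
    values_.push_back(v);
    SetBit(length_, true);
    ++length_;
  }

  void AppendNull() {
    values_.push_back(0);  // slot is allocated but its value is undefined
    SetBit(length_, false);
    ++length_;
    ++null_count_;
  }

  int64_t length() const { return length_; }
  int64_t null_count() const { return null_count_; }

  bool IsValid(int64_t i) const {
    return (validity_[static_cast<std::size_t>(i / 8)] >> (i % 8)) & 1;
  }

  int32_t Value(int64_t i) const {
    return values_[static_cast<std::size_t>(i)];
  }

 private:
  void SetBit(int64_t i, bool valid) {
    // Grow the bitmap one byte at a time as slots are appended.
    if (static_cast<std::size_t>(i / 8) >= validity_.size())
      validity_.push_back(0);
    if (valid)
      validity_[static_cast<std::size_t>(i / 8)] |=
          static_cast<uint8_t>(1u << (i % 8));
  }

  std::vector<int32_t> values_;
  std::vector<uint8_t> validity_;
  int64_t length_ = 0;
  int64_t null_count_ = 0;
};
```

A builder like this would be used by appending values and nulls in order, then handing the finished buffers off as an immutable array.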
>>>>> >> > As the C++ build is not tied into the Java one, as I understand
>>>>> >> > it we just run it manually when updates are made there and we
>>>>> >> > need to update ODBC. Would it be disruptive to move the code to
>>>>> >> > the arrow repo? If so, we could include Drill as a submodule in
>>>>> >> > the new repo, or put Wes's work so far in the Drill repo.
>>>>> >>
>>>>> >> If we can enumerate the non-Drill-client-related parts (i.e. the
>>>>> >> array accessors and data-structures-oriented code) that would make
>>>>> >> sense in a standalone Arrow library, it would be great to start a
>>>>> >> side discussion about the design of the C++ reference
>>>>> >> implementation (metadata / schemas, IPC, array builders and
>>>>> >> accessors, etc.). Since this is quite urgent for me (I intend to
>>>>> >> deliver a minimally viable pandas-like Arrow + Parquet Python
>>>>> >> stack in the next ~3 months), it would be great to do this sooner
>>>>> >> rather than later.
>>>>> >>
>>>>> >
>>>>> > Most of the code for the Drill C++ Value Vectors is independent of
>>>>> > Drill, mostly the code up to line 787 in this file:
>>>>> >
>>>>> > https://github.com/apache/drill/blob/master/contrib/native/client/src/include/drill/recordBatch.hpp
>>>>> >
>>>>> > My thought was to leave the Drill implementation alone and borrow
>>>>> > copiously from it when convenient for Arrow. It seems we can still
>>>>> > do that building on Wes's work.
>>>>> >
>>>>>
>>>>> Makes sense. Speaking of code, would you all like me to set up a
>>>>> temporary repo for the specification itself? I already have a few
>>>>> questions, like how and where to track array null counts.
>>>>>
>>>>> > Wes, let me know if you want to have a quick hangout on this.
>>>>>
>>>>> Sure, I'll follow up separately to get something on the calendar.
>>>>> Looking forward to connecting!
>>>>> > Parth
>>>>
>>>> --
>>>> Julien
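[A note on the null-count question raised above: two natural options are to cache the count in the array's metadata, or to recompute it on demand from the validity bitmap. A minimal sketch of the recompute-on-demand side of that trade-off, under the assumed bitmap convention of one bit per slot with 1 = valid; the function name is hypothetical:]

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: derive the null count of an array by scanning its
// validity bitmap (1 bit per slot, bit set = valid). An alternative design
// caches this count in array metadata and updates it as slots are written.
int64_t CountNulls(const std::vector<uint8_t>& validity, int64_t length) {
  int64_t valid = 0;
  for (int64_t i = 0; i < length; ++i)
    valid += (validity[static_cast<std::size_t>(i / 8)] >> (i % 8)) & 1;
  return length - valid;
}
```

Caching trades a little bookkeeping on write for O(1) reads; recomputing keeps the metadata simpler but costs a scan (or a popcount pass) per query.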