+1, looking forward to vectorized Parquet Readers/Writers in Drill. Making VV a standalone standard sounds great to me.
On Mon, Oct 26, 2015 at 2:46 PM, Parth Chandra <[email protected]> wrote:

> +1. Agree with Hanifi that we probably should have done this sooner :).
> Jason and I faced this need when trying to get a standalone vectorized
> Parquet reader out of the Drill code last year.
>
> On Mon, Oct 26, 2015 at 2:37 PM, Hanifi Gunes <[email protected]> wrote:
>
> > I was hoping to see this discussion happening sooner :) VVs have helped
> > Drill represent and move data around so flexibly that it would not be
> > hard to prove their usefulness to the community as a standalone library.
> > I am in support of this proposal.
> >
> > -Hanifi
> >
> > On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> > > Drillers,
> > >
> > > A number of people have approached me recently about the possibility of
> > > collaborating on a shared columnar in-memory representation of data.
> > > This shared representation of data could be operated on efficiently
> > > with modern CPUs as well as shared efficiently via shared memory, IPC
> > > and RPC. This would allow multiple applications to work together at
> > > high speed. Examples include moving back and forth between a library.
> > >
> > > As I was discussing these ideas with people working on projects
> > > including Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from
> > > companies like MapR and Trifacta, it became clear that much of what the
> > > Drill community has already constructed is very relevant to the goals
> > > of a new broader interchange and execution format. (In fact, Ted and I
> > > actually informally discussed extracting this functionality as a
> > > library more than two years ago.)
> > >
> > > A standard will emerge around this need and it is in the best interest
> > > of the Drill community and the broader ecosystem if Drill’s
> > > ValueVectors concepts and code form the basis of a new
> > > library/collaboration/project. This means better interoperability,
> > > shared responsibility around maintenance and development and the
> > > avoidance of further division of the ecosystem.
> > >
> > > A little background for some: Drill is the first project to create a
> > > powerful language-agnostic in-memory representation of complex columnar
> > > data. We've learned a lot over the last three years about how to
> > > interface with these structures, manage memory associated with them,
> > > adjust their sizes, expose them in builder patterns, etc. That work is
> > > useful for a number of systems and it would be great if we could share
> > > the learning. By creating a new, well-documented and collaborative
> > > library, people could leverage this functionality in a wider range of
> > > applications and systems.
> > >
> > > I’ve seen the great success that libraries like Parquet and Calcite
> > > have been able to achieve due to their focus on APIs, extensibility and
> > > reusability and I think we could do the same with the Drill ValueVector
> > > codebase. The fact that this would allow higher speed interchange among
> > > many other systems and become the standard for in-memory columnar
> > > exchange (as opposed to having to adopt an external standard) makes
> > > this a great opportunity to both benefit the Drill community and give
> > > back to the broader Apache community.
> > >
> > > As such, I’d like to open a discussion about taking this path. I think
> > > there would be various avenues of how to do this but my initial
> > > proposal would be to propose this as a new project that goes straight
> > > to a provisional TLP.
> > > We then would work to clean up layer responsibilities and extract
> > > pieces of the code into this new project where we collaborate with a
> > > wider group on a broader implementation (and more formal
> > > specification).
> > >
> > > Given the conversations I have had and the excitement and need for
> > > this, I think we should do this. If the community is supportive, we
> > > could probably see some really cool integrations around things like
> > > high-speed Python machine learning inside Drill operators before the
> > > end of the year.
> > >
> > > I’ll open a new JIRA and attach it here where we can start a POC &
> > > discussion of how we could extract this code.
> > >
> > > Looking forward to feedback!
> > >
> > > Jacques
> > >
> > > --
> > > Jacques Nadeau
> > > CTO and Co-Founder, Dremio

--
Julien
