Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Hanifi Gunes Mon, 26 Oct 2015 14:38:38 -0700

I was hoping to see this discussion happening sooner :) VVs have helped
Drill represent and move data around so flexibly that it would not be
hard to prove their usefulness to the community as a standalone library. I
am in support of this proposal.



-Hanifi

On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau wrote:

> Drillers,
>
>
>
> A number of people have approached me recently about the possibility of
> collaborating on a shared columnar in-memory representation of data. This
> shared representation of data could be operated on efficiently with modern
> cpus as well as shared efficiently via shared memory, IPC and RPC. This
> would allow multiple applications to work together at high speed. Examples
> include moving back and forth between a library.
>
>
>
> As I was discussing these ideas with people working on projects including
> Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies
> like MapR and Trifacta, it became clear that much of what the Drill
> community has already constructed is very relevant to the goals of a new
> broader interchange and execution format. (In fact, Ted and I actually
> informally discussed extracting this functionality as a library more than
> two years ago.)
>
>
>
> A standard will emerge around this need and it is in the best interest of
> the Drill community and the broader ecosystem if Drill’s ValueVector
> concepts and code form the basis of a new library/collaboration/project.
> This means better interoperability, shared responsibility around
> maintenance and development and the avoidance of further division of the
> ecosystem.
>
>
>
> A little background for some: Drill is the first project to create a
> powerful language-agnostic in-memory representation of complex columnar
> data. We've learned a lot over the last three years about how to interface
> with these structures, manage memory associated with them, adjust their
> sizes, expose them in builder patterns, etc. That work is useful for a
> number of systems and it would be great if we could share the learning. By
> creating a new, well documented and collaborative library, people could
> leverage this functionality in a wider range of applications and systems.
>
>
>
> I’ve seen the great success that libraries like Parquet and Calcite have
> been able to achieve due to their focus on APIs, extensibility and
> reusability and I think we could do the same with the Drill ValueVector
> codebase. The fact that this would allow higher-speed interchange among
> many other systems and could become the standard for in-memory columnar
> exchange (as opposed to having to adopt an external standard) makes this a
> great opportunity to both benefit the Drill community and give back to the
> broader Apache community.
>
>
>
> As such, I’d like to open a discussion about taking this path. There are
> various ways to do this, but my initial proposal is to make this a new
> project that goes straight to a provisional TLP. We would then work to
> clean up layer responsibilities and
> extract pieces of the code into this new project where we collaborate with
> a wider group on a broader implementation (and more formal specification).
>
>
> Given the conversations I have had and the excitement and need for this, I
> think we should do this. If the community is supportive, we could probably
> see some really cool integrations around things like high-speed Python
> machine learning inside Drill operators before the end of the year.
>
>
>
> I’ll open a new JIRA and attach it here where we can start a POC &
> discussion of how we could extract this code.
>
>
> Looking forward to feedback!
>
>
> Jacques
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
