Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Wes McKinney Mon, 26 Oct 2015 17:23:37 -0700

hi all,

I am excited about this initiative and I personally am looking forward
to seeing a standard in-memory columnar representation made available
to data science languages like Python, R, and Julia, and it's also the
ideal place to build out a reference vectorized Parquet implementation
for use in those languages (lack of Python/R Parquet support has been
a sore spot for the data science ecosystem in recent times). This will
also enable us to create an ecosystem of interoperable tools amongst
SQL (Drill, Impala, ...) and other compute systems (e.g. Spark) and
columnar storage systems (e.g. Kudu, Parquet, etc.).


Having richer in-memory columnar data structures alone will be a boon
for the data science languages, which are also working to improve both
in-memory analytics and out-of-core algorithms, and any distributed
compute or storage system that can interoperate with these tools will
benefit.

thanks,
Wes

On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <jacq...@dremio.com> wrote:
>
> Drillers,
>
>
>
> A number of people have approached me recently about the possibility of 
> collaborating on a shared columnar in-memory representation of data. This 
> shared representation of data could be operated on efficiently with modern 
> CPUs as well as shared efficiently via shared memory, IPC and RPC. This would 
> allow multiple applications to work together at high speed. Examples include 
> moving back and forth between a library.
>
>
>
> As I was discussing these ideas with people working on projects including 
> Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies like 
> MapR and Trifacta, it became clear that much of what the Drill community has 
> already constructed is very relevant to the goals of a new broader 
> interchange and execution format. (In fact, Ted and I actually informally 
> discussed extracting this functionality as a library more than two years ago.)
>
>
>
> A standard will emerge around this need and it is in the best interest of the 
> Drill community and the broader ecosystem if Drill’s ValueVectors concepts 
> and code form the basis of a new library/collaboration/project. This means 
> better interoperability, shared responsibility around maintenance and 
> development and the avoidance of further division of the ecosystem.
>
>
>
> A little background for some: Drill is the first project to create a powerful 
> language-agnostic in-memory representation of complex columnar data. We've 
> learned a lot over the last three years about how to interface with these 
> structures, manage memory associated with them, adjust their sizes, expose 
> them in builder patterns, etc. That work is useful for a number of systems 
> and it would be great if we could share the learning. By creating a new, well 
> documented and collaborative library, people could leverage this 
> functionality in a wider range of applications and systems.
>
>
>
> I’ve seen the great success that libraries like Parquet and Calcite have been 
> able to achieve due to their focus on APIs, extensibility and reusability and 
> I think we could do the same with the Drill ValueVector codebase. The fact 
> that this would allow higher speed interchange among many other systems and 
> become the standard for in-memory columnar exchange (as opposed to having 
> to adopt an external standard) makes this a great opportunity to both benefit 
> the Drill community and give back to the broader Apache community.
>
>
>
> As such, I’d like to open a discussion about taking this path. There are 
> various ways to do this, but my initial proposal is to set this up as a new 
> project that goes straight to a provisional TLP. We 
> then would work to clean up layer responsibilities and extract pieces of the 
> code into this new project where we collaborate with a wider group on a 
> broader implementation (and more formal specification).
>
>
> Given the conversations I have had and the excitement and need for this, I 
> think we should do this. If the community is supportive, we could probably 
> see some really cool integrations around things like high-speed Python 
> machine learning inside Drill operators before the end of the year.
>
>
>
> I’ll open a new JIRA and attach it here where we can start a POC & discussion 
> of how we could extract this code.
>
>
> Looking forward to feedback!
>
>
> Jacques
>
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
