+1. Looking forward to seeing it grow outside of Drill, and making VV perhaps not just an implementation but also a spec with different implementations would be nice as well.
Tim

On Tue, Oct 27, 2015 at 5:11 PM, Todd Lipcon <[email protected]> wrote:

> +1. We on the Kudu team are interested in exposing data blocks using this in-memory format as the results of our scan operators (via RPC or shared memory transport). Standardizing it will make everyone's lives easier (and the performance much better!)
>
> -Todd
>
> On Mon, Oct 26, 2015 at 5:22 PM, Wes McKinney <[email protected]> wrote:
>
>> hi all,
>>
>> I am excited about this initiative, and I personally am looking forward to seeing a standard in-memory columnar representation made available to data science languages like Python, R, and Julia. It is also the ideal place to build out a reference vectorized Parquet implementation for use in those languages (the lack of Python/R Parquet support has been a sore spot for the data science ecosystem in recent times). This will also enable us to create an ecosystem of interoperable tools amongst SQL engines (Drill, Impala, ...), other compute systems (e.g. Spark), and columnar storage systems (e.g. Kudu, Parquet, etc.).
>>
>> Having richer in-memory columnar data structures alone will be a boon for the data science languages, which are also working to improve both in-memory analytics and out-of-core algorithms, and any distributed compute or storage system that can interoperate with these tools will benefit.
>>
>> thanks,
>> Wes
>>
>> On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <[email protected]> wrote:
>> >
>> > Drillers,
>> >
>> > A number of people have approached me recently about the possibility of collaborating on a shared columnar in-memory representation of data. This shared representation could be operated on efficiently by modern CPUs and shared efficiently via shared memory, IPC, and RPC, which would allow multiple applications to work together at high speed. Examples include moving data back and forth between a library and an application.
>> >
>> > As I was discussing these ideas with people working on projects including Calcite, Ibis, Kudu, Storm, Heron, and Parquet, and on products from companies like MapR and Trifacta, it became clear that much of what the Drill community has already constructed is very relevant to the goals of a new, broader interchange and execution format. (In fact, Ted and I actually informally discussed extracting this functionality as a library more than two years ago.)
>> >
>> > A standard will emerge around this need, and it is in the best interest of the Drill community and the broader ecosystem if Drill's ValueVectors concepts and code form the basis of a new library/collaboration/project. This means better interoperability, shared responsibility around maintenance and development, and the avoidance of further division of the ecosystem.
>> >
>> > A little background for some: Drill is the first project to create a powerful, language-agnostic in-memory representation of complex columnar data. We've learned a lot over the last three years about how to interface with these structures, manage the memory associated with them, adjust their sizes, expose them in builder patterns, etc. That work is useful for a number of systems, and it would be great if we could share the learning. By creating a new, well-documented and collaborative library, people could leverage this functionality in a wider range of applications and systems.
>> >
>> > I've seen the great success that libraries like Parquet and Calcite have been able to achieve due to their focus on APIs, extensibility and reusability, and I think we could do the same with the Drill ValueVector codebase. The fact that this would allow higher-speed interchange among many other systems and become the standard for in-memory columnar exchange (as opposed to having to adopt an external standard) makes this a great opportunity to both benefit the Drill community and give back to the broader Apache community.
>> >
>> > As such, I'd like to open a discussion about taking this path. There would be various avenues for how to do this, but my initial proposal would be that this become a new project that goes straight to a provisional TLP. We would then work to clean up layer responsibilities and extract pieces of the code into this new project, where we would collaborate with a wider group on a broader implementation (and a more formal specification).
>> >
>> > Given the conversations I have had and the excitement and need for this, I think we should do this. If the community is supportive, we could probably see some really cool integrations around things like high-speed Python machine learning inside Drill operators before the end of the year.
>> >
>> > I'll open a new JIRA and attach it here, where we can start a POC & discussion of how we could extract this code.
>> >
>> > Looking forward to feedback!
>> >
>> > Jacques
>> >
>> > --
>> > Jacques Nadeau
>> > CTO and Co-Founder, Dremio
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
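The ValueVectors idea the thread keeps returning to is concrete enough to sketch. The following is a rough illustration only (it is not the actual Drill ValueVector API; the class and method names are invented for this example): a minimal nullable fixed-width vector pairs a contiguous data buffer with a validity bitmap and is populated through a builder, which is the general shape of structure being proposed for extraction.

import java.nio.ByteBuffer;
import java.util.BitSet;

// Illustrative sketch of a nullable fixed-width "value vector", NOT Drill's API.
// Values live contiguously in one buffer so they can be scanned efficiently and,
// in principle, handed across process boundaries as raw bytes; a validity bitmap
// records which slots are non-null.
public final class IntVectorSketch {
    private final ByteBuffer data;   // 4 bytes per slot, contiguous
    private final BitSet validity;   // bit i set => slot i holds a value
    private final int valueCount;

    private IntVectorSketch(ByteBuffer data, BitSet validity, int valueCount) {
        this.data = data;
        this.validity = validity;
        this.valueCount = valueCount;
    }

    public boolean isNull(int index) { return !validity.get(index); }

    public int get(int index) { return data.getInt(index * 4); }

    public int getValueCount() { return valueCount; }

    // Builder pattern in the spirit of what the thread mentions. A real
    // implementation would also grow/reallocate the buffer ("adjust their
    // sizes"); this sketch keeps a fixed capacity for brevity.
    public static final class Builder {
        private final ByteBuffer data;
        private final BitSet validity = new BitSet();
        private int count = 0;

        public Builder(int capacity) {
            this.data = ByteBuffer.allocateDirect(capacity * 4);
        }

        public Builder append(int value) {
            data.putInt(count * 4, value);
            validity.set(count);
            count++;
            return this;
        }

        public Builder appendNull() {
            count++;          // slot stays zeroed and its validity bit stays clear
            return this;
        }

        public IntVectorSketch build() {
            return new IntVectorSketch(data, validity, count);
        }
    }

    public static void main(String[] args) {
        IntVectorSketch v = new Builder(4).append(7).appendNull().append(42).build();
        for (int i = 0; i < v.getValueCount(); i++) {
            System.out.println(v.isNull(i) ? "null" : Integer.toString(v.get(i)));
        }
    }
}

Drill's real vectors go well beyond this sketch (variable-width and nested complex types, off-heap memory management, resizing), but the buffer-plus-validity layout and the builder-style construction are the core of what the thread proposes to pull out into a shared library and specification.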
