+1. Looking forward to seeing it grow outside of Drill, and making VV perhaps not just an implementation but also a spec with different implementations would be nice as well.
Tim

On Tue, Oct 27, 2015 at 5:11 PM, Todd Lipcon <[email protected]> wrote:

> +1. We on the Kudu team are interested in exposing data blocks using this in-memory format as the results of our scan operators (via RPC or shared memory transport). Standardizing it will make everyone's lives easier (and the performance much better!)
>
> -Todd
>
> On Mon, Oct 26, 2015 at 5:22 PM, Wes McKinney <[email protected]> wrote:
>
>> hi all,
>>
>> I am excited about this initiative, and I personally am looking forward to seeing a standard in-memory columnar representation made available to data science languages like Python, R, and Julia. It is also the ideal place to build out a reference vectorized Parquet implementation for use in those languages (the lack of Python/R Parquet support has been a sore spot for the data science ecosystem in recent times). This will also enable us to create an ecosystem of interoperable tools amongst SQL engines (Drill, Impala, ...), other compute systems (e.g. Spark), and columnar storage systems (e.g. Kudu, Parquet, etc.).
>>
>> Having richer in-memory columnar data structures alone will be a boon for the data science languages, which are also working to improve both in-memory analytics and out-of-core algorithms, and any distributed compute or storage system that can interoperate with these tools will benefit.
>>
>> thanks,
>> Wes
>>
>> On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <[email protected]> wrote:
>> >
>> > Drillers,
>> >
>> > A number of people have approached me recently about the possibility of collaborating on a shared columnar in-memory representation of data. This shared representation could be operated on efficiently by modern CPUs and shared efficiently via shared memory, IPC, and RPC, which would allow multiple applications to work together at high speed. Examples include moving data back and forth between a library and an application.
>> >
>> > As I was discussing these ideas with people working on projects including Calcite, Ibis, Kudu, Storm, Heron, and Parquet, and on products from companies like MapR and Trifacta, it became clear that much of what the Drill community has already constructed is very relevant to the goals of a new, broader interchange and execution format. (In fact, Ted and I actually informally discussed extracting this functionality as a library more than two years ago.)
>> >
>> > A standard will emerge around this need, and it is in the best interest of the Drill community and the broader ecosystem if Drill's ValueVectors concepts and code form the basis of a new library/collaboration/project. This means better interoperability, shared responsibility around maintenance and development, and the avoidance of further division of the ecosystem.
>> >
>> > A little background for some: Drill is the first project to create a powerful, language-agnostic in-memory representation of complex columnar data. We've learned a lot over the last three years about how to interface with these structures, manage the memory associated with them, adjust their sizes, expose them in builder patterns, etc. That work is useful for a number of systems, and it would be great if we could share the learning. By creating a new, well-documented and collaborative library, people could leverage this functionality in a wider range of applications and systems.
>> >
>> > I've seen the great success that libraries like Parquet and Calcite have been able to achieve due to their focus on APIs, extensibility and reusability, and I think we could do the same with the Drill ValueVector codebase. The fact that this would allow higher-speed interchange among many other systems and become the standard for in-memory columnar exchange (as opposed to having to adopt an external standard) makes this a great opportunity to both benefit the Drill community and give back to the broader Apache community.
>> >
>> > As such, I'd like to open a discussion about taking this path. There would be various avenues for how to do this, but my initial proposal would be that this become a new project that goes straight to a provisional TLP. We would then work to clean up layer responsibilities and extract pieces of the code into this new project, where we would collaborate with a wider group on a broader implementation (and a more formal specification).
>> >
>> > Given the conversations I have had and the excitement and need for this, I think we should do this. If the community is supportive, we could probably see some really cool integrations around things like high-speed Python machine learning inside Drill operators before the end of the year.
>> >
>> > I'll open a new JIRA and attach it here, where we can start a POC & discussion of how we could extract this code.
>> >
>> > Looking forward to feedback!
>> >
>> > Jacques
>> >
>> > --
>> > Jacques Nadeau
>> > CTO and Co-Founder, Dremio
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
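The ValueVectors idea the thread keeps returning to is concrete enough to sketch. The following is a rough illustration only (it is not the actual Drill ValueVector API; the class and method names are invented for this example): a minimal nullable fixed-width vector pairs a contiguous data buffer with a validity bitmap and is populated through a builder, which is the general shape of structure being proposed for extraction.

import java.nio.ByteBuffer;
import java.util.BitSet;

// Illustrative sketch of a nullable fixed-width "value vector", NOT Drill's API.
// Values live contiguously in one buffer so they can be scanned efficiently and,
// in principle, handed across process boundaries as raw bytes; a validity bitmap
// records which slots are non-null.
public final class IntVectorSketch {
    private final ByteBuffer data;   // 4 bytes per slot, contiguous
    private final BitSet validity;   // bit i set => slot i holds a value
    private final int valueCount;

    private IntVectorSketch(ByteBuffer data, BitSet validity, int valueCount) {
        this.data = data;
        this.validity = validity;
        this.valueCount = valueCount;
    }

    public boolean isNull(int index) { return !validity.get(index); }

    public int get(int index) { return data.getInt(index * 4); }

    public int getValueCount() { return valueCount; }

    // Builder pattern in the spirit of what the thread mentions. A real
    // implementation would also grow/reallocate the buffer ("adjust their
    // sizes"); this sketch keeps a fixed capacity for brevity.
    public static final class Builder {
        private final ByteBuffer data;
        private final BitSet validity = new BitSet();
        private int count = 0;

        public Builder(int capacity) {
            this.data = ByteBuffer.allocateDirect(capacity * 4);
        }

        public Builder append(int value) {
            data.putInt(count * 4, value);
            validity.set(count);
            count++;
            return this;
        }

        public Builder appendNull() {
            count++;          // slot stays zeroed and its validity bit stays clear
            return this;
        }

        public IntVectorSketch build() {
            return new IntVectorSketch(data, validity, count);
        }
    }

    public static void main(String[] args) {
        IntVectorSketch v = new Builder(4).append(7).appendNull().append(42).build();
        for (int i = 0; i < v.getValueCount(); i++) {
            System.out.println(v.isNull(i) ? "null" : Integer.toString(v.get(i)));
        }
    }
}

Drill's real vectors go well beyond this sketch (variable-width and nested complex types, off-heap memory management, resizing), but the buffer-plus-validity layout and the builder-style construction are the core of what the thread proposes to pull out into a shared library and specification.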
