Let me just quickly correct a couple of points, for clarity. The following module, https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py , is a proof of concept of oamap running directly on Arrow memory; this is the original reason I raised the topic here.
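To make the idea concrete, here is a minimal sketch of what that style of computation looks like. This is my own illustration, not oamap's actual API: it traverses an Arrow-style variable-length-list layout (an offsets buffer plus a flat values buffer, with hypothetical example data) using plain Python loops, which numba can JIT-compile so that no intermediate Python objects are materialized.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fall back to plain Python if numba is unavailable; oamap relies on
    # numba/llvmlite to get machine speed from the same loop.
    njit = lambda f: f

# Hypothetical nested data: 3 "events", each holding a variable-length list
# of particle pt values, stored Arrow-style as offsets + flat content.
offsets = np.array([0, 2, 2, 5], dtype=np.int64)  # events have 2, 0, 3 entries
pt = np.array([10.0, 20.0, 5.0, 7.0, 9.0])

@njit
def max_pt_per_event(offsets, pt):
    # Ordinary for-loops over the columnar buffers; numba compiles this
    # traversal down to native code, never building Python list objects.
    out = np.full(len(offsets) - 1, -1.0)  # -1.0 marks empty events
    for i in range(len(offsets) - 1):
        for j in range(offsets[i], offsets[i + 1]):
            if pt[j] > out[i]:
                out[i] = pt[j]
    return out
```

The point is that the user writes native Python syntax (nested for-loops, item lookup) against the flat Arrow buffers, and compilation removes the object-materialization cost that makes a pandas apply over nested data so slow.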
There are also POCs showing operation on numpy records and Parquet files. oamap was written with ROOT in mind, but is not necessarily tied to it; it just so happens that ROOT data tends to have exactly the kind of deeply nested structures that perform terribly in a pandas apply or other Python object-based processing. oamap depends on numba/llvmlite.

So indeed, maybe this all belongs as a conversation in pyarrow rather than Arrow, but insofar as it enables - or may in the future enable - machine-speed computation on in-memory nested Arrow data, I think oamap should be on everyone's radar as an interesting and useful project.

> On 25 Jun 2018, at 16:45, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Martin,
>
> These projects are very different. Many analytic databases feature
> code generation (recently a lot of these use LLVM -- see Hyper, Apache
> Impala, and others) on the hot paths for function evaluation (e.g. for
> evaluating the expressions in the SELECT part or the WHERE part) --
> the reason people are excited about Gandiva is that it makes this type
> of functionality available as a library running atop an open standard
> memory format (Arrow columnar), so can be used in any programming
> language assuming suitable bindings can be developed. This is very
> much in line with our vision for creating a "deconstructed database"
> (see a talk that Julien gave on this topic:
> https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database)
>
> I have not looked a great deal at oamap, but it does not use the Arrow
> columnar format AFAIK. It is written in Python and presumes some other
> technologies in use (like the ROOT format).
>
> So to summarize:
>
> Gandiva
> * Compiles analytical expressions to execute against the Arrow columnar format
> * Is written in C++ and can be embedded in other systems (Dremio is
>   using it from Java)
>
> oamap
> * Does not use the Arrow columnar format
> * Presumes other technologies in use (ROOT)
> * Is written in Python, and would be challenging to use as an embedded
>   system component
>
> I'm certain these projects can learn from each other -- I have spoken
> with Jim (one of the developers of oamap) in the past, so welcome
> further discussion here on the mailing list.
>
> Thanks,
> Wes
>
> On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
> <martin.dur...@utoronto.ca> wrote:
>> I am a little surprised by the very positive reception to Gandiva (which
>> doubtless is very useful - I know very little about it) versus when I
>> brought up the prospect of using oamap ( https://github.com/diana-hep/oamap )
>> on this mailing list.
>>
>> oamap uses numba to compile *Python* functions at run time and can walk
>> complex nested schemas down to leaf nodes in native Python syntax (for-loops
>> and attribute/item lookup) but at full machine speed, and without
>> materialising any objects along the way. It was written for the ROOT format,
>> but has implementations for simple types in Parquet and Arrow, which each do
>> the nested-list and dict things similarly but differently.
>>
>> Would someone care to explain the silence over oamap?
>>
>>> On 25 Jun 2018, at 02:06, Praveen Kumar <prav...@dremio.com> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I am Praveen, another engineer working on Gandiva. The interest and speed
>>> of engagement around this is great!! Excited to engage with you folks on
>>> this.
>>>
>>> Thx.
>>>
>>> On 2018/06/22 18:09:42, Julian Hyde <j...@apache.org> wrote:
>>>> This is exciting.
>>>> We have wanted to build an Arrow adapter in Calcite for
>>>> some time and have a prototype (see
>>>> https://issues.apache.org/jira/browse/CALCITE-2173),
>>>> but I hope that we can use Gandiva. I know that Gandiva has Java
>>>> bindings, but will these allow queries to be compiled and executed
>>>> from a pure Java process?
>>>>
>>>> Can you describe Gandiva's governance model? Without an open governance
>>>> model, companies that compete with Dremio may be wary about contributing.
>>>>
>>>> Can you compare and contrast your approach to Hyper [1]? Hyper is also
>>>> concerned with efficient use of the bus, and also uses LLVM, but it has a
>>>> different memory format and places much emphasis on lock-free data
>>>> structures.
>>>>
>>>> I just attended SIGMOD and there were interesting industry papers from
>>>> MemSQL [2][3] and Oracle RAPID [4]. I was impressed with some of the tricks
>>>> MemSQL uses to achieve SIMD parallelism on queries such as "select k4,
>>>> sum(x) from t group by k4" (where k4 has 4 values).
>>>>
>>>> I missed part of the RAPID talk, but I got the impression that they are
>>>> using disk-based algorithms (e.g. hybrid hash join) to handle data spread
>>>> between fast and slow memory.
>>>>
>>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would
>>>> be a good target for Gandiva also.
>>>> It is a table scan with a range filter
>>>> (returning 98% of rows), a low-cardinality aggregate (grouping by two
>>>> fields with 3 values each), and several aggregate functions, the arguments
>>>> of which contain common sub-expressions.
>>>>
>>>> SELECT
>>>>   l_returnflag,
>>>>   l_linestatus,
>>>>   sum(l_quantity),
>>>>   sum(l_extendedprice),
>>>>   sum(l_extendedprice * (1 - l_discount)),
>>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>>>>   avg(l_quantity),
>>>>   avg(l_extendedprice),
>>>>   avg(l_discount),
>>>>   count(*)
>>>> FROM lineitem
>>>> WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>>>> GROUP BY
>>>>   l_returnflag,
>>>>   l_linestatus
>>>> ORDER BY
>>>>   l_returnflag,
>>>>   l_linestatus;
>>>>
>>>> Julian
>>>>
>>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>>
>>>> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>>>>
>>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>>>>
>>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>>>>
>>>>> On Jun 22, 2018, at 7:22 AM, ravind...@gmail.com wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do believe
>>>>> that the combination of Arrow and LLVM for efficient expression
>>>>> evaluation is powerful, and has a broad range of use-cases.
>>>>> We've just
>>>>> started and hope to finesse and add a lot of functionality over the next
>>>>> few months.
>>>>>
>>>>> Welcome your feedback and participation in Gandiva!!
>>>>>
>>>>> thanks & regards,
>>>>> ravindra.
>>>>>
>>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>> Hey Guys,
>>>>>>
>>>>>> Dremio just open sourced a new framework for processing data in Arrow
>>>>>> data structures [1], built on top of the Apache Arrow C++ APIs and
>>>>>> leveraging LLVM (Apache licensed). It also includes Java APIs that
>>>>>> leverage the Apache Arrow Java libraries. I expect the developers who
>>>>>> have been working on this will introduce themselves soon. To read more
>>>>>> about it, take a look at Ravindra's blog post (he's the lead developer
>>>>>> driving this work): [2].
>>>>>> Hopefully people will find this interesting/useful.
>>>>>>
>>>>>> Let us know what you all think!
>>>>>>
>>>>>> thanks,
>>>>>> Jacques
>>>>>>
>>>>>> [1] https://github.com/dremio/gandiva
>>>>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>
>> —
>> Martin Durant
>> martin.dur...@utoronto.ca

—
Martin Durant
martin.dur...@utoronto.ca