Let me just quickly correct a couple of points, for clarity. The following module, https://github.com/diana-hep/oamap/blob/master/oamap/backend/arrow.py , is a proof of concept of oamap running directly on Arrow memory; this is the original reason I raised the topic here.
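To make the idea concrete, here is a minimal sketch of what that style of computation looks like. This is my own illustration, not oamap's actual API: it traverses an Arrow-style variable-length-list layout (an offsets buffer plus a flat values buffer, with hypothetical example data) using plain Python loops, which numba can JIT-compile so that no intermediate Python objects are materialized.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fall back to plain Python if numba is unavailable; oamap relies on
    # numba/llvmlite to get machine speed from the same loop.
    njit = lambda f: f

# Hypothetical nested data: 3 "events", each holding a variable-length list
# of particle pt values, stored Arrow-style as offsets + flat content.
offsets = np.array([0, 2, 2, 5], dtype=np.int64)  # events have 2, 0, 3 entries
pt = np.array([10.0, 20.0, 5.0, 7.0, 9.0])

@njit
def max_pt_per_event(offsets, pt):
    # Ordinary for-loops over the columnar buffers; numba compiles this
    # traversal down to native code, never building Python list objects.
    out = np.full(len(offsets) - 1, -1.0)  # -1.0 marks empty events
    for i in range(len(offsets) - 1):
        for j in range(offsets[i], offsets[i + 1]):
            if pt[j] > out[i]:
                out[i] = pt[j]
    return out
```

The point is that the user writes native Python syntax (nested for-loops, item lookup) against the flat Arrow buffers, and compilation removes the object-materialization cost that makes a pandas apply over nested data so slow.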
There are also POCs showing operation on numpy records and Parquet files. oamap was written with ROOT in mind, but is not necessarily tied to it; it just so happens that ROOT data tends to have exactly the kind of deeply nested structures that perform terribly in a pandas apply or other Python object-based processing. oamap depends on numba/llvmlite.

So indeed, maybe this all belongs as a conversation in pyarrow rather than Arrow, but insofar as it enables - or may in the future enable - machine-speed computation on in-memory nested Arrow data, I think oamap should be on everyone's radar as an interesting and useful project.

> On 25 Jun 2018, at 16:45, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Martin,
>
> These projects are very different. Many analytic databases feature
> code generation (recently a lot of these use LLVM -- see Hyper, Apache
> Impala, and others) on the hot paths for function evaluation (e.g. for
> evaluating the expressions in the SELECT part or the WHERE part) --
> the reason people are excited about Gandiva is that it makes this type
> of functionality available as a library running atop an open standard
> memory format (Arrow columnar), so can be used in any programming
> language assuming suitable bindings can be developed. This is very
> much in line with our vision for creating a "deconstructed database"
> (see a talk that Julien gave on this topic:
> https://www.slideshare.net/julienledem/from-flat-files-to-deconstructed-database)
>
> I have not looked a great deal at oamap, but it does not use the Arrow
> columnar format AFAIK. It is written in Python and presumes some other
> technologies in use (like the ROOT format).
>
> So to summarize:
>
> Gandiva
> * Compiles analytical expressions to execute against the Arrow columnar format
> * Is written in C++ and can be embedded in other systems (Dremio is
>   using it from Java)
>
> oamap
> * Does not use the Arrow columnar format
> * Presumes other technologies in use (ROOT)
> * Is written in Python, and would be challenging to use as an embedded
>   system component
>
> I'm certain these projects can learn from each other -- I have spoken
> with Jim (one of the developers of oamap) in the past, so welcome
> further discussion here on the mailing list.
>
> Thanks,
> Wes
>
> On Mon, Jun 25, 2018 at 1:27 PM, Martin Durant
> <martin.dur...@utoronto.ca> wrote:
>> I am a little surprised by the very positive reception to Gandiva (which
>> doubtless is very useful - I know very little about it) versus when I
>> brought up the prospect of using oamap ( https://github.com/diana-hep/oamap )
>> on this mailing list.
>>
>> oamap uses numba to compile *Python* functions at run time and can walk
>> complex nested schemas down to leaf nodes in native Python syntax (for-loops
>> and attribute/item lookup) but at full machine speed, and without
>> materialising any objects along the way. It was written for the ROOT format,
>> but has implementations for simple types in Parquet and Arrow, which each do
>> the nested-list and dict things similarly but differently.
>>
>> Would someone care to explain the silence over oamap?
>>
>>> On 25 Jun 2018, at 02:06, Praveen Kumar <prav...@dremio.com> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I am Praveen, another engineer working on Gandiva. The interest and speed
>>> of engagement around this is great!! Excited to engage with you folks on
>>> this.
>>>
>>> Thx.
>>>
>>> On 2018/06/22 18:09:42, Julian Hyde <j...@apache.org> wrote:
>>>> This is exciting.
>>>> We have wanted to build an Arrow adapter in Calcite for
>>>> some time and have a prototype (see
>>>> https://issues.apache.org/jira/browse/CALCITE-2173),
>>>> but I hope that we can use Gandiva. I know that Gandiva has Java
>>>> bindings, but will these allow queries to be compiled and executed
>>>> from a pure Java process?
>>>>
>>>> Can you describe Gandiva's governance model? Without an open governance
>>>> model, companies that compete with Dremio may be wary about contributing.
>>>>
>>>> Can you compare and contrast your approach to Hyper [1]? Hyper is also
>>>> concerned with efficient use of the bus, and also uses LLVM, but it has a
>>>> different memory format and places much emphasis on lock-free data
>>>> structures.
>>>>
>>>> I just attended SIGMOD and there were interesting industry papers from
>>>> MemSQL [2][3] and Oracle RAPID [4]. I was impressed with some of the tricks
>>>> MemSQL uses to achieve SIMD parallelism on queries such as "select k4,
>>>> sum(x) from t group by k4" (where k4 has 4 values).
>>>>
>>>> I missed part of the RAPID talk, but I got the impression that they are
>>>> using disk-based algorithms (e.g. hybrid hash join) to handle data spread
>>>> between fast and slow memory.
>>>>
>>>> MemSQL uses TPC-H query 1 as a motivating benchmark and I think this would
>>>> be a good target for Gandiva also.
>>>> It is a table scan with a range filter
>>>> (returning 98% of rows), a low-cardinality aggregate (grouping by two
>>>> fields with 3 values each), and several aggregate functions, the arguments
>>>> of which contain common sub-expressions.
>>>>
>>>> SELECT
>>>>   l_returnflag,
>>>>   l_linestatus,
>>>>   sum(l_quantity),
>>>>   sum(l_extendedprice),
>>>>   sum(l_extendedprice * (1 - l_discount)),
>>>>   sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
>>>>   avg(l_quantity),
>>>>   avg(l_extendedprice),
>>>>   avg(l_discount),
>>>>   count(*)
>>>> FROM lineitem
>>>> WHERE l_shipdate <= date '1998-12-01' - interval '90' day
>>>> GROUP BY
>>>>   l_returnflag,
>>>>   l_linestatus
>>>> ORDER BY
>>>>   l_returnflag,
>>>>   l_linestatus;
>>>>
>>>> Julian
>>>>
>>>> [1] http://www.vldb.org/pvldb/vol4/p539-neumann.pdf
>>>>
>>>> [2] http://blog.memsql.com/how-careful-engineering-lead-to-processing-over-a-trillion-rows-per-second/
>>>>
>>>> [3] https://dl.acm.org/citation.cfm?id=3183713.3190658
>>>>
>>>> [4] https://dl.acm.org/citation.cfm?id=3183713.3190655
>>>>
>>>>> On Jun 22, 2018, at 7:22 AM, ravind...@gmail.com wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I'm Ravindra and I'm a developer on the Gandiva project. I do believe
>>>>> that the combination of Arrow and LLVM for efficient expression
>>>>> evaluation is powerful, and has a broad range of use-cases.
>>>>> We've just
>>>>> started and hope to finesse and add a lot of functionality over the next
>>>>> few months.
>>>>>
>>>>> Welcome your feedback and participation in Gandiva!!
>>>>>
>>>>> thanks & regards,
>>>>> ravindra.
>>>>>
>>>>> On 2018/06/21 19:15:20, Jacques Nadeau <ja...@apache.org> wrote:
>>>>>> Hey Guys,
>>>>>>
>>>>>> Dremio just open sourced a new framework for processing data in Arrow
>>>>>> data structures [1], built on top of the Apache Arrow C++ APIs and
>>>>>> leveraging LLVM (Apache licensed). It also includes Java APIs that
>>>>>> leverage the Apache Arrow Java libraries. I expect the developers who
>>>>>> have been working on this will introduce themselves soon. To read more
>>>>>> about it, take a look at Ravindra's blog post (he's the lead developer
>>>>>> driving this work): [2].
>>>>>> Hopefully people will find this interesting/useful.
>>>>>>
>>>>>> Let us know what you all think!
>>>>>>
>>>>>> thanks,
>>>>>> Jacques
>>>>>>
>>>>>> [1] https://github.com/dremio/gandiva
>>>>>> [2] https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/
>>
>> —
>> Martin Durant
>> martin.dur...@utoronto.ca

—
Martin Durant
martin.dur...@utoronto.ca