Thanks Ian for the detailed feedback. Ibis + Substrait is exactly what I was looking for. I can continue building on the existing Ibis backends to make a custom engine for my use case with Arrow for now, as the Substrait framework develops.
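For example, the kind of expression I originally asked about can be written with Ibis's pandas-like API. A minimal sketch with an unbound table (the ibis-substrait compile step is how I understand it from the links below, not something I have verified):

```
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# unbound table purely for illustration; names are placeholders
t = ibis.table([("a", "int64"), ("b", "int64")], name="t")

# sum(a * b) / sum(a), expressed as a one-row aggregation
expr = t.aggregate(result=(t.a * t.b).sum() / t.a.sum())

# compile the Ibis expression into a Substrait plan that a
# Substrait-consuming engine (such as Arrow's) could execute
plan = SubstraitCompiler().compile(expr)
```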
Hi Matthew, I tried polars and it failed at the first step, creating a pyDataFrame from a Parquet file (the pyarrow library has no issues reading it). I gave up on it since I am not very familiar with Rust, which is also the reason I didn't choose the DataFusion route; I am a bit more familiar with C++. Thanks all for your amazing work and helpful feedback. Hopefully someday I can contribute to this wonderful project.

On Thu, Mar 31, 2022, 10:09 AM Matthew Peters <[email protected]> wrote:

I have been playing around with polars for a few weeks. It has an expression system that is different from pandas, but it feels really clean, and it is built on Arrow in Rust. If you need something now, this would be a good option. It does everything mentioned here, adds a lazy evaluation system, and is very fast.

pola-rs/polars: Fast multi-threaded DataFrame library in Rust | Python | Node.js: https://github.com/pola-rs/polars

I just thought I would mention it.
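For example, the expression from the original question looks roughly like this in polars (a quick sketch with made-up data, showing both the eager and lazy forms):

```
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [2, 2, 2, 2]})

# eager: evaluate sum(a * b) / sum(a) directly
df.select(((pl.col("a") * pl.col("b")).sum() / pl.col("a").sum()).alias("result"))

# lazy: build a query plan first, then execute it with collect()
df.lazy().select(
    ((pl.col("a") * pl.col("b")).sum() / pl.col("a").sum()).alias("result")
).collect()
```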
Thanks,
Matt

On Thu, Mar 31, 2022 at 8:54 AM Ian Cook <[email protected]> wrote:

There is work ongoing to implement a pandas-like interface to the Arrow C++ compute engine using Ibis [1] and Substrait [2].

The gist of this is: Python users write Ibis code; the ibis-substrait submodule of Ibis [3] compiles this into a Substrait plan; and the Arrow C++ compute engine parses the Substrait plan and executes it, returning an Arrow table.

There is a recent Voltron Data blog post co-authored by Phillip Cloud (the lead developer of Ibis) and Jacques Nadeau (the originator of Substrait) with some additional details [4].

Since Substrait is an open standard, this can all be implemented without introducing tight coupling between the interface and the engine, at least in theory. Other Substrait-compatible engines will be able to consume and execute Substrait plans, and other Substrait-compatible interfaces will be able to produce Substrait plans. There is already some work ongoing to support Substrait in other interfaces and engines, including the Arrow R interface [5], the DataFusion engine [6], and the DuckDB engine [7].

I am not sure when the Ibis-Substrait-Arrow stack is expected to reach milestones of "real-world usability." I will leave it to others to comment on that.

Thanks,
Ian

[1] https://ibis-project.org
[2] https://substrait.io
[3] https://github.com/ibis-project/ibis-substrait
[4] https://voltrondata.com/news/introducing-substrait-an-interoperable-data-to-engine-connector/
[5] https://github.com/voltrondata/substrait-r
[6] https://github.com/datafusion-contrib/datafusion-substrait
[7] https://github.com/duckdblabs/duckdb-substrait-demo

On Thu, Mar 31, 2022 at 7:25 AM Suresh V <[email protected]> wrote:

Thanks Aldrin for the response. My goal is to have something like a numexpr/pandas equivalent where I can simply pass strings to get the results, and to ask whether the Arrow gods have any plans to create something like that in the near future :)

On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:

I think everything Weston said is totally good.

I just wanted to add that there may be a couple of "simpler" options available without using SQL-like approaches:
(1) just use the compute API (at least for expressions such as "sum(...) / sum(...)"), as in the short sketch after the example below
(2) when converting strings to expressions, do it in 2 passes: (a) put expressions supported by the dataset API into projections and scans on the source table, then (b) use the compute API for any other expressions and apply them to the final relation from (a). So, for one of the given examples, "sum(a * b) / sum(a)", this ends up being 2 passes of applying expressions: once via the dataset API and once via the compute API.

Note that (1) seems simple in my naive opinion, but (2) may actually be quite complicated. How much extra work it is probably depends on how you were planning on converting strings to expressions anyway. And note that you can do many iterations of (2) to use aggregates in filters or projections, etc.

----

The expression:

sum(a * b) / sum(a)

would be:

```
import pyarrow
from pyarrow import dataset
from pyarrow import compute

tab = pyarrow.Table.from_arrays(
    [
        pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        ,pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
    ]
    ,names=['a', 'b']
)

# some chain of scans and projections via the dataset API
relational_result = dataset.Scanner.from_batches(
    tab.to_batches()
    ,schema=tab.schema
    ,columns={
        'a': dataset.field('a')
        ,'product': dataset.field('a') * dataset.field('b')
    }
).to_table()

# a single "expression" that uses the compute API and the final
# relation from the previous "phase"; the division goes through the
# compute API too, since Arrow scalars don't define Python operators
expr_result = compute.divide(
    compute.sum(relational_result.column('product'))
    ,compute.sum(relational_result.column('a'))
)
```
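And for (1), the compute API alone handles this particular expression; a minimal sketch with the same made-up data:

```
import pyarrow as pa
import pyarrow.compute as pc

tab = pa.table({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                "b": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]})

# sum(a * b) / sum(a), written entirely with compute functions
result = pc.divide(
    pc.sum(pc.multiply(tab["a"], tab["b"])),
    pc.sum(tab["a"]),
)
# result is <pyarrow.Int64Scalar: 2>
```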
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:

Yes and no :) Disclaimer: this answer is a little out of my wheelhouse, as I've learned relational algebra through doing, so my formal theory may be off. Anyone is welcome to correct me. Also, this answer turned into a bit of a ramble and is a bit scattershot. You may already be very familiar with some of these concepts; I'm not trying to be patronizing, but I do tend to ramble.

TL;DR: pyarrow has some "relational algebra interfaces" today but not a lot of "dataframe interfaces". What you're asking for is a little bit more of a "dataframe" type question, and you will probably need to go beyond pyarrow to get exactly what you are asking for. That being said, pyarrow has a lot of the primitives that can solve parts and pieces of your problem.

In relational algebra there is a special class of functions called "scalar functions". These are functions that create a single value for each row, for example "less than", "addition", and "to upper case". A scalar expression is then an expression that consists only of scalar functions. Projections in dataset scanners can only contain scalar expressions. In pyarrow you can scan an in-memory table and apply a scalar expression:

```
import pyarrow as pa
import pyarrow.dataset as ds

arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
tab = pa.Table.from_arrays([arr], names=["x"])
expr = (ds.field("x") > 5) & (ds.field("x") < 8)
# This is a little bit unwieldy; we could maybe investigate a better
# API for this kind of thing
ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
                        columns={'y': expr}).to_table()
# pyarrow.Table
# y: bool
# ----
# y: [[false,false,false,false,false,true,true,false,false,false]]
```

However, the functions that you are listing (sum, avg) are not scalar functions. They do, however, fit a special class of functions known as "aggregates" (functions that consume values from multiple rows to create a single output). Aggregates in relational algebra are used in a "group by" node. In pyarrow 7 we added the ability to run group_by functions very easily on in-memory tables, but it looks like it requires at least one key column, so that isn't going to help us here (though see the workaround sketched below).

Both of these features are powered by the pyarrow compute functions. These are slightly lower-level compute primitives. We can use them here to get some of the values you want:

```
# Continuing from the above example
import pyarrow.compute as pc

pc.sum(tab.column("x"))
# <pyarrow.Int64Scalar: 55>
pc.mean(tab.column("x"))
# <pyarrow.DoubleScalar: 5.5>
```
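As an aside, one possible workaround for the key-column requirement (a sketch, not something I've tested thoroughly) is to append a constant dummy key so the whole table lands in a single group:

```
import pyarrow as pa

tab = pa.table({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# append a constant dummy key so every row falls into one group
keyed = tab.append_column("k", pa.array([0] * len(tab)))
keyed.group_by("k").aggregate([("x", "sum")])
# one row, with an "x_sum" column containing 55
```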
But while the compute functions do give you the capability to run functions, they don't give you the capability to run "expressions".

Running expressions that contain both aggregate and scalar functions is a little trickier than it may seem. This is often done by creating a relational algebra query from the expression. For example, consider the expression `sum(a * b) / sum(a)`.

We can create the query (this is totally shorthand pseudocode; the Substrait text format isn't ready yet):

SCAN_TABLE(table) ->
PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})

So if you wanted to create such a query plan, then you could express it in Substrait, which is a spec for expressing query plans, and use our "very new and not quite ready yet" Substrait consumer API to process that query plan. And if your goal is purely "how do I express a series of compute operations as a string", then I will point out that SQL is a very standard answer to that question, and the Substrait text format will be a new way to answer it.

---

Ok, now everything I've written has been exclusive to pyarrow. If I step back and look at your original question, "how can I evaluate a dataframe-like expression against an Arrow table", I think the answer is going to lie outside of pyarrow (some cool functions like table group by are being added, but I'm not aware of any developers whose goal is to make a full-fledged dataframe API inside of pyarrow). Fortunately, the whole point of Arrow is to make it super easy for libraries to add new functionality to your data. There may be some library out there that does this already, there are libraries out there that solve similar problems like "how do I run this SQL query against Arrow data", and there are probably libraries in development to tackle this. Unfortunately, I'm not knowledgeable nor qualified enough to speak on any of these.

On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:

Hi,

Is there a way to evaluate mathematical expressions against the columns of a pyarrow table that is already in memory, similar to how projections work for dataset scanners?

The goal is to be able to specify a bunch of strings like sum(a * b) / sum(a), or avg(a[:10]), etc., convert these into expressions, and run them against the table.

Thanks
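For completeness, the "pass a string" part of the original question can be prototyped without any of the machinery above. A hypothetical sketch (not a pyarrow API) that wires Python's eval() to pyarrow.compute through a tiny operator-overloading wrapper:

```
import pyarrow as pa
import pyarrow.compute as pc

class Col:
    # tiny wrapper so Python operators inside eval() delegate to pyarrow.compute
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        return Col(pc.multiply(self.data, _unwrap(other)))

    def __truediv__(self, other):
        return Col(pc.divide(self.data, _unwrap(other)))

def _unwrap(value):
    return value.data if isinstance(value, Col) else value

def eval_expr(text, table):
    # expose each column, plus a couple of aggregate functions, to eval()
    env = {name: Col(table[name]) for name in table.column_names}
    env["sum"] = lambda c: Col(pc.sum(c.data))
    env["avg"] = lambda c: Col(pc.mean(c.data))
    return eval(text, {"__builtins__": {}}, env).data

tab = pa.table({"a": [1, 2, 3, 4], "b": [2, 2, 2, 2]})
print(eval_expr("sum(a * b) / sum(a)", tab))  # <pyarrow.Int64Scalar: 2>
```

A real implementation would want a proper parser and a restricted function whitelist rather than eval(), but this shows how far the compute primitives alone can take the "strings in, results out" goal.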
