Hi Weston,

Thanks a lot for the explanation. Indeed I am looking to get a better
understanding of what is on the roadmap of the Arrow project to avoid
doing something custom :) Your answer gave me a lot of clarity.


On Wed, Mar 30, 2022, 11:21 PM Weston Pace <[email protected]> wrote:

> Yes and no :)  Disclaimer: this answer is a little out of my
> wheelhouse; I've learned relational algebra by doing, so my
> formal theory may be off.  Anyone is welcome to correct me.  Also,
> this answer turned into a bit of a ramble and is a bit scattershot.  You
> may already be very familiar with some of these concepts.  I'm not
> trying to be patronizing, but I do tend to ramble.
>
> TL;DR: Pyarrow has some "relational algebra interfaces" today but not
> a lot of "dataframe interfaces".  What you're asking for is a little
> bit more of a "dataframe" type question and you will probably need to
> go beyond pyarrow to get exactly what you are asking for.  That being
> said, pyarrow has a lot of the primitives that can solve parts and
> pieces of your problem.
>
> In relational algebra there is a special class of functions called
> "scalar functions".  These are functions that create a single value
> for each row.  For example, "less than", "addition", and "to upper
> case".  A scalar expression is then an expression that consists only
> of scalar functions.  Projections in dataset scanners can only contain
> scalar expressions.  In pyarrow you can scan an in-memory table and
> apply a scalar expression:
>
> ```
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
> tab = pa.Table.from_arrays([arr], names=["x"])
> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
> # This is a little bit unwieldy; we could maybe investigate a
> # better API for this kind of thing
> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>                         columns={'y': expr}).to_table()
> # pyarrow.Table
> # y: bool
> # ----
> # y: [[false,false,false,false,false,true,true,false,false,false]]
> ```
>
> However, the functions that you are listing (sum, avg) are not scalar
> functions.  They do, however, fit a special class of functions known
> as "aggregates" (functions that consume values from multiple rows to
> create a single output).  Aggregates in relational algebra are used in
> a "group by" node.  In pyarrow 7 we added the ability to run group_by
> functions very easily on in-memory tables but it looks like it
> requires at least one key column so that isn't going to help us here.
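> For reference, here is roughly what the table group_by added in pyarrow 7
> looks like when you *do* have a key column (a minimal sketch; the exact
> output column order can vary between versions, but the aggregate column
> is named "x_sum"):
>
> ```
> import pyarrow as pa
>
> # A small table with a key column so group_by has something to group on
> tab2 = pa.table({"key": ["a", "a", "b"], "x": [1, 2, 3]})
>
> # Sum "x" within each group of "key"
> grouped = tab2.group_by("key").aggregate([("x", "sum")])
> print(grouped.to_pydict())
> # "a" -> 1 + 2 = 3, "b" -> 3
> ```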
>
> Both of these features are powered by the pyarrow compute functions.
> These are slightly lower level compute primitives.  We can use these
> here to get some of the values you want:
>
> ```
> # Continuing from the example above
> import pyarrow.compute as pc
> pc.sum(tab.column("x"))
> # <pyarrow.Int64Scalar: 55>
> pc.mean(tab.column("x"))
> # <pyarrow.DoubleScalar: 5.5>
> ```
>
> But...while this does give you the capability to run functions, this
> doesn't give you the capability to run "expressions".
>
> Running expressions that contain both aggregate and scalar functions
> is a little trickier than it may seem.  This is often done by creating
> a relational algebra query from the expression.  For example, consider
> the expression `sum(a * b)/sum(a)`.
>
> We can create the query (this is totally shorthand pseudocode, the
> Substrait text format isn't ready yet)
>
> SCAN_TABLE(table) ->
>   PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
>   GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
>   PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"),field("sum(a)"))})
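>
> (For the flat, no-key case, you can also chain the compute primitives
> directly in Python instead of building a query plan; a rough sketch of
> the same `sum(a * b)/sum(a)` expression, with made-up sample data:)
>
> ```
> import pyarrow as pa
> import pyarrow.compute as pc
>
> tab = pa.table({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
> # Scalar step: a * b, evaluated row by row
> ab = pc.multiply(tab.column("a"), tab.column("b"))
> # Aggregate steps: sum(a*b) and sum(a), then one final scalar divide
> result = pc.divide(pc.sum(ab), pc.sum(tab.column("a")))
> # (4 + 10 + 18) / (1 + 2 + 3) = 32 / 6
> ```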
>
> So if you wanted to create such a query plan then you could express it
> in Substrait, which is a spec for expressing query plans, and use our
> "very new and not quite ready yet" Substrait consumer API to process
> that query plan.  So if your goal is purely "how do I express a series
> of compute operations as a string" then I will point out that SQL is a
> very standard answer for that question and the Substrait text format
> will be a new way to answer that question.
>
> ---
>
> Ok, now everything I've written has been exclusive to pyarrow.  If I
> step back and look at your original question "how can I evaluate a
> dataframe-like expression against an Arrow table" I think the answer
> is going to lie outside of pyarrow (some cool functions like table
> group by are being added but I'm not aware of any developers whose
> goal is to make a full fledged dataframe API inside of pyarrow).
> Fortunately, the whole point of Arrow is to make it super easy for
> libraries to add new functionality to your data.  There may be some
> library out there that does this already, there are libraries out
> there that solve similar problems like "how do I run this SQL query
> against Arrow data", and there are probably libraries in development
> to tackle this.  Unfortunately, I'm not knowledgeable or qualified
> enough to speak on any of these.
>
> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to evaluate mathematical expressions against columns of a
> pyarrow table which is in memory already similar to how projections work
> for dataset scanners?
> >
> > The goal is to specify a bunch of strings like sum(a * b)/sum(a),
> or avg(a[:10]) etc. Convert these into expressions and run against the
> table.
> >
> > Thanks
> >
> >
>
