I think everything Weston said is spot on.
I just wanted to add that there may be a couple of "simpler" options
available that don't require SQL-like approaches:
(1) just use the compute API (at least for examples such as "sum(...) /
sum(...)")
(2) when converting strings to expressions, do it in 2 passes: (a) put
expressions supported by the dataset API into projections and scans on the
source table, then (b) use the compute API for any other expressions and
apply them to the final relation from (a). So, for one of the given
examples, "sum(a*b) / sum(a)", this ends up being 2 passes of applying
expressions: once via the dataset API, and once via the compute API.
Note that (1) seems simple in my naive opinion, but (2) may actually be
quite complicated. How much extra work it is probably depends on how you
were planning to convert strings to expressions anyway. And note that you
can run many iterations of (2) to use aggregates inside filters or
projections, etc.
----
The expression:
sum(a * b) / sum(a)
would be:
```
import pyarrow
from pyarrow import dataset
from pyarrow import compute

tab = pyarrow.Table.from_arrays(
    [
        pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]),
        pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2]),
    ],
    names=['a', 'b'],
)

# (a) some chain of scans and projections via the dataset API
relational_result = dataset.Scanner.from_batches(
    tab.to_batches(),
    schema=tab.schema,
    columns={
        'a': dataset.field('a'),
        'product': dataset.field('a') * dataset.field('b'),
    },
).to_table()

# (b) a single "expression" that uses the compute API and the final
# relation from the previous "phase" (compute.divide, since pyarrow
# scalars don't support the / operator directly)
expr_result = compute.divide(
    compute.sum(relational_result.column('product')),
    compute.sum(relational_result.column('a')),
)
```
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz
On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:
> Yes and no :) Disclaimer: this answer is a little out of my
> wheelhouse as I've learned relational algebra through doing, so my
> formal theory may be off. Anyone is welcome to correct me. Also,
> this answer turned into a bit of a ramble and is a bit scattershot. You
> may already be very familiar with some of these concepts. I'm not
> trying to be patronizing but I do tend to ramble.
>
> TL;DR: Pyarrow has some "relational algebra interfaces" today but not
> a lot of "dataframe interfaces". What you're asking for is a little
> bit more of a "dataframe" type question and you will probably need to
> go beyond pyarrow to get exactly what you are asking for. That being
> said, pyarrow has a lot of the primitives that can solve parts and
> pieces of your problem.
>
> In relational algebra there is a special class of functions called
> "scalar functions". These are functions that create a single value
> for each row. For example, "less than", "addition", and "to upper
> case". A scalar expression is then an expression that consists only
> of scalar functions. Projections in dataset scanners can only contain
> scalar expressions. In pyarrow you can scan an in-memory table and
> apply a scalar expression:
>
> ```
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
> tab = pa.Table.from_arrays([arr], names=["x"])
> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
> # This is a little bit unwieldy, we could maybe investigate a better
> # API for this kind of thing
> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>                         columns={'y': expr}).to_table()
> # pyarrow.Table
> # y: bool
> # ----
> # y: [[false,false,false,false,false,true,true,false,false,false]]
> ```
>
> However, the functions that you are listing (sum, avg) are not scalar
> functions. They do, however, fit a special class of functions known
> as "aggregates" (functions that consume values from multiple rows to
> create a single output). Aggregates in relational algebra are used in
> a "group by" node. In pyarrow 7 we added the ability to run group_by
> functions very easily on in-memory tables but it looks like it
> requires at least one key column so that isn't going to help us here.
>
> Both of these features are powered by the pyarrow compute functions.
> These are slightly lower level compute primitives. We can use these
> here to get some of the values you want:
>
> ```
> # Continuing from above example
> import pyarrow.compute as pc
> pc.sum(tab.column("x"))
> # <pyarrow.Int64Scalar: 55>
> pc.mean(tab.column("x"))
> # <pyarrow.DoubleScalar: 5.5>
> ```
>
> But...while this does give you the capability to run functions, this
> doesn't give you the capability to run "expressions".
>
> Running expressions that contain both aggregate and scalar functions
> is a little trickier than it may seem. This is often done by creating
> a relational algebra query from the expression. For example, consider
> the expression `sum(a * b)/sum(a)`.
>
> We can create the query (this is totally shorthand pseudocode, the
> Substrait text format isn't ready yet)
>
> SCAN_TABLE(table) ->
> PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
> GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
> PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"),field("sum(a)"))})
>
> So if you wanted to create such a query plan then you could express it
> in Substrait, which is a spec for expressing query plans, and use our
> "very new and not quite ready yet" Substrait consumer API to process
> that query plan. If your goal is purely "how do I express a series
> of compute operations as a string" then I will point out that SQL is a
> very standard answer for that question and the Substrait text format
> will be a new way to answer it.
>
> ---
>
> Ok, now everything I've written has been exclusive to pyarrow. If I
> step back and look at your original question "how can I evaluate a
> dataframe-like expression against an Arrow table" I think the answer
> is going to lie outside of pyarrow (some cool functions like table
> group by are being added but I'm not aware of any developers whose
> goal is to make a full fledged dataframe API inside of pyarrow).
> Fortunately, the whole point of Arrow is to make it super easy for
> libraries to add new functionality to your data. There may be some
> library out there that does this already, there are libraries out
> there that solve similar problems like "how do I run this SQL query
> against Arrow data", and there are probably libraries in development
> to tackle this. Unfortunately, I'm not knowledgeable nor qualified
> enough to speak on any of these.
>
> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to evaluate mathematical expressions against columns of a
> > pyarrow table that is already in memory, similar to how projections work
> > for dataset scanners?
> >
> > The goal is to specify a bunch of strings like sum(a * b)/sum(a)
> > or avg(a[:10]) etc., convert these into expressions, and run them
> > against the table.
> >
> > Thanks
> >
> >
>