Hi Weston,

Thanks a lot for the explanation. Indeed, I am looking to get a better understanding of what is on the roadmap of the Arrow project, to avoid building something custom :) Your answer gave me a lot of clarity.
On Wed, Mar 30, 2022, 11:21 PM Weston Pace <[email protected]> wrote:
> Yes and no :) Disclaimer: this answer is a little out of my
> wheelhouse as I've learned relational algebra through doing, and so my
> formal theory may be off. Anyone is welcome to correct me. Also,
> this answer turned into a bit of a ramble and is a bit scattershot. You
> may already be very familiar with some of these concepts. I'm not
> trying to be patronizing, but I do tend to ramble.
>
> TL;DR: Pyarrow has some "relational algebra interfaces" today but not
> a lot of "dataframe interfaces". What you're asking for is a little
> bit more of a "dataframe" type question, and you will probably need to
> go beyond pyarrow to get exactly what you are asking for. That being
> said, pyarrow has a lot of the primitives that can solve parts and
> pieces of your problem.
>
> In relational algebra there is a special class of functions called
> "scalar functions". These are functions that create a single value
> for each row. For example, "less than", "addition", and "to upper
> case". A scalar expression is then an expression that consists only
> of scalar functions. Projections in dataset scanners can only contain
> scalar expressions. In pyarrow you can scan an in-memory table and
> apply a scalar expression:
>
> ```
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
> tab = pa.Table.from_arrays([arr], names=["x"])
> expr = (ds.field("x") > 5) & (ds.field("x") < 8)
> # This is a little bit unwieldy; we could maybe investigate a better
> # API for this kind of thing
> ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
>                         columns={'y': expr}).to_table()
> # pyarrow.Table
> # y: bool
> # ----
> # y: [[false,false,false,false,false,true,true,false,false,false]]
> ```
>
> However, the functions that you are listing (sum, avg) are not scalar
> functions.
> They do, however, fit a special class of functions known
> as "aggregates" (functions that consume values from multiple rows to
> create a single output). Aggregates in relational algebra are used in
> a "group by" node. In pyarrow 7 we added the ability to run group_by
> functions very easily on in-memory tables, but it looks like it
> requires at least one key column, so that isn't going to help us here.
>
> Both of these features are powered by the pyarrow compute functions.
> These are slightly lower-level compute primitives. We can use them
> here to get some of the values you want:
>
> ```
> # Continuing from the above example
> import pyarrow.compute as pc
>
> pc.sum(tab.column("x"))
> # <pyarrow.Int64Scalar: 55>
> pc.mean(tab.column("x"))
> # <pyarrow.DoubleScalar: 5.5>
> ```
>
> But...while this does give you the capability to run functions, it
> doesn't give you the capability to run "expressions".
>
> Running expressions that contain both aggregate and scalar functions
> is a little trickier than it may seem. This is often done by creating
> a relational algebra query from the expression. For example, consider
> the expression `sum(a * b)/sum(a)`.
>
> We can create the query (this is totally shorthand pseudocode; the
> Substrait text format isn't ready yet):
>
> SCAN_TABLE(table) ->
> PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
> GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
> PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})
>
> So if you wanted to create such a query plan then you could express it
> in Substrait, which is a spec for expressing query plans, and use our
> "very new and not quite ready yet" Substrait consumer API to process
> that query plan. So if your goal is purely "how do I express a series
> of compute operations as a string" then I will point out that SQL is a
> very standard answer to that question, and the Substrait text format
> will be a new way to answer it.
>
> ---
>
> Ok, now everything I've written has been exclusive to pyarrow. If I
> step back and look at your original question, "how can I evaluate a
> dataframe-like expression against an Arrow table", I think the answer
> is going to lie outside of pyarrow (some cool functions like table
> group_by are being added, but I'm not aware of any developers whose
> goal is to make a full-fledged dataframe API inside of pyarrow).
> Fortunately, the whole point of Arrow is to make it super easy for
> libraries to add new functionality to your data. There may be some
> library out there that does this already, there are libraries out
> there that solve similar problems like "how do I run this SQL query
> against Arrow data", and there are probably libraries in development
> to tackle this. Unfortunately, I'm not knowledgeable nor qualified
> enough to speak on any of these.
>
> On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there a way to evaluate mathematical expressions against the columns
> > of a pyarrow table that is already in memory, similar to how projections
> > work for dataset scanners?
> >
> > The goal is to specify a bunch of strings like sum(a * b)/sum(a)
> > or avg(a[:10]), convert these into expressions, and run them against
> > the table.
> >
> > Thanks
