Thanks Ian for the detailed feedback. Ibis + Substrait is exactly what I was looking for. I can continue building on the existing Ibis backends to make a custom engine for my use case with Arrow for now, as the Substrait framework develops.
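For example, the kind of expression I originally asked about can be written with Ibis's pandas-like API. A minimal sketch with an unbound table (the ibis-substrait compile step is how I understand it from the links below, not something I have verified):

```
import ibis
from ibis_substrait.compiler.core import SubstraitCompiler

# unbound table purely for illustration; names are placeholders
t = ibis.table([("a", "int64"), ("b", "int64")], name="t")

# sum(a * b) / sum(a), expressed as a one-row aggregation
expr = t.aggregate(result=(t.a * t.b).sum() / t.a.sum())

# compile the Ibis expression into a Substrait plan that a
# Substrait-consuming engine (such as Arrow's) could execute
plan = SubstraitCompiler().compile(expr)
```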
Hi Matthew, I tried polars and it failed at the first step, creating a pyDataFrame from a Parquet file (the pyarrow library has no issues reading it). I gave up on it since I am not very familiar with Rust, which is also the reason I didn't choose the DataFusion route; I am a bit more familiar with C++. Thanks all for your amazing work and helpful feedback. Hopefully someday I can contribute to this wonderful project.

On Thu, Mar 31, 2022, 10:09 AM Matthew Peters <[email protected]> wrote:

I have been playing around with polars for a few weeks. It has an expression system that is different from pandas, but it feels really clean, and it is built on Arrow in Rust. If you need something now, this would be a good option. It does everything mentioned here, adds a lazy evaluation system, and is very fast.

pola-rs/polars: Fast multi-threaded DataFrame library in Rust | Python | Node.js: https://github.com/pola-rs/polars

I just thought I would mention it.
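For example, the expression from the original question looks roughly like this in polars (a quick sketch with made-up data, showing both the eager and lazy forms):

```
import polars as pl

df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [2, 2, 2, 2]})

# eager: evaluate sum(a * b) / sum(a) directly
df.select(((pl.col("a") * pl.col("b")).sum() / pl.col("a").sum()).alias("result"))

# lazy: build a query plan first, then execute it with collect()
df.lazy().select(
    ((pl.col("a") * pl.col("b")).sum() / pl.col("a").sum()).alias("result")
).collect()
```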
Thanks,
Matt

On Thu, Mar 31, 2022 at 8:54 AM Ian Cook <[email protected]> wrote:

There is work ongoing to implement a pandas-like interface to the Arrow C++ compute engine using Ibis [1] and Substrait [2].

The gist of this is: Python users write Ibis code; the ibis-substrait submodule of Ibis [3] compiles this into a Substrait plan; and the Arrow C++ compute engine parses the Substrait plan and executes it, returning an Arrow table.

There is a recent Voltron Data blog post co-authored by Phillip Cloud (the lead developer of Ibis) and Jacques Nadeau (the originator of Substrait) with some additional details [4].

Since Substrait is an open standard, this can all be implemented without introducing tight coupling between the interface and the engine, at least in theory. Other Substrait-compatible engines will be able to consume and execute Substrait plans, and other Substrait-compatible interfaces will be able to produce Substrait plans. There is already some work ongoing to support Substrait in other interfaces and engines, including the Arrow R interface [5], the DataFusion engine [6], and the DuckDB engine [7].

I am not sure when the Ibis-Substrait-Arrow stack is expected to reach milestones of "real-world usability." I will leave it to others to comment on that.

Thanks,
Ian

[1] https://ibis-project.org
[2] https://substrait.io
[3] https://github.com/ibis-project/ibis-substrait
[4] https://voltrondata.com/news/introducing-substrait-an-interoperable-data-to-engine-connector/
[5] https://github.com/voltrondata/substrait-r
[6] https://github.com/datafusion-contrib/datafusion-substrait
[7] https://github.com/duckdblabs/duckdb-substrait-demo

On Thu, Mar 31, 2022 at 7:25 AM Suresh V <[email protected]> wrote:

Thanks Aldrin for the response. My goal is to have something like a numexpr/pandas equivalent where I can simply pass strings to get the results, and to ask whether the Arrow gods have any plans to create something like that in the near future :)

On Thu, Mar 31, 2022 at 3:15 AM Aldrin <[email protected]> wrote:

I think everything Weston said is totally good.

I just wanted to add that there may be a couple of "simpler" options available without using SQL-like approaches:
(1) just use the compute API (at least for expressions such as "sum(...) / sum(...)"), as in the short sketch after the example below
(2) when converting strings to expressions, do it in 2 passes: (a) put expressions supported by the dataset API into projections and scans on the source table, then (b) use the compute API for any other expressions and apply them to the final relation from (a). So, for one of the given examples, "sum(a * b) / sum(a)", this ends up being 2 passes of applying expressions: once via the dataset API and once via the compute API.

Note that (1) seems simple in my naive opinion, but (2) may actually be quite complicated. How much extra work it is probably depends on how you were planning on converting strings to expressions anyway. And note that you can do many iterations of (2) to use aggregates in filters or projections, etc.

----

The expression:

sum(a * b) / sum(a)

would be:

```
import pyarrow
from pyarrow import dataset
from pyarrow import compute

tab = pyarrow.Table.from_arrays(
    [
        pyarrow.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
        ,pyarrow.array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
    ]
    ,names=['a', 'b']
)

# some chain of scans and projections via the dataset API
relational_result = dataset.Scanner.from_batches(
    tab.to_batches()
    ,schema=tab.schema
    ,columns={
        'a': dataset.field('a')
        ,'product': dataset.field('a') * dataset.field('b')
    }
).to_table()

# a single "expression" that uses the compute API and the final
# relation from the previous "phase"; the division goes through the
# compute API too, since Arrow scalars don't define Python operators
expr_result = compute.divide(
    compute.sum(relational_result.column('product'))
    ,compute.sum(relational_result.column('a'))
)
```
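And for (1), the compute API alone handles this particular expression; a minimal sketch with the same made-up data:

```
import pyarrow as pa
import pyarrow.compute as pc

tab = pa.table({"a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                "b": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2]})

# sum(a * b) / sum(a), written entirely with compute functions
result = pc.divide(
    pc.sum(pc.multiply(tab["a"], tab["b"])),
    pc.sum(tab["a"]),
)
# result is <pyarrow.Int64Scalar: 2>
```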
Aldrin Montana
Computer Science PhD Student
UC Santa Cruz

On Wed, Mar 30, 2022 at 8:21 PM Weston Pace <[email protected]> wrote:

Yes and no :) Disclaimer: this answer is a little out of my wheelhouse, as I've learned relational algebra through doing, so my formal theory may be off. Anyone is welcome to correct me. Also, this answer turned into a bit of a ramble and is a bit scattershot. You may already be very familiar with some of these concepts; I'm not trying to be patronizing, but I do tend to ramble.

TL;DR: pyarrow has some "relational algebra interfaces" today but not a lot of "dataframe interfaces". What you're asking for is a little bit more of a "dataframe" type question, and you will probably need to go beyond pyarrow to get exactly what you are asking for. That being said, pyarrow has a lot of the primitives that can solve parts and pieces of your problem.

In relational algebra there is a special class of functions called "scalar functions". These are functions that create a single value for each row, for example "less than", "addition", and "to upper case". A scalar expression is then an expression that consists only of scalar functions. Projections in dataset scanners can only contain scalar expressions. In pyarrow you can scan an in-memory table and apply a scalar expression:

```
import pyarrow as pa
import pyarrow.dataset as ds

arr = pa.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
tab = pa.Table.from_arrays([arr], names=["x"])
expr = (ds.field("x") > 5) & (ds.field("x") < 8)
# This is a little bit unwieldy; we could maybe investigate a better
# API for this kind of thing
ds.Scanner.from_batches(tab.to_batches(), schema=tab.schema,
                        columns={'y': expr}).to_table()
# pyarrow.Table
# y: bool
# ----
# y: [[false,false,false,false,false,true,true,false,false,false]]
```

However, the functions that you are listing (sum, avg) are not scalar functions. They do, however, fit a special class of functions known as "aggregates" (functions that consume values from multiple rows to create a single output). Aggregates in relational algebra are used in a "group by" node. In pyarrow 7 we added the ability to run group_by functions very easily on in-memory tables, but it looks like it requires at least one key column, so that isn't going to help us here (though see the workaround sketched below).

Both of these features are powered by the pyarrow compute functions. These are slightly lower-level compute primitives. We can use them here to get some of the values you want:

```
# Continuing from the above example
import pyarrow.compute as pc

pc.sum(tab.column("x"))
# <pyarrow.Int64Scalar: 55>
pc.mean(tab.column("x"))
# <pyarrow.DoubleScalar: 5.5>
```
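As an aside, one possible workaround for the key-column requirement (a sketch, not something I've tested thoroughly) is to append a constant dummy key so the whole table lands in a single group:

```
import pyarrow as pa

tab = pa.table({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# append a constant dummy key so every row falls into one group
keyed = tab.append_column("k", pa.array([0] * len(tab)))
keyed.group_by("k").aggregate([("x", "sum")])
# one row, with an "x_sum" column containing 55
```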
But while the compute functions do give you the capability to run functions, they don't give you the capability to run "expressions".

Running expressions that contain both aggregate and scalar functions is a little trickier than it may seem. This is often done by creating a relational algebra query from the expression. For example, consider the expression `sum(a * b) / sum(a)`.

We can create the query (this is totally shorthand pseudocode; the Substrait text format isn't ready yet):

SCAN_TABLE(table) ->
PROJECT({"a*b": multiply(field("a"), field("b"))}) ->
GROUP_BY(keys=[], aggregates=[("a*b", "sum"), ("a", "sum")]) ->
PROJECT({"sum(a*b)/sum(a)": divide(field("sum(a*b)"), field("sum(a)"))})

So if you wanted to create such a query plan, then you could express it in Substrait, which is a spec for expressing query plans, and use our "very new and not quite ready yet" Substrait consumer API to process that query plan. And if your goal is purely "how do I express a series of compute operations as a string", then I will point out that SQL is a very standard answer to that question, and the Substrait text format will be a new way to answer it.

---

Ok, now everything I've written has been exclusive to pyarrow. If I step back and look at your original question, "how can I evaluate a dataframe-like expression against an Arrow table", I think the answer is going to lie outside of pyarrow (some cool functions like table group by are being added, but I'm not aware of any developers whose goal is to make a full-fledged dataframe API inside of pyarrow). Fortunately, the whole point of Arrow is to make it super easy for libraries to add new functionality to your data. There may be some library out there that does this already, there are libraries out there that solve similar problems like "how do I run this SQL query against Arrow data", and there are probably libraries in development to tackle this. Unfortunately, I'm not knowledgeable nor qualified enough to speak on any of these.

On Wed, Mar 30, 2022 at 1:16 AM Suresh V <[email protected]> wrote:

Hi,

Is there a way to evaluate mathematical expressions against the columns of a pyarrow table that is already in memory, similar to how projections work for dataset scanners?

The goal is to be able to specify a bunch of strings like sum(a * b) / sum(a), or avg(a[:10]), etc., convert these into expressions, and run them against the table.

Thanks
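For completeness, the "pass a string" part of the original question can be prototyped without any of the machinery above. A hypothetical sketch (not a pyarrow API) that wires Python's eval() to pyarrow.compute through a tiny operator-overloading wrapper:

```
import pyarrow as pa
import pyarrow.compute as pc

class Col:
    # tiny wrapper so Python operators inside eval() delegate to pyarrow.compute
    def __init__(self, data):
        self.data = data

    def __mul__(self, other):
        return Col(pc.multiply(self.data, _unwrap(other)))

    def __truediv__(self, other):
        return Col(pc.divide(self.data, _unwrap(other)))

def _unwrap(value):
    return value.data if isinstance(value, Col) else value

def eval_expr(text, table):
    # expose each column, plus a couple of aggregate functions, to eval()
    env = {name: Col(table[name]) for name in table.column_names}
    env["sum"] = lambda c: Col(pc.sum(c.data))
    env["avg"] = lambda c: Col(pc.mean(c.data))
    return eval(text, {"__builtins__": {}}, env).data

tab = pa.table({"a": [1, 2, 3, 4], "b": [2, 2, 2, 2]})
print(eval_expr("sum(a * b) / sum(a)", tab))  # <pyarrow.Int64Scalar: 2>
```

A real implementation would want a proper parser and a restricted function whitelist rather than eval(), but this shows how far the compute primitives alone can take the "strings in, results out" goal.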
