Seeing more support for Arrow-based functions would be great. It gives more control to application developers, and Pandas just becomes one of the available options.
On Fri, 3 Nov 2023, 21:23 Luca Canali, <luca.can...@cern.ch> wrote:

> Hi Enrico,
>
> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
> libraries that you mentioned I add awkward array, a library used in High
> Energy Physics.
>
> (For those interested, more details on how we tested awkward array with
> Spark from back when mapInArrow was introduced can be found at
> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md)
>
> Cheers,
> Luca
>
> *From:* Enrico Minack <i...@enrico.minack.dev>
> *Sent:* Thursday, October 26, 2023 15:33
> *To:* dev <dev@spark.apache.org>
> *Subject:* On adding applyInArrow to groupBy and cogroup
>
> Hi devs,
>
> PySpark allows to transform a DataFrame via Pandas *and* Arrow API:
>
>     df.mapInArrow(map_arrow, schema="...")
>     df.mapInPandas(map_pandas, schema="...")
>
> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
> Pandas interface, no Arrow interface:
>
>     df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>
> Providing a pure Arrow interface allows user code to use *any*
> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
> interfaces reduces the need to add more framework-specific support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
>
> Cheers,
> Enrico