Hi Enrico,

+1 on supporting Arrow on a par with Pandas. Besides the frameworks and libraries that you mentioned, I would add Awkward Array, a library used in High Energy Physics. For those interested, more details on how we tested Awkward Array with Spark, back when mapInArrow was introduced, can be found at https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md
Cheers,
Luca

From: Enrico Minack <i...@enrico.minack.dev>
Sent: Thursday, October 26, 2023 15:33
To: dev <dev@spark.apache.org>
Subject: On adding applyInArrow to groupBy and cogroup

Hi devs,

PySpark allows transforming a DataFrame via Pandas and Arrow APIs:

df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")

For df.groupBy(...) and df.groupBy(...).cogroup(...), there is only a Pandas interface, no Arrow interface:

df.groupBy("id").applyInPandas(apply_pandas, schema="...")

Providing a pure Arrow interface allows user code to use any Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow interfaces reduces the need to add more framework-specific support.

We need your thoughts on whether PySpark should support Arrow on a par with Pandas, or not: https://github.com/apache/spark/pull/38624

Cheers,
Enrico