Hi Enrico,

+1 on supporting Arrow on a par with Pandas. Besides the frameworks and libraries that you mentioned, I would add Awkward Array, a library used in High Energy Physics. For those interested, more details on how we tested Awkward Array with Spark, back when mapInArrow was introduced, can be found at https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md
Cheers,
Luca

From: Enrico Minack <i...@enrico.minack.dev>
Sent: Thursday, October 26, 2023 15:33
To: dev <dev@spark.apache.org>
Subject: On adding applyInArrow to groupBy and cogroup

Hi devs,

PySpark allows transforming a DataFrame via Pandas and Arrow APIs:

df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")

For df.groupBy(...) and df.groupBy(...).cogroup(...), there is only a Pandas interface, no Arrow interface:

df.groupBy("id").applyInPandas(apply_pandas, schema="...")

Providing a pure Arrow interface allows user code to use any Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow interfaces reduces the need to add more framework-specific support.

We need your thoughts on whether PySpark should support Arrow on a par with Pandas, or not: https://github.com/apache/spark/pull/38624

Cheers,
Enrico