Re: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Abdeali Kothari
Seeing more support for arrow based functions would be great.
Gives more control to application developers. And so pandas just becomes 1
of the available options.

On Fri, 3 Nov 2023, 21:23 Luca Canali,  wrote:

> Hi Enrico,
>
>
>
> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
> libraries that you mentioned I add awkward array, a library used in High
> Energy Physics
>
> (for those interested more details on how we tested awkward array with
> Spark from back when mapInArrow was introduced can be found at
> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md
> )
>
>
>
> Cheers,
>
> Luca
>
>
>
> *From:* Enrico Minack 
> *Sent:* Thursday, October 26, 2023 15:33
> *To:* dev 
> *Subject:* On adding applyInArrow to groupBy and cogroup
>
>
>
> Hi devs,
>
> PySpark allows to transform a DataFrame via Pandas *and* Arrow API:
>
> df.mapInArrow(map_arrow, schema="...")
> df.mapInPandas(map_pandas, schema="...")
>
> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
> Pandas interface, no Arrow interface:
>
> df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>
> Providing a pure Arrow interface allows user code to use *any*
> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
> interfaces reduces the need to add more framework-specific support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
>
> Cheers,
> Enrico
>


RE: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Luca Canali
Hi Enrico,

+1 on supporting Arrow on par with Pandas. Besides the frameworks and libraries 
that you mentioned I add awkward array, a library used in High Energy Physics
(for those interested more details on how we tested awkward array with Spark 
from back when mapInArrow was introduced can be found at 
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md
 )

Cheers,
Luca

From: Enrico Minack 
Sent: Thursday, October 26, 2023 15:33
To: dev 
Subject: On adding applyInArrow to groupBy and cogroup


Hi devs,

PySpark allows to transform a DataFrame via Pandas and Arrow API:

df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")

For df.groupBy(...) and df.groupBy(...).cogroup(...), there is only a Pandas 
interface, no Arrow interface:

df.groupBy("id").applyInPandas(apply_pandas, schema="...")

Providing a pure Arrow interface allows user code to use any Arrow-based data 
framework, not only Pandas, e.g. Polars. Adding Arrow interfaces reduces the 
need to add more framework-specific support.

We need your thoughts on whether PySpark should support Arrow on a par with 
Pandas, or not: https://github.com/apache/spark/pull/38624
Cheers,
Enrico


unsubscribe

2023-11-03 Thread Stefan Hagedorn