Re: On adding applyInArrow to groupBy and cogroup

2023-11-06 Thread Hyukjin Kwon
Sounds good, I'll review the PR.


Re: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Abdeali Kothari
Seeing more support for Arrow-based functions would be great. It gives more
control to application developers, and Pandas just becomes one of the
available options.
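
For illustration, a minimal sketch of that idea using the existing mapInArrow
with Polars instead of Pandas (Polars here and the column names are just an
example, not anything from the PR):

import polars as pl

def double_v(batches):
    # batches: iterator of pyarrow.RecordBatch; assumed columns id: long, v: long
    for batch in batches:
        out = pl.from_arrow(batch).with_columns(pl.col("v") * 2)
        # convert back to Arrow record batches for Spark
        yield from out.to_arrow().to_batches()

df.mapInArrow(double_v, schema="id long, v long")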


RE: On adding applyInArrow to groupBy and cogroup

2023-11-03 Thread Luca Canali
Hi Enrico,

+1 on supporting Arrow on par with Pandas. Besides the frameworks and libraries
you mentioned, I would add Awkward Array, a library used in High Energy Physics.
(For those interested, more details on how we tested Awkward Array with Spark
back when mapInArrow was introduced can be found at
https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md )
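
For illustration only, a rough sketch of that pattern with the existing
mapInArrow, assuming Awkward Array 2.x (the column names are made up, and the
exact conversion options may differ between Awkward versions):

import awkward as ak

def scale_v(batches):
    # batches: iterator of pyarrow.RecordBatch; assumed columns id: long, v: double
    for batch in batches:
        arr = ak.from_arrow(batch)  # view the batch as an Awkward Array
        out = ak.zip({"id": arr["id"], "v": arr["v"] * 2.0})
        # emit plain Arrow types (no Awkward extension types) so the
        # result matches the schema declared to Spark
        yield from ak.to_arrow_table(out, extensionarray=False).to_batches()

df.mapInArrow(scale_v, schema="id long, v double")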

Cheers,
Luca


Re: On adding applyInArrow to groupBy and cogroup

2023-10-28 Thread Adam Binford
I'm definitely +1 on including this.
- It seems like an odd feature-parity gap to have a map function but no
grouped apply function.
- There's currently no way to use the large Arrow types with applyInPandas,
which can lead to errors from hitting the 2 GiB maximum string/binary array
size. I have a PR to Arrow that would let Spark use the large types for
pandas operations, but it hasn't been merged yet, it will require another
Arrow release, and it will then require a lot of updates on the Spark side
to accommodate. Hyukjin already knocked most of that out in a closed PR, so
hopefully it won't take too long to resurrect it once the Arrow PR is in.
Until then, the only way to support large string/binary types would be a
direct applyInArrow function (see the sketch after this list).
- "Just use applyInPandas and convert that to Arrow" or "just use
mapInArrow and do a custom manual grouping" seem like odd arguments against
including it: performance is the main reason to use other libraries such as
Polars (or simply to avoid a pointless extra conversion to Pandas), and by
the same reasoning there would have been no reason to ever create
mapInArrow or applyInPandas, respectively.
- Based on the number of people commenting on the PR, this doesn't seem like
such a niche corner case. And aren't plenty of performance improvements
aimed at optimizing corner cases anyway?
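
For concreteness, a rough sketch of what a grouped Arrow apply could look
like; the method name and signature here just mirror applyInPandas and are
assumptions, since the exact API is what the PR is deciding. It stays in
Arrow end to end, so large_string/large_binary columns never pass through a
pandas conversion:

import pyarrow as pa
import pyarrow.compute as pc

def total_per_group(table: pa.Table) -> pa.Table:
    # called once per group; assumed columns id: long, v: double
    return pa.table({
        "id": [table.column("id")[0].as_py()],
        "total": [pc.sum(table.column("v")).as_py()],
    })

# hypothetical call shape, mirroring the existing applyInPandas
df.groupBy("id").applyInArrow(total_per_group, schema="id long, total double")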

Adam


On Thu, Oct 26, 2023 at 9:35 AM Enrico Minack 
wrote:

> Hi devs,
>
> PySpark allows transforming a DataFrame via the Pandas *and* the Arrow API:
>
> df.mapInArrow(map_arrow, schema="...")
> df.mapInPandas(map_pandas, schema="...")
>
> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
> Pandas interface, no Arrow interface:
>
> df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>
> Providing a pure Arrow interface allows user code to use *any*
> Arrow-based data framework, not only Pandas (e.g. Polars). Adding Arrow
> interfaces reduces the need to add more framework-specific support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
> Cheers,
> Enrico
>
>

-- 
Adam Binford