Sounds good, I'll review the PR.

On Fri, 3 Nov 2023 at 14:08, Abdeali Kothari <abdealikoth...@gmail.com> wrote:
> Seeing more support for Arrow-based functions would be great. Gives more
> control to application developers, and pandas just becomes one of the
> available options.
>
> On Fri, 3 Nov 2023, 21:23 Luca Canali, <luca.can...@cern.ch> wrote:
>
>> Hi Enrico,
>>
>> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
>> libraries that you mentioned, I add awkward array, a library used in
>> High Energy Physics.
>>
>> (For those interested, more details on how we tested awkward array with
>> Spark, from back when mapInArrow was introduced, can be found at
>> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md)
>>
>> Cheers,
>> Luca
>>
>> *From:* Enrico Minack <i...@enrico.minack.dev>
>> *Sent:* Thursday, October 26, 2023 15:33
>> *To:* dev <dev@spark.apache.org>
>> *Subject:* On adding applyInArrow to groupBy and cogroup
>>
>> Hi devs,
>>
>> PySpark allows transforming a DataFrame via the Pandas *and* Arrow APIs:
>>
>>     df.mapInArrow(map_arrow, schema="...")
>>     df.mapInPandas(map_pandas, schema="...")
>>
>> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
>> Pandas interface, no Arrow interface:
>>
>>     df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>>
>> Providing a pure Arrow interface allows user code to use *any*
>> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
>> interfaces reduces the need to add more framework-specific support.
>>
>> We need your thoughts on whether PySpark should support Arrow on a par
>> with Pandas, or not: https://github.com/apache/spark/pull/38624
>>
>> Cheers,
>> Enrico