Seeing more support for Arrow-based functions would be great. It gives more control to application developers, and Pandas just becomes one of the available options.
On Fri, 3 Nov 2023, 21:23 Luca Canali, <luca.can...@cern.ch> wrote:

> Hi Enrico,
>
> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
> libraries that you mentioned I add awkward array, a library used in High
> Energy Physics.
>
> (For those interested, more details on how we tested awkward array with
> Spark from back when mapInArrow was introduced can be found at
> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md)
>
> Cheers,
> Luca
>
> *From:* Enrico Minack <i...@enrico.minack.dev>
> *Sent:* Thursday, October 26, 2023 15:33
> *To:* dev <dev@spark.apache.org>
> *Subject:* On adding applyInArrow to groupBy and cogroup
>
> Hi devs,
>
> PySpark allows to transform a DataFrame via Pandas *and* Arrow API:
>
>     df.mapInArrow(map_arrow, schema="...")
>     df.mapInPandas(map_pandas, schema="...")
>
> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
> Pandas interface, no Arrow interface:
>
>     df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>
> Providing a pure Arrow interface allows user code to use *any*
> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
> interfaces reduces the need to add more framework-specific support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
>
> Cheers,
> Enrico