Re: On adding applyInArrow to groupBy and cogroup

Adam Binford Sat, 28 Oct 2023 10:55:36 -0700

I'm definitely +1 to include this.
- It seems like an odd feature parity gap to have a map function but no
group apply function.
- There's currently no way to use large arrow types with applyInPandas,
which can lead to errors hitting the 2 GiB max string/binary array size. I
have a PR to Arrow that would let Spark be able to use the large types for
pandas operations, but that still hasn't been merged, will require another
Arrow release, and then require a lot of updates on the Spark side to
accommodate. Hyukjin already knocked most of this out in a closed PR, so
hopefully it won't take too long to resurrect that after the Arrow PR.
Until then though the only way to support large string/binary types would
be through a direct applyInArrow function.
- "Just use applyInPandas and convert that to arrow" or "just use
mapInArrow and do a custom manual grouping" seem like odd arguments against
including it, as performance is the main reason to use other libraries like
Polars (or just avoiding an extra conversion to Pandas for no reason), and
by the same reasoning there would be no reason to ever have created
mapInArrow or applyInPandas, respectively.
- Based on multiple people commenting on the PR, it doesn't seem so much of
a niche corner case. But aren't there a lot of performance improvements
trying to optimize corner cases anyway?


Adam


On Thu, Oct 26, 2023 at 9:35 AM Enrico Minack <i...@enrico.minack.dev>
wrote:

> Hi devs,
>
> PySpark allows to transform a DataFrame via Pandas *and* Arrow API:
>
> df.mapInArrow(map_arrow, schema="...")
> df.mapInPandas(map_pandas, schema="...")
>
> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
> Pandas interface, no Arrow interface:
>
> df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>
> Providing a pure Arrow interface allows user code to use *any*
> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
> interfaces reduces the need to add more framework-specific support.
>
> We need your thoughts on whether PySpark should support Arrow on a par
> with Pandas, or not: https://github.com/apache/spark/pull/38624
> Cheers,
> Enrico
>
>

-- 
Adam Binford

Re: On adding applyInArrow to groupBy and cogroup

Reply via email to