I'm definitely +1 to include this. - It seems like an odd feature parity gap to have a map function but no group apply function. - There's currently no way to use large arrow types with applyInPandas, which can lead to errors hitting the 2 GiB max string/binary array size. I have a PR to Arrow that would let Spark be able to use the large types for pandas operations, but that still hasn't been merged, will require another Arrow release, and then require a lot of updates on the Spark side to accommodate. Hyukjin already knocked most of this out in a closed PR, so hopefully it won't take too long to resurrect that after the Arrow PR. Until then though the only way to support large string/binary types would be through a direct applyInArrow function. - "Just use applyInPandas and convert that to arrow" or "just use mapInArrow and do a custom manual grouping" seem like odd arguments against including it, as performance is the main reason to use other libraries like Polars (or just avoiding an extra conversion to Pandas for no reason), and by the same reasoning there would be no reason to ever have created mapInArrow or applyInPandas, respectively. - Based on multiple people commenting on the PR, it doesn't seem so much of a niche corner case. But aren't there a lot of performance improvements trying to optimize corner cases anyway?
Adam On Thu, Oct 26, 2023 at 9:35 AM Enrico Minack <i...@enrico.minack.dev> wrote: > Hi devs, > > PySpark allows to transform a DataFrame via Pandas *and* Arrow API: > > df.mapInArrow(map_arrow, schema="...") > df.mapInPandas(map_pandas, schema="...") > > For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a > Pandas interface, no Arrow interface: > > df.groupBy("id").applyInPandas(apply_pandas, schema="...") > > Providing a pure Arrow interface allows user code to use *any* > Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow > interfaces reduces the need to add more framework-specific support. > > We need your thoughts on whether PySpark should support Arrow on a par > with Pandas, or not: https://github.com/apache/spark/pull/38624 > Cheers, > Enrico > > -- Adam Binford