Sounds good, I'll review the PR.

On Fri, 3 Nov 2023 at 14:08, Abdeali Kothari <abdealikoth...@gmail.com> wrote:
> Seeing more support for Arrow-based functions would be great. Gives more
> control to application developers, and pandas just becomes one of the
> available options.
>
> On Fri, 3 Nov 2023, 21:23 Luca Canali, <luca.can...@cern.ch> wrote:
>
>> Hi Enrico,
>>
>> +1 on supporting Arrow on par with Pandas. Besides the frameworks and
>> libraries that you mentioned, I add awkward array, a library used in
>> High Energy Physics.
>>
>> (For those interested, more details on how we tested awkward array with
>> Spark, from back when mapInArrow was introduced, can be found at
>> https://github.com/LucaCanali/Miscellaneous/blob/master/Spark_Notes/Spark_MapInArrow.md)
>>
>> Cheers,
>> Luca
>>
>> *From:* Enrico Minack <i...@enrico.minack.dev>
>> *Sent:* Thursday, October 26, 2023 15:33
>> *To:* dev <dev@spark.apache.org>
>> *Subject:* On adding applyInArrow to groupBy and cogroup
>>
>> Hi devs,
>>
>> PySpark allows transforming a DataFrame via the Pandas *and* Arrow APIs:
>>
>>     df.mapInArrow(map_arrow, schema="...")
>>     df.mapInPandas(map_pandas, schema="...")
>>
>> For df.groupBy(...) and df.groupBy(...).cogroup(...), there is *only* a
>> Pandas interface, no Arrow interface:
>>
>>     df.groupBy("id").applyInPandas(apply_pandas, schema="...")
>>
>> Providing a pure Arrow interface allows user code to use *any*
>> Arrow-based data framework, not only Pandas, e.g. Polars. Adding Arrow
>> interfaces reduces the need to add more framework-specific support.
>>
>> We need your thoughts on whether PySpark should support Arrow on a par
>> with Pandas, or not: https://github.com/apache/spark/pull/38624
>>
>> Cheers,
>> Enrico