Hi devs,
PySpark allows transforming a |DataFrame| via both the Pandas *and* the Arrow API:
df.mapInArrow(map_arrow, schema="...")
df.mapInPandas(map_pandas, schema="...")
For |df.groupBy(...)| and |df.groupBy(...).cogroup(...)|, there is
*only* a Pandas interface, no Arrow interface:
df.groupBy("id").applyInPandas(apply_pandas, schema="...")
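Such a grouped-apply function receives all rows of one group as a single Pandas DataFrame and returns a DataFrame matching the declared schema. A minimal sketch, with hypothetical column names:

```python
import pandas as pd

def apply_pandas(group: pd.DataFrame) -> pd.DataFrame:
    # One group in, one DataFrame out: here, the per-group mean of "value".
    return pd.DataFrame({
        "id": [group["id"].iloc[0]],
        "mean_value": [group["value"].mean()],
    })
```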
Providing a pure Arrow interface would allow user code to use *any*
Arrow-based data framework, not just Pandas, e.g. Polars. Adding Arrow
interfaces also reduces the need to add more framework-specific support.
We need your thoughts on whether PySpark should support Arrow on a par
with Pandas, or not: https://github.com/apache/spark/pull/38624
Cheers,
Enrico