[jira] [Resolved] (SPARK-40559) Add applyInArrow to pyspark.sql.GroupedData

Hyukjin Kwon (Jira) Sun, 03 Dec 2023 17:22:05 -0800


     [ https://issues.apache.org/jira/browse/SPARK-40559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Hyukjin Kwon resolved SPARK-40559.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 38624
[https://github.com/apache/spark/pull/38624]

> Add applyInArrow to pyspark.sql.GroupedData
> -------------------------------------------
>
>                 Key: SPARK-40559
>                 URL: https://issues.apache.org/jira/browse/SPARK-40559
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Enrico Minack
>            Assignee: Enrico Minack
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> PySpark allows transforming a {{DataFrame}} via the Pandas and Arrow APIs:
> {code:python}
> def map_arrow(iter: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
>     return iter
> def map_pandas(iter: Iterator[pandas.DataFrame]) -> Iterator[pandas.DataFrame]:
>     return iter
> df.mapInArrow(map_arrow, schema="...")
> df.mapInPandas(map_pandas, schema="...")
> {code}
> A grouped {{DataFrame}} currently supports only the Pandas API:
> {code:python}
> def apply_pandas(df: pandas.DataFrame) -> pandas.DataFrame:
>     return df
> df.groupBy("id").applyInPandas(apply_pandas, schema="...")
> {code}
> A similar method for the Arrow API would be useful, especially given that
> Arrow is used by many other libraries.
> An Arrow-based method allows processing the {{DataFrame}} with *any*
> Arrow-based API, e.g. Polars:
> {code:python}
> def apply_polars(df: polars.DataFrame) -> polars.DataFrame:
>   return df
> def apply_arrow(iter: Iterator[pyarrow.RecordBatch]) -> Iterator[pyarrow.RecordBatch]:
>   for batch in iter:
>     df = polars.from_arrow(pyarrow.Table.from_batches([batch]))
>     for b in apply_polars(df).to_arrow().to_batches():
>       yield b
> df.groupBy("id").applyInArrow(apply_arrow, schema="...")
> {code}
> https://stackoverflow.com/questions/71606278/is-there-an-apache-arrow-equivalent-of-the-spark-pandas-udf
> https://stackoverflow.com/questions/73203318/how-to-transform-spark-dataframe-to-polars-dataframe
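
The grouped-apply behaviour requested in the description can be illustrated without Spark or Arrow. The following is a minimal plain-Python sketch of the split-apply-combine semantics only. It is not how Spark implements applyInPandas or applyInArrow, and this message does not state the signature that was eventually merged. The names apply_per_group and demean, and the sample rows, are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# A row is modelled as a plain dict; a real Arrow batch is columnar instead.
Row = Dict[str, object]


def apply_per_group(
    rows: List[Row],
    key: str,
    func: Callable[[List[Row]], List[Row]],
) -> List[Row]:
    """Split rows by `key`, apply `func` to each group, concatenate the results."""
    groups: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    out: List[Row] = []
    # Groups are visited in first-seen key order (dicts keep insertion order).
    for group_rows in groups.values():
        out.extend(func(group_rows))
    return out


def demean(group: List[Row]) -> List[Row]:
    """Hypothetical per-group function: subtract the group mean from column 'v'."""
    mean = sum(r["v"] for r in group) / len(group)
    return [{"id": r["id"], "v": r["v"] - mean} for r in group]


rows = [
    {"id": 1, "v": 1.0},
    {"id": 1, "v": 3.0},
    {"id": 2, "v": 5.0},
]
result = apply_per_group(rows, "id", demean)
```

The per-group function sees all rows of exactly one group at a time, which is the property that distinguishes the grouped methods from mapInArrow and mapInPandas, where the function sees an iterator of batches with no grouping guarantee.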



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
