[ https://issues.apache.org/jira/browse/SPARK-38833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-38833.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 36120
[https://github.com/apache/spark/pull/36120]

> PySpark applyInPandas should allow to return empty DataFrame without columns
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-38833
>                 URL: https://issues.apache.org/jira/browse/SPARK-38833
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark, SQL
>    Affects Versions: 3.4.0
>            Reporter: Enrico Minack
>            Assignee: Enrico Minack
>            Priority: Major
>             Fix For: 3.3.0
>
>
> Currently, returning an empty Pandas DataFrame from {{applyInPandas}} raises
> an error:
> {noformat}
> RuntimeError: Number of columns of the returned pandas.DataFrame doesn't
> match specified schema. Expected: 2 Actual: 0
> {noformat}
> Here is an example:
> {code}
> import pandas as pd
> from pyspark.sql.functions import pandas_udf, ceil
>
> df = spark.createDataFrame(
>     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
>     ("id", "v"))
>
> def mean_func(key, pdf):
>     if key == (1,):
>         return pd.DataFrame([])
>     else:
>         return pd.DataFrame([key + (pdf.v.mean(),)])
>
> df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
> {code}
> Since the schema is defined when calling {{applyInPandas()}}, it looks
> redundant to define the columns when returning an empty {{pd.DataFrame}}.
> Returning a non-empty DataFrame does not require defining columns, so
> returning an empty DataFrame shouldn't require that either.
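
On Spark versions that do not include this fix, one workaround in line with the description above is to return an empty DataFrame that still carries the column names from the schema. The following is a minimal pandas-only sketch, not code from the issue or the pull request: the helper name `empty_result` and the hard-coded `SCHEMA_COLUMNS` list are assumptions made for illustration.

```python
import pandas as pd

# Column names copied from the schema string "id long, v double" used in the
# example above (hard-coded here as an assumption for illustration).
SCHEMA_COLUMNS = ["id", "v"]


def empty_result(columns=SCHEMA_COLUMNS):
    """Hypothetical helper: an empty frame that still has the expected columns."""
    return pd.DataFrame(columns=columns)


def mean_func(key, pdf):
    # Same logic as the example in the issue, except that the empty branch
    # keeps the column names so the column count matches the schema.
    if key == (1,):
        return empty_result()
    return pd.DataFrame([key + (pdf.v.mean(),)])


# pd.DataFrame([]) has zero columns, which corresponds to the
# "Expected: 2 Actual: 0" part of the error message quoted above.
bare = pd.DataFrame([])
with_columns = empty_result()
print(len(bare.columns), len(with_columns.columns))  # prints "0 2"
```

The function would be passed to `applyInPandas` exactly as in the example from the issue. According to the resolution above, the fix from pull request 36120 should make the plain `pd.DataFrame([])` acceptable as well, so this workaround would only matter on versions without that change.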