Enrico Minack created SPARK-38833:
-------------------------------------

             Summary: PySpark allows applyInPandas return empty DataFrame without columns
                 Key: SPARK-38833
                 URL: https://issues.apache.org/jira/browse/SPARK-38833
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.4.0
            Reporter: Enrico Minack

Currently, returning an empty Pandas DataFrame from {{applyInPandas}} raises an error:
{noformat}
RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match specified schema. Expected: 2 Actual: 0
{noformat}

Here is an example:
{code}
import pandas as pd

from pyspark.sql.functions import pandas_udf, ceil

df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

def mean_func(key, pdf):
    if key == (1,):
        return pd.DataFrame([])
    else:
        return pd.DataFrame([key + (pdf.v.mean(),)])

df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
{code}

Since the schema is defined when calling {{applyInPandas()}}, it looks redundant to define the columns when returning an empty {{pd.DataFrame}}. Returning a non-empty DataFrame does not require defining columns, so returning an empty DataFrame shouldn't require that either.
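
For illustration, the mismatch reported in the error can be seen with pandas alone. The following is a minimal sketch, not part of the original report: {{SCHEMA_COLUMNS}} and the two helper functions are hypothetical names, and the column list is assumed to mirror the schema string {{"id long, v double"}}. It compares the column count of {{pd.DataFrame([])}} with that of an empty frame whose columns are named explicitly, which is the redundant step described above.

```python
import pandas as pd

# Hypothetical constant: column names assumed to mirror "id long, v double".
SCHEMA_COLUMNS = ["id", "v"]


def empty_without_columns():
    # What the example returns for key == (1,): zero rows and zero columns.
    return pd.DataFrame([])


def empty_with_columns():
    # Zero rows, but one column per field of the declared schema.
    return pd.DataFrame(columns=SCHEMA_COLUMNS)


no_cols = empty_without_columns()
with_cols = empty_with_columns()

# Column counts: the first frame has none, the second has one per schema field.
print(len(no_cols.columns), len(with_cols.columns))
```

The first frame has no columns, which corresponds to "Actual: 0" in the error message, while the second has as many columns as the schema. Whether Spark then accepts the second frame unchanged is an assumption here and may depend on the Spark and pyarrow versions in use.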