[jira] [Created] (SPARK-38833) PySpark allows applyInPandas return empty DataFrame without columns

Enrico Minack (Jira) Fri, 08 Apr 2022 05:36:13 -0700

Enrico Minack created SPARK-38833:
-------------------------------------

             Summary: PySpark allows applyInPandas return empty DataFrame 
without columns
                 Key: SPARK-38833
                 URL: https://issues.apache.org/jira/browse/SPARK-38833
             Project: Spark
          Issue Type: Improvement
          Components: PySpark, SQL
    Affects Versions: 3.4.0
            Reporter: Enrico Minack



Currently, returning an empty pandas DataFrame without columns from 
{{applyInPandas}} raises an error:

{noformat}
RuntimeError: Number of columns of the returned pandas.DataFrame doesn't match 
specified schema. Expected: 2 Actual: 0
{noformat}

Here is an example:
{code}
import pandas as pd

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def mean_func(key, pdf):
    if key == (1,):
        # Empty result with no columns: this triggers the RuntimeError above.
        return pd.DataFrame([])
    else:
        # Non-empty result without column names: matched to the schema by position.
        return pd.DataFrame([key + (pdf.v.mean(),)])

df.groupby('id').applyInPandas(mean_func, schema="id long, v double").show()
{code}

Since the schema is already defined when calling {{applyInPandas()}}, having 
to define the columns again when returning an empty {{pd.DataFrame}} is 
redundant. Returning a non-empty DataFrame does not require defining columns, 
so returning an empty DataFrame shouldn't require that either.
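
For comparison, here is a minimal pandas-only sketch of the difference between the two kinds of empty result, plus a possible workaround under the current behaviour. The helper name {{empty_result}} is hypothetical and not part of PySpark. It assumes that an empty DataFrame whose columns match the declared schema passes the column count check.

```python
import pandas as pd

# Hypothetical helper (not a PySpark API): an empty frame whose columns
# mirror the "id long, v double" schema used in the example above.
def empty_result():
    return pd.DataFrame({
        "id": pd.Series(dtype="int64"),
        "v": pd.Series(dtype="float64"),
    })

def mean_func(key, pdf):
    if key == (1,):
        # Workaround: return an empty frame that still carries the columns.
        return empty_result()
    # Non-empty result without column names, as in the original example.
    return pd.DataFrame([key + (pdf.v.mean(),)])

# What the original example returns for key (1,): zero rows AND zero columns.
no_columns = pd.DataFrame([])
# What the workaround returns: zero rows, two columns.
with_columns = empty_result()
# The non-empty branch, evaluated on the rows of group id == 2.
group2 = mean_func((2,), pd.DataFrame({"id": [2, 2, 2], "v": [3.0, 5.0, 10.0]}))

print(no_columns.shape)    # (0, 0)
print(with_columns.shape)  # (0, 2)
print(group2.shape)        # (1, 2)
```

The zero column count of {{pd.DataFrame([])}} is what the "Actual: 0" in the error message refers to. The improvement requested here would make the explicit column definition unnecessary.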



