[ https://issues.apache.org/jira/browse/SPARK-34348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Raman Srinivasan updated SPARK-34348:
-------------------------------------
    Description: 
 
{code:java}
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

def subtract_mean(pdf):
    # pdf is a pandas.DataFrame holding one group; append the group size
    # as a new "count" column before returning it
    pdf['count'] = pdf.shape[0]
    return pdf
{code}

Using a DDL-formatted string for the output schema works fine:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double, count 
int").show()

+---+----+-----+
| id|   v|count|
+---+----+-----+
|  1| 1.0|    2|
|  1| 2.0|    2|
|  2| 3.0|    3|
|  2| 5.0|    3|
|  2|10.0|    3|
+---+----+-----+
{code}
 

 

But using a StructType schema (appending an integer count column) fails:
{code:java}
df.groupby("id").applyInPandas(subtract_mean, 
schema=df.schema.add(StructField('count', IntegerType(), False))).show()

AnalysisException: Cannot resolve column name "count" among (id, v);

{code}
It appears that applyInPandas is looking for the new return field in the input schema?
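
A possible explanation (not verified): PySpark's StructType.add() appends the new field to the existing instance and returns it, and df.schema is cached on the DataFrame, so the call above may leave the input df looking like it already has a "count" column that it then fails to resolve. A minimal sketch of a workaround based on that assumption, building a fresh StructType instead of calling add() on df.schema:
{code:java}
from pyspark.sql.types import StructType, StructField, IntegerType

# Workaround sketch: construct a new StructType from the existing fields
# rather than calling add() on df.schema, which would append to the
# DataFrame's cached schema object in place.
out_schema = StructType(df.schema.fields + [StructField("count", IntegerType(), False)])

df.groupby("id").applyInPandas(subtract_mean, schema=out_schema).show()
{code}
This keeps df.schema untouched, so the grouped UDF's input columns stay (id, v) while the declared output gains the count field.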

Alternatively, as a workaround, is there a toDDL method I can use to get the current schema as a DDL string to which I can append the new return fields?
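
The Python-side StructType doesn't appear to expose toDDL (at least in 3.0.x), but the JVM schema object does; a sketch that reaches it through the DataFrame's internal _jdf handle (an internal attribute, not public API, and the exact formatting of the returned DDL string is an assumption):
{code:java}
# Sketch only: _jdf is an internal PySpark attribute, and the DDL string comes
# from the Scala-side StructType.toDDL; treat the exact output format as an
# assumption (e.g. "`id` BIGINT,`v` DOUBLE").
ddl = df._jdf.schema().toDDL()
df.groupby("id").applyInPandas(subtract_mean, schema=ddl + ", count int").show()
{code}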

 

 

> applyInPandas doesn't seem to work with StructType output schema 
> -----------------------------------------------------------------
>
>                 Key: SPARK-34348
>                 URL: https://issues.apache.org/jira/browse/SPARK-34348
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.0.1
>            Reporter: Raman Srinivasan
>            Priority: Major

