Hello Folks!

I was looking into the PySpark DataFrame cogroup + applyInPandas APIs.

As mentioned in the Spark 
doc<https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.PandasCogroupedOps.applyInPandas.html>,
the pandas UDF applied by applyInPandas takes two pandas.DataFrames and 
returns another pandas.DataFrame.
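
For context, here is a minimal sketch of the two-argument form as I understand 
it from the doc. The column names (time, id, left_val, right_val), the schema 
string and the helper name cogroup_two are made up for illustration; df1 and 
df2 are assumed to be PySpark DataFrames that both have "time" and "id" columns.

```python
import pandas as pd


def function_with_two_args(pdf1: pd.DataFrame, pdf2: pd.DataFrame) -> pd.DataFrame:
    # Called once per "id" group with that group's rows from each side.
    # Here the two sides are simply outer-joined on the shared key columns.
    return pd.merge(pdf1, pdf2, on=["time", "id"], how="outer")


def cogroup_two(df1, df2):
    # Hypothetical helper: df1 and df2 are pyspark.sql.DataFrame objects.
    # The schema string describes the pandas.DataFrame returned by the UDF
    # and is an assumption made for this example.
    return (
        df1.groupby("id")
        .cogroup(df2.groupby("id"))
        .applyInPandas(
            function_with_two_args,
            schema="time long, id long, left_val double, right_val double",
        )
    )
```

The UDF itself is plain pandas, so it can be tried on small pandas.DataFrames 
without a Spark session.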

I was wondering whether there is a way to make the pandas UDF accept more 
than two pandas.DataFrames as arguments when doing cogroup + applyInPandas, so 
I posted my question at 
https://stackoverflow.com/questions/75022710/is-it-possible-to-use-cogroup-applyinpandas-for-more-than-2-pyspark-dataframes.

The user D3V answered with a workaround. In the meantime, we both think it 
would be nice to have a feature that allows passing more than two pandas 
DataFrames to the pandas UDF given to cogroup.applyInPandas. What do you all 
think? Can we create a JIRA ticket for this feature?
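
To make the request concrete, here is a rough sketch of one possible 
workaround with today's APIs. It is not necessarily the approach from the 
Stack Overflow answer: each DataFrame is tagged with a "source" column, the 
DataFrames are unioned, and a plain groupby + applyInPandas splits them apart 
again inside the UDF. The names combine_sources, cogroup_many, VALUE_COLUMNS 
and all column names are hypothetical, and the sketch assumes Spark 3.1+ for 
unionByName(..., allowMissingColumns=True).

```python
from functools import reduce

import pandas as pd

# Hypothetical layout: source i contributes the value column VALUE_COLUMNS[i].
VALUE_COLUMNS = ["a_val", "b_val", "c_val"]


def combine_sources(pdf: pd.DataFrame) -> pd.DataFrame:
    # Split the stacked group back into one pandas.DataFrame per source.
    parts = [
        pdf.loc[pdf["source"] == i, ["time", "id", col]]
        for i, col in enumerate(VALUE_COLUMNS)
    ]
    # Combine the per-source frames; an outer join on the keys is one option.
    return reduce(
        lambda left, right: pd.merge(left, right, on=["time", "id"], how="outer"),
        parts,
    )


def cogroup_many(dfs):
    # dfs: list of pyspark.sql.DataFrame, in the same order as VALUE_COLUMNS.
    # pyspark is imported here so the pandas part can be tried without Spark.
    from pyspark.sql import functions as F

    tagged = [df.withColumn("source", F.lit(i)) for i, df in enumerate(dfs)]
    stacked = reduce(
        lambda a, b: a.unionByName(b, allowMissingColumns=True), tagged
    )
    return stacked.groupby("id").applyInPandas(
        combine_sources,
        schema="time long, id long, a_val double, b_val double, c_val double",
    )
```

This works, but it pushes every input through one union and forces the UDF to 
untangle the sources itself, which is why native support for more than two 
inputs seems worth discussing.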

Cheers,
Jamon
