Hello folks! I was looking into the PySpark DataFrame cogroup + applyInPandas APIs.
As mentioned in the Spark doc <https://spark.apache.org/docs/3.2.0/api/python/reference/api/pyspark.sql.PandasCogroupedOps.applyInPandas.html>, the pandas UDF applied by applyInPandas takes two pandas.DataFrames and returns another pandas.DataFrame. I was wondering whether there is a way to make the pandas UDF accept more than two pandas.DataFrames as arguments when doing cogroup + applyInPandas, so I posted my question at https://stackoverflow.com/questions/75022710/is-it-possible-to-use-cogroup-applyinpandas-for-more-than-2-pyspark-dataframes.

The question was answered by the user D3V, who provided a workaround. In the meantime, both of us think it would be nice to have a feature that allows passing more than two pandas DataFrames to the pandas UDF given to cogroup.applyInPandas.

What do you guys think about this? Can we create a Jira ticket for this feature?

Cheers,
Jamon
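
P.S. For reference, here is a minimal sketch of the two-DataFrame pattern that cogroup + applyInPandas supports today. The sample data, the body of the function, the schema string, and the run_on_spark helper are illustrative assumptions of mine, not something taken from the Stack Overflow answer. The pandas part runs on its own; the Spark part is wrapped in a function so that it only needs pyspark when it is actually called.

```python
import pandas as pd


def function_with_two_args(pdf1: pd.DataFrame, pdf2: pd.DataFrame) -> pd.DataFrame:
    # Called once per "id" group: pdf1 holds that group's rows from the left
    # side, pdf2 the rows from the right side. Exactly one frame is returned.
    return pd.merge_asof(pdf1, pdf2, on="time", by="id")


# Local check with plain pandas, using one group per side (made-up data).
left = pd.DataFrame({"time": [1, 2], "id": [1, 1], "v1": [1.0, 2.0]})
right = pd.DataFrame({"time": [1, 2], "id": [1, 1], "v2": [10.0, 20.0]})
result = function_with_two_args(left, right)


def run_on_spark(df1, df2):
    # Hypothetical helper: df1 and df2 are assumed to be pyspark DataFrames
    # with the same columns as `left` and `right` above.
    return df1.groupby("id").cogroup(df2.groupby("id")).applyInPandas(
        function_with_two_args,
        schema="time int, id int, v1 double, v2 double",  # assumed schema
    )
```

The point is that the function signature is fixed at two pandas.DataFrames, one per cogrouped side, which is why a third input cannot simply be passed as another argument.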