Hello, I fail to see how an equi-join on the key columns differs from the cogroup you propose.
I think the accepted answer can shed some light:
https://stackoverflow.com/questions/43960583/whats-the-difference-between-join-and-cogroup-in-apache-spark

With cogroup you apply a udf to each pair of iterables, one per key value.
You can achieve the same by:
1) joining df1 and df2 on the key you want,
2) applying "groupby" on that key,
3) finally applying a UDAF (you can have a look here if you are not
familiar with them:
https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html) that
will process each group "in isolation".

A rough sketch of this join-then-groupby approach is included below the
quoted message.

HTH,
Alessandro

On Tue, 19 Feb 2019 at 23:30, Li Jin <ice.xell...@gmail.com> wrote:

> Hi,
>
> We have been using PySpark's groupby().apply() quite a bit, and it has
> been very helpful in integrating Spark with our existing pandas-heavy
> libraries.
>
> Recently, we have found more and more cases where groupby().apply() is
> not sufficient - in some cases, we want to group two dataframes by the
> same key and apply a function which takes two pd.DataFrames (and also
> returns a pd.DataFrame) for each key. This feels very much like the
> "cogroup" operation in the RDD API.
>
> It would be great to be able to do something like this (not actual API,
> just to explain the use case):
>
> @pandas_udf(return_schema, ...)
> def my_udf(pdf1, pdf2):
>     # pdf1 and pdf2 are the subsets of the original dataframes that are
>     # associated with a particular key
>     result = ...  # some code that uses pdf1 and pdf2
>     return result
>
> df3 = cogroup(df1, df2, key='some_key').apply(my_udf)
>
> I have searched around the problem, and some people have suggested
> joining the tables first. However, it's often not the same pattern, and
> it is hard to get it to work using joins.
>
> I wonder what people's thoughts are on this?
>
> Li
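For concreteness, here is a minimal PySpark sketch of the join + groupby
approach suggested above, using a grouped-map pandas_udf in place of a Scala
UDAF (the dataframes, column names, and the aggregation logic are all made
up for illustration):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()

    # Toy dataframes sharing a key column "id" (names and values made up).
    df1 = spark.createDataFrame([(1, 1.0), (1, 2.0), (2, 3.0)], ["id", "v1"])
    df2 = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "v2"])

    # 1) Equi-join on the key.
    joined = df1.join(df2, on="id")

    # 2) + 3) Group by the key and apply a grouped-map pandas UDF; like a
    # UDAF, it sees all rows for one key value as a single pandas DataFrame.
    @pandas_udf("id long, total double", PandasUDFType.GROUPED_MAP)
    def summarize(pdf):
        # pdf contains the joined rows for one key, processed "in isolation".
        return pd.DataFrame({"id": [pdf["id"].iloc[0]],
                             "total": [(pdf["v1"] + pdf["v2"]).sum()]})

    df3 = joined.groupby("id").apply(summarize)
    df3.show()

Note that the join flattens the two sides into a single row set per key
before grouping, which is presumably why Li says below that it's often not
the same pattern as a true cogroup.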
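And for comparison, a toy sketch of the RDD-level cogroup Li mentions, which
keeps the two sides separate as a pair of iterables per key rather than
flattening them into joined rows (again, the data is made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    rdd1 = sc.parallelize([("a", 1), ("a", 2), ("b", 3)])
    rdd2 = sc.parallelize([("a", 10), ("c", 30)])

    # cogroup yields, per key, a pair of iterables: (values from rdd1,
    # values from rdd2); keys present on only one side get an empty iterable.
    for key, (left, right) in sorted(rdd1.cogroup(rdd2).collect()):
        print(key, list(left), list(right))
    # a [1, 2] [10]
    # b [3] []
    # c [] [30]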