Hi Spark users: I have a dataframe "df_t" that was generated from other dataframes through several transformations. I then did something very simple, just counting rows per group, with the following code:
(A)
    df_t_1 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt1")
    df_t_2 = df_t.groupby("Id").count().withColumnRenamed("count", "cnt2")
    df_t_3 = df_t_1.join(df_t_2, ["Id"])
    df_t.join(df_t_3, ["Id", "key"])

When I run this query, I get an error saying that "key" is missing during the join. However, the column "key" is clearly present in the dataframe df_t.

What is strange is that if I first do

(B)
    data = df_t.collect()
    df_t = spark.createDataFrame(data)

then (A) runs without error. But (B) should not change the dataframe df_t at all. Why does snippet (A) run after (B) but fail without it?

Also, a different join order completes without error:

(C)
    df_t_1 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt1")
    df_t_2 = df_t.groupby("Id").count().withColumnRenamed("count", "cnt2")
    df_t.join(df_t_1, ["Id", "key"]).join(df_t_2, ["Id"])

But (A) and (C) are conceptually the same and should produce the same result. What could possibly go wrong here? Any hints to track down the problem are appreciated. I am using Spark 2.1.
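For reference, here is a minimal, self-contained sketch of the shape of (A), using a tiny hand-built stand-in for df_t (the real df_t comes from transformations I haven't shown, so this toy version may not reproduce the error; a freshly created dataframe is essentially what (B) produces, and (A) works on that):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical stand-in for df_t with the same two join columns.
    # The real df_t is the result of several earlier transformations.
    df_t = spark.createDataFrame(
        [(1, "a"), (1, "a"), (1, "b"), (2, "a")],
        ["Id", "key"],
    )

    # Per-(Id, key) counts and per-Id counts, as in (A).
    df_t_1 = df_t.groupby(["Id", "key"]).count().withColumnRenamed("count", "cnt1")
    df_t_2 = df_t.groupby("Id").count().withColumnRenamed("count", "cnt2")

    # Join the two aggregates first, then join back onto df_t.
    df_t_3 = df_t_1.join(df_t_2, ["Id"])
    result = df_t.join(df_t_3, ["Id", "key"])
    result.show()

With my real df_t, it is the last join (the one on ["Id", "key"]) that raises the missing-column error.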