Hi Spark Users,
The following code snippet fails with an "attribute missing" error even
though the attribute exists. The bug is triggered by a particular sequence
of "select", "groupby" and "join" calls. Note that if I remove the "select"
on #line B, the code runs without error. However, that "select" includes
every column in the dataframe, so it should not affect the final result.

import pyspark.sql.functions as F

df = spark.createDataFrame([
    {'score': 1.0, 'ID': 'abc', 'LABEL': True, 'k': 2},
    {'score': 1.0, 'ID': 'abc', 'LABEL': False, 'k': 3},
])
df = df.withColumnRenamed("k", "kk")\
    .select("ID", "score", "LABEL", "kk")  #line B
df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL") > 1)
df = df.join(df_t.select("ID"), ["ID"])
df_sw = df.groupby(["ID", "kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID", "kk"])