A bug triggered by a particular sequence of "select", "groupby" and "join" in Spark 2.3.0

Shiyuan Mon, 09 Apr 2018 10:51:01 -0700

Hi Spark Users,
    The following code snippet fails with an "attribute missing" error even
though the attribute exists. The bug is triggered by a particular sequence of
"select", "groupby" and "join". If I take away the "select" in #line B, the
code runs without error. However, the "select" in #line B includes all columns
in the dataframe and hence should not affect the final result.



import pyspark.sql.functions as F

# Assumes an active SparkSession named `spark` (e.g. the pyspark shell).
df = spark.createDataFrame([{'score':1.0,'ID':'abc','LABEL':True,'k':2},
                            {'score':1.0,'ID':'abc','LABEL':False,'k':3}])

df = df.withColumnRenamed("k","kk")\
  .select("ID","score","LABEL","kk")    #line B

# Keep only the IDs that have more than one distinct LABEL.
df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL")>1)
df = df.join(df_t.select("ID"),["ID"])
# Count rows per (ID, kk) and join the counts back onto df.
df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
df = df.join(df_sw, ["ID","kk"])
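
For reference, the result the pipeline would be expected to produce if it ran without error can be sketched in plain Python, with no Spark involved. This is only an illustration of the intended semantics, not a workaround: the names `rows`, `keep_ids`, `kept`, `cnt1_by_key` and `expected` are hypothetical and do not appear in the original code, and the renamed column "kk" is used directly.

```python
from collections import Counter, defaultdict

# Same two input rows as the Spark snippet, with "k" already renamed to "kk".
rows = [
    {"score": 1.0, "ID": "abc", "LABEL": True, "kk": 2},
    {"score": 1.0, "ID": "abc", "LABEL": False, "kk": 3},
]

# groupby("ID") + countDistinct("LABEL") + filter(nL > 1):
# keep the IDs that carry more than one distinct LABEL.
labels_by_id = defaultdict(set)
for r in rows:
    labels_by_id[r["ID"]].add(r["LABEL"])
keep_ids = {i for i, labels in labels_by_id.items() if len(labels) > 1}

# Inner join on "ID": drop rows whose ID was filtered out.
kept = [r for r in rows if r["ID"] in keep_ids]

# groupby(["ID", "kk"]).count(), with the count column named "cnt1".
cnt1_by_key = Counter((r["ID"], r["kk"]) for r in kept)

# Final inner join on ["ID", "kk"]: attach the count to each surviving row.
expected = [dict(r, cnt1=cnt1_by_key[(r["ID"], r["kk"])]) for r in kept]

for r in expected:
    print(r)
```

Both input rows survive and each gets a count of 1, so the final dataframe should contain two rows with the columns ID, kk, score, LABEL and cnt1.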
