Hi,
What I am curious about is the reassignment of df.
Could you please look at the explain plan of df after the statement df =
df.join(df_t.select("ID"),["ID"]), and then compare it with the explain plan
of df1 after the statement df1 = df.join(df_t.select("ID"),["ID"])?
It's late here and I have yet to go through this completely, but I think
that Spark does throw a warning asking us to use Row instead of a
dictionary.
It would help if you could try the statement below and go through your use
case once again (I have yet to go through all the lines):
from pyspark.sql import Row
df = spark.createDataFrame([Row(score=1.0, ID="abc", LABEL=True, k=2),
                            Row(score=1.0, ID="abc", LABEL=False, k=3)])
Regards,
Gourav Sengupta
On Mon, Apr 9, 2018 at 6:50 PM, Shiyuan <[email protected]> wrote:
> Hi Spark Users,
> The following code snippet has an "attribute missing" error while the
> attribute exists. This bug is triggered by a particular sequence of
> "select", "groupby" and "join". Note that if I take away the "select" in
> #line B, the code runs without error. However, the "select" in #line B
> includes all columns in the dataframe and hence should not affect the
> final result.
>
>
> import pyspark.sql.functions as F
> df = spark.createDataFrame([{'score':1.0,'ID':'abc','LABEL':
> True,'k':2},{'score':1.0,'ID':'abc','LABEL':False,'k':3}])
>
> df = df.withColumnRenamed("k","kk")\
> .select("ID","score","LABEL","kk") #line B
>
> df_t = df.groupby("ID").agg(F.countDistinct("LABEL").alias("nL")).filter(F.col("nL")>1)
> df = df.join(df_t.select("ID"),["ID"])
> df_sw = df.groupby(["ID","kk"]).count().withColumnRenamed("count", "cnt1")
> df = df.join(df_sw, ["ID","kk"])
>