Hi Sunitha, Thank you for the reference Jira. It looks like this is the bug I'm hitting. Most of the bugs related to this seems to associate with dataframes derived from the one dataframe (base in this case). In SQL, this is a self-join and dropping d2.label should not affect d1.label. There are other bugs I found these three days that are associated with this type of joins. In one case, if I don't drop the duplicate column BEFORE the join, spark has preferences on the columns from d2 dataframe. I will see if I can replicate in a small program like above.
Best Regards, Jerry On Mon, Mar 28, 2016 at 6:27 PM, Sunitha Kambhampati <skambha...@gmail.com> wrote: > Hi Jerry, > > I think you are running into an issue similar to SPARK-14040 > https://issues.apache.org/jira/browse/SPARK-14040 > > One way to resolve it is to use alias. > > Here is an example that I tried on trunk and I do not see any exceptions. > > val d1=base.where($"label" === 0) as("d1") > val d2=base.where($"label" === 1).as("d2") > > d1.join(d2, $"d1.id" === $"d2.id", > "left_outer").drop($"d2.label").select($"d1.label") > > > Hope this helps some. > > Best regards, > Sunitha. > > On Mar 28, 2016, at 2:34 PM, Jerry Lam <chiling...@gmail.com> wrote: > > Hi spark users and developers, > > I'm using spark 1.5.1 (I have no choice because this is what we used). I > ran into some very unexpected behaviour when I did some join operations > lately. I cannot post my actual code here and the following code is not for > practical reasons but it should demonstrate the issue. > > val base = sc.parallelize(( 0 to 49).map(i =>(i,0)) ++ (50 to > 99).map((_,1))).toDF("id", "label") > val d1=base.where($"label" === 0) > val d2=base.where($"label" === 1) > d1.join(d2, d1("id") === d2("id"), > "left_outer").drop(d2("label")).select(d1("label")) > > > The above code will throw an exception saying the column label is not > found. Do you have a reason for throwing an exception when the column has > not been dropped for d1("label")? > > Best Regards, > > Jerry > > >