Re: Can't join same dataframe twice?
When working with DataFrames and using explain() to debug, I observed that Spark assigns a different tag number (expression ID) to the same dataframe's columns each time it takes part in a join. Like in this case:

val df1 = df2.join(df3, "Column1")
// The line below throws the "missing columns" error
val df4 = df1.join(df3, "Column2")

For instance, df2 has 2 columns, which get tagged as df2Col1#4, df2Col2#5.
df3 has 4 columns, which get tagged as df3Col1#6, df3Col2#7, df3Col3#8, df3Col4#9.

After the first join, df1's columns are tagged as df2Col1#10, df2Col2#11, df3Col1#12, df3Col2#13, df3Col3#14, df3Col4#15.

Now when df1 is joined with df3 again, df3's columns are tagged anew: df2Col1#16, df2Col2#17, df3Col1#18, df3Col2#19, df3Col3#20, df3Col4#21, df3Col2#23, df3Col3#24, df3Col4#25. But a reference such as df3Col1#12 still points at the previous dataframe's attribute, and that is what causes the issue.

Thanks,
Divya

On 27 April 2016 at 23:55, Ted Yu <yuzhih...@gmail.com> wrote:
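The re-tagging described above can be sketched as a toy model in plain Python rather than Spark. The ID-assignment scheme and all names below are illustrative assumptions, not Spark internals; the point is only that a join hands the reused side's columns fresh IDs, so attribute references captured from the standalone df3 no longer resolve against df1:

```python
from itertools import count

# Global counter mimicking Spark's unique expression-ID generator
# (the "#4", "#5", ... suffixes seen in explain() output).
_next_id = count(4)

def fresh_attrs(names):
    """Assign a fresh unique ID to each column name."""
    return [(name, next(_next_id)) for name in names]

def join(left, right):
    """Simplified: a join re-tags the reused (right) side's attributes
    with fresh IDs, so IDs from the earlier plan go stale."""
    return left + fresh_attrs([name for name, _ in right])

df2 = fresh_attrs(["Col1", "Col2"])                  # Col1#4, Col2#5
df3 = fresh_attrs(["Col1", "Col2", "Col3", "Col4"])  # Col1#6 .. Col4#9
df1 = join(df2, df3)                                 # df3's columns re-tagged as #10..#13

# None of df3's original IDs survive into df1's output attributes,
# so a condition built from the standalone df3 cannot resolve:
assert {i for _, i in df3}.isdisjoint({i for _, i in df1})
```

Under this model, joining df1 with df3 again by an expression referencing df3's old IDs fails for the same reason Divya observes: the IDs in the condition belong to a plan that no longer exists in df1.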
Re: Can't join same dataframe twice?
I wonder if Spark can provide better support for this case.

The following schema is not user friendly (shown previously):

StructField(b,IntegerType,false), StructField(b,IntegerType,false)

Except for 'select *', there is no way for the user to query either of the two fields.

On Tue, Apr 26, 2016 at 10:17 PM, Takeshi Yamamuro <linguin@gmail.com> wrote:
Re: Can't join same dataframe twice?
Based on my example, how about renaming the columns?

val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df3 = df1.join(df2, "a").select($"a", df1("b").as("1-b"), df2("b").as("2-b"))
val df4 = df3.join(df2, df3("2-b") === df2("b"))

// maropu

On Wed, Apr 27, 2016 at 1:58 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
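The rename-before-join pattern above can be illustrated outside Spark with a small pure-Python sketch (row-dicts standing in for DataFrames; the helper names are illustrative, not any library's API): renaming the colliding column on each side before the join leaves every value addressable by a unique name.

```python
def join_on(left, right, key):
    """Inner-join two lists of row-dicts on `key`; refuses to proceed if a
    non-key column name appears on both sides, mirroring Spark's ambiguity error."""
    dup = (set(left[0]) & set(right[0])) - {key}
    if dup:
        raise ValueError(f"Reference {sorted(dup)} is ambiguous")
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]

def rename(rows, old, new):
    """Return a copy of the rows with one column renamed."""
    return [{(new if k == old else k): v for k, v in row.items()} for row in rows]

df1 = [{"a": i, "b": i} for i in (1, 2, 3)]
df2 = [{"a": i, "b": i} for i in (1, 2, 3)]

# Renaming the colliding 'b' on each side first, as in the suggestion above,
# keeps both values addressable by a unique name after the join.
df3 = join_on(rename(df1, "b", "1-b"), rename(df2, "b", "2-b"), "a")
assert set(df3[0]) == {"a", "1-b", "2-b"}
```

Without the renames, `join_on(df1, df2, "a")` raises on the duplicate 'b', which is the same failure mode the thread is discussing.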
Re: Can't join same dataframe twice?
Correct, Takeshi.
I am facing the same issue.

How do I avoid the ambiguity?

On 27 April 2016 at 11:54, Takeshi Yamamuro <linguin@gmail.com> wrote:
Re: Can't join same dataframe twice?
Yeah, I think so. This is a common kind of mistake.

// maropu

On Wed, Apr 27, 2016 at 1:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:
Re: Can't join same dataframe twice?
The ambiguity came from:

scala> df3.schema
res0: org.apache.spark.sql.types.StructType = StructType(StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(b,IntegerType,false))

On Tue, Apr 26, 2016 at 8:54 PM, Takeshi Yamamuro <linguin@gmail.com> wrote:
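Why a duplicate field name is fatal to name-based lookup can be shown with a minimal resolver sketch in plain Python (a hypothetical helper, not Spark code): resolution succeeds only when exactly one field matches, which is precisely what the duplicate 'b' fields in the schema above break.

```python
def resolve(schema, name):
    """Name-based resolution over a flat list of field names: return the
    matching position, failing when zero or more than one field matches
    (the analogue of Spark's "Reference ... is ambiguous" error)."""
    matches = [i for i, field in enumerate(schema) if field == name]
    if len(matches) > 1:
        raise LookupError(f"Reference '{name}' is ambiguous, could be: {matches}")
    if not matches:
        raise LookupError(f"cannot resolve '{name}'")
    return matches[0]

# Field names from df3's schema shown above: one 'a', two 'b's.
df3_schema = ["a", "b", "b"]

assert resolve(df3_schema, "a") == 0   # unique name resolves fine
err = None
try:
    resolve(df3_schema, "b")           # duplicate name cannot resolve
except LookupError as e:
    err = str(e)
assert err is not None and "ambiguous" in err
```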
Re: Can't join same dataframe twice?
Hi,

I tried:

val df1 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df2 = Seq((1, 1), (2, 2), (3, 3)).toDF("a", "b")
val df3 = df1.join(df2, "a")
val df4 = df3.join(df2, "b")

And I got:

org.apache.spark.sql.AnalysisException: Reference 'b' is ambiguous, could be: b#6, b#14.;

If this is the same case, the message makes sense and the behaviour is clear.

Thoughts?

// maropu

On Wed, Apr 27, 2016 at 6:09 AM, Prasad Ravilla <pras...@slalom.com> wrote:
Re: Can't join same dataframe twice?
Also, check the column names of df1 (after joining df2 and df3).

Prasad.

From: Ted Yu
Date: Monday, April 25, 2016 at 8:35 PM
To: Divya Gehlot
Cc: "user @spark"
Subject: Re: Can't join same dataframe twice?
Re: Can't join same dataframe twice?
Can you show us the structure of df2 and df3?

Thanks

On Mon, Apr 25, 2016 at 8:23 PM, Divya Gehlot wrote:
Can't join same dataframe twice?
Hi,
I am using Spark 1.5.2.
I have a use case where I need to join the same dataframe twice, on two different columns, and I am getting a "missing columns" error.

For instance:

val df1 = df2.join(df3, "Column1")
// The line below throws the missing-columns error
val df4 = df1.join(df3, "Column2")

Is this a bug or a valid scenario?

Thanks,
Divya