Re: column expression in left outer join for DataFrame
Hi,

Thanks for your response. I modified my code as per your suggestion, but now I am getting a runtime error. Here's my code:

    val df_1 = df.filter(df("event") === 0).select("country", "cnt")
    val df_2 = df.filter(df("event") === 3).select("country", "cnt")

    df_1.show()
    // country  cnt
    // tw       3000
    // uk       2000
    // us       1000

    df_2.show()
    // country  cnt
    // tw       25
    // uk       200
    // us       95

    val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")

I am getting the following error when executing the join statement:

    java.util.NoSuchElementException: next on empty iterator

This error seems to originate at DataFrame.join (line 133 in DataFrame.scala). The show() results confirm that both dataframes have a column named country and that neither is empty. I also tried the simpler join (i.e. df_2.join(df_1)) and got the same error. I would like to know what is wrong with the join statement above.

thanks
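The result the poster is after can be sketched without Spark at all. The maps below are hypothetical stand-ins for the per-country counts shown by the two show() calls, and the left outer join is emulated with plain Scala collections:

```scala
// Hypothetical stand-ins for the df_1 and df_2 sample data above.
val cnt1 = Map("tw" -> 3000, "uk" -> 2000, "us" -> 1000) // event === 0
val cnt2 = Map("tw" -> 25, "uk" -> 200, "us" -> 95)      // event === 3

// Left outer join of df_2 with df_1 on country: every df_2 row is kept,
// and a country missing from df_1 would surface as None.
val both: Map[String, (Int, Option[Int])] =
  cnt2.map { case (country, c2) => country -> (c2, cnt1.get(country)) }
```

With the sample data every country matches, so each row pairs a df_2 count with Some(df_1 count).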
Re: column expression in left outer join for DataFrame
Unfortunately you are now hitting a bug (it is fixed in master and will be released in 1.3.1, hopefully next week). However, even with that fix your query is still ambiguous, and you will need to use aliases:

    val df_1 = df.filter(df("event") === 0).select("country", "cnt").as("a")
    val df_2 = df.filter(df("event") === 3).select("country", "cnt").as("b")
    val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
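Why aliases help can be illustrated without Spark: when two joined rows share a column name, a naive merge silently loses one of the values, while alias-qualified names keep both. This plain-Scala sketch (hypothetical, not Spark API) mimics what the .as("a") / .as("b") qualification achieves:

```scala
// Two rows that share column names, as in the un-aliased join.
val rowB: Map[String, Any] = Map("country" -> "tw", "cnt" -> 25)   // from df_2
val rowA: Map[String, Any] = Map("country" -> "tw", "cnt" -> 3000) // from df_1

// Naive merge: the duplicate names collide and rowA's cnt overwrites rowB's.
val collided = rowB ++ rowA

// Qualifying every name with an alias first keeps both values distinct.
def qualify(alias: String, row: Map[String, Any]): Map[String, Any] =
  row.map { case (k, v) => s"$alias.$k" -> v }
val joined = qualify("b", rowB) ++ qualify("a", rowA)
```

After qualification, both counts survive under the unambiguous names "b.cnt" and "a.cnt".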
Re: column expression in left outer join for DataFrame
Hi,

Thanks for your response. I am not clear about why the query is ambiguous:

    val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")

I thought df_2("country") === df_1("country") indicates that the country field in the 2 dataframes should match, with df_2("country") being the equivalent of df_2.country in SQL and df_1("country") the equivalent of df_1.country. So I am not sure why it is ambiguous. In Spark 1.2.0 I used the same logic with Spark SQL and tables (e.g. WHERE tab1.country = tab2.country) and had no problem getting the correct result.

thanks
Re: column expression in left outer join for DataFrame
That's a good question. In this particular example, it is really only internal implementation details that make it ambiguous. However, fixing this was a very large change, so we have deferred it to Spark 1.4 and instead print a warning now when you construct trivially equal expressions.

I can try to explain a little about why solving this in general is (mostly) impossible. Consider the following:

    val df = sqlContext.load(...)
    val df1 = df
    val df2 = df
    df1.join(df2, df1("a") === df2("a"))

compared with:

    SELECT * FROM df df1 JOIN df df2 WHERE df1.a = df2.a

In the first example, the assignment of df to df1 and df2 is completely transparent to the Catalyst optimizer, because it happens in Scala code. This means that df1("a") and df2("a") are completely indistinguishable to us (at least without crazy macro magic). In contrast, when the aliasing is done in SQL instead of Scala, it is visible to the optimizer, and thus we can differentiate the two sides. In your case you are doing transformations, and we could assign new unique ids each time a transformation is done. However, we don't do this today, and it's a pretty big change. There is a JIRA for this: SPARK-6231 (https://issues.apache.org/jira/browse/SPARK-6231)
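The "transparent to the optimizer" point can be made concrete with plain Scala: assignment creates no new object, so the two names are literally the same reference, and any library they are passed to cannot tell them apart (a toy sketch, not Spark code):

```scala
// A stand-in for a DataFrame; it carries no notion of an alias.
class Frame
val df  = new Frame
val df1 = df // plain Scala assignment...
val df2 = df // ...leaves no trace a library could inspect

// Both names refer to the very same object, so code receiving df1("a")
// and df2("a") cannot tell which side of the self-join each came from.
val sameReference = df1 eq df2
```

This is why the SQL form, where the aliases df1 and df2 are part of the query text itself, is distinguishable while the Scala form is not.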
Re: column expression in left outer join for DataFrame
You need to use `===`, so that you are constructing a column expression instead of evaluating the standard Scala equality method. Accessing columns as attributes (i.e. df.country) is only supported in Python.

    val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer")

On Tue, Mar 24, 2015 at 5:50 PM, SK skrishna...@gmail.com wrote:

Hi,

I am trying to port some code that was working in Spark 1.2.0 to the latest version, Spark 1.3.0. This code involves a left outer join between two SchemaRDDs, which I am now trying to change to a left outer join between 2 DataFrames. I followed the example for a left outer join of DataFrames at https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Here's my code, where df1 and df2 are the 2 dataframes I am joining on the country field:

    val join_df = df1.join(df2, df1.country == df2.country, "left_outer")

But I got a compilation error that value country is not a member of sql.DataFrame. I also tried the following:

    val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer")

and got a compilation error that it is a Boolean whereas a Column is required. So what is the correct Column expression I need to provide for joining the 2 dataframes on a specific field?

thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/column-expression-in-left-outer-join-for-DataFrame-tp22209.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
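The type error with `==` can be sketched with a toy Column class (a hypothetical miniature, not Spark's actual implementation): a user-defined `===` can return an expression object, whereas Scala's built-in `==` always evaluates ordinary equality and yields a Boolean, which is why join() rejects it.

```scala
// A join condition as data, rather than an evaluated Boolean.
case class EqualTo(left: String, right: String)

// Toy Column: `===` builds an EqualTo expression instead of comparing.
case class Column(name: String) {
  def ===(other: Column): EqualTo = EqualTo(name, other.name)
}

val c1 = Column("country")
val c2 = Column("country")
val cond: EqualTo = c1 === c2 // an expression a join could consume
val bool: Boolean = c1 == c2  // a plain Boolean -- hence the compile error
```

The same distinction holds in Spark: `df1("country") === df2("country")` produces a Column expression, while `==` collapses immediately to a Boolean.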