Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi,

Thanks for your response. I modified my code as per your suggestion, but
now I am getting a runtime error. Here's my code:

val df_1 = df.filter( df("event") === 0)
  .select("country", "cnt")

val df_2 = df.filter( df("event") === 3)
  .select("country", "cnt")

df_1.show()
// produces the following output:
// country  cnt
//   tw    3000
//   uk    2000
//   us    1000

df_2.show()
// produces the following output:
// country  cnt
//   tw      25
//   uk     200
//   us      95

val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")

I am getting the following error when executing the join statement:

java.util.NoSuchElementException: next on empty iterator.

This error seems to be originating at DataFrame.join (line 133 in
DataFrame.scala).

The show() results confirm that both DataFrames have a column named
"country" and that they are non-empty. I also tried the simpler join (i.e.
df_2.join(df_1)) and got the same error stated above.

I would like to know what is wrong with the join statement above.

thanks

Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
Unfortunately you are now hitting a bug (it is fixed in master and will
be released in 1.3.1, hopefully next week).  However, even with that fix
your query is still ambiguous, and you will need to use aliases:

val df_1 = df.filter( df("event") === 0)
  .select("country", "cnt").as("a")
val df_2 = df.filter( df("event") === 3)
  .select("country", "cnt").as("b")
val both = df_2.join(df_1, $"a.country" === $"b.country", "left_outer")
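
For completeness, a minimal sketch of how the aliases then let you refer to
the otherwise ambiguous columns after the join (this assumes
import sqlContext.implicits._ is in scope for the $ interpolator; the
renamed output columns are only illustrative):

import sqlContext.implicits._

// qualify each side's columns through its alias
val result = both.select(
  $"b.country",
  $"b.cnt".as("cnt_event3"),  // cnt from df_2 (event === 3)
  $"a.cnt".as("cnt_event0"))  // cnt from df_1 (event === 0)
result.show()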




Re: column expression in left outer join for DataFrame

2015-03-25 Thread S Krishna
Hi,

Thanks for your response. I am not clear about why the query is ambiguous.

val both = df_2.join(df_1, df_2("country") === df_1("country"), "left_outer")

I thought df_2("country") === df_1("country") indicates that the country
field in the two DataFrames should match, and that df_2("country") is the
equivalent of df_2.country in SQL, while df_1("country") is the equivalent
of df_1.country. So I am not sure why it is ambiguous. In Spark 1.2.0 I
used the same logic with Spark SQL and tables (e.g. WHERE tab1.country =
tab2.country) and had no problem getting the correct result.
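
For reference, here is a rough sketch of the Spark 1.2-style approach I
mean (the table names tab1 and tab2 are illustrative, and this assumes the
two filtered sets are registered as temporary tables):

df_1.registerTempTable("tab1")
df_2.registerTempTable("tab2")
val both = sqlContext.sql(
  """SELECT tab2.country, tab2.cnt AS cnt_3, tab1.cnt AS cnt_0
     FROM tab2 LEFT OUTER JOIN tab1 ON tab1.country = tab2.country""")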

thanks






Re: column expression in left outer join for DataFrame

2015-03-25 Thread Michael Armbrust
column expression in left outer join for DataFrame

2015-03-24 Thread SK
Hi,

I am trying to port some code that was working in Spark 1.2.0 to the latest
version, Spark 1.3.0. This code involves a left outer join between two
SchemaRDDs, which I am now trying to change to a left outer join between two
DataFrames. I followed the example for a left outer join of DataFrames at
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Here's my code, where df1 and df2 are the two DataFrames I am joining on the
country field:

 val join_df = df1.join(df2, df1.country == df2.country, "left_outer")

But I got a compilation error that value country is not a member of
sql.DataFrame.

I also tried the following:

 val join_df = df1.join(df2, df1("country") == df2("country"), "left_outer")

I got a compilation error that it is a Boolean whereas a Column is
required.

So what is the correct Column expression I need to provide for joining the
two DataFrames on a specific field?

thanks



Re: column expression in left outer join for DataFrame

2015-03-24 Thread Michael Armbrust
You need to use `===`, so that you are constructing a column expression
instead of evaluating the standard Scala equality method.  Calling methods
to access columns (i.e. df.country) is only supported in Python.

val join_df = df1.join(df2, df1("country") === df2("country"), "left_outer")
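
To spell out the distinction, here is a small illustrative sketch (the
types shown are inferred from the Spark 1.3 API):

// Scala's == compares the two Column objects themselves and yields a
// plain Boolean, which is why the compiler rejects it as a join condition
val asBoolean: Boolean = df1("country") == df2("country")

// Column's === instead builds a new Column representing the equality
// expression, which is what join expects
val asColumn: org.apache.spark.sql.Column = df1("country") === df2("country")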
