Re: join two DataFrames, same column name
You can include * and a column alias in the same select clause var df1 = sqlContext.sql(select *, column_id AS table1_id from table1) FYI, this does not ultimately work as the * still includes column_id and you cannot have two columns of that name in the joined DataFrame. So I ended up aliasing both sides of the join. On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust mich...@databricks.com wrote: You can include * and a column alias in the same select clause var df1 = sqlContext.sql(select *, column_id AS table1_id from table1) I'm also hoping to resolve SPARK-6376 https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1 which will let you do something like: var df1 = sqlContext.sql(select * from table1).as(t1) var df2 = sqlContext.sql(select * from table2).as(t2) df1.join(df2, df1(column_id) === df2(column_id)).select(t1.column_id) Finally, there is SPARK-6380 https://issues.apache.org/jira/browse/SPARK-6380 that hopes to simplify this particular case. Michael On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an exception because it finds the two columns of the same name to be ambiguous. Is there a way to specify which side of the join comes from data frame A and which comes from B? var df1 = sqlContext.sql(select * from table1) var df2 = sqlContext.sql(select * from table2) df1.join(df2, df1(column_id) === df2(column_id))
Re: join two DataFrames, same column name
Michael, thank you for the workaround and for letting me know of the upcoming enhancements, both of which sound appealing. On Sun, Mar 22, 2015 at 1:25 PM, Michael Armbrust mich...@databricks.com wrote: You can include * and a column alias in the same select clause var df1 = sqlContext.sql(select *, column_id AS table1_id from table1) I'm also hoping to resolve SPARK-6376 https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1 which will let you do something like: var df1 = sqlContext.sql(select * from table1).as(t1) var df2 = sqlContext.sql(select * from table2).as(t2) df1.join(df2, df1(column_id) === df2(column_id)).select(t1.column_id) Finally, there is SPARK-6380 https://issues.apache.org/jira/browse/SPARK-6380 that hopes to simplify this particular case. Michael On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an exception because it finds the two columns of the same name to be ambiguous. Is there a way to specify which side of the join comes from data frame A and which comes from B? var df1 = sqlContext.sql(select * from table1) var df2 = sqlContext.sql(select * from table2) df1.join(df2, df1(column_id) === df2(column_id))
Re: join two DataFrames, same column name
You can include * and a column alias in the same select clause var df1 = sqlContext.sql(select *, column_id AS table1_id from table1) I'm also hoping to resolve SPARK-6376 https://issues.apache.org/jira/browse/SPARK-6376 before Spark 1.3.1 which will let you do something like: var df1 = sqlContext.sql(select * from table1).as(t1) var df2 = sqlContext.sql(select * from table2).as(t2) df1.join(df2, df1(column_id) === df2(column_id)).select(t1.column_id) Finally, there is SPARK-6380 https://issues.apache.org/jira/browse/SPARK-6380 that hopes to simplify this particular case. Michael On Sat, Mar 21, 2015 at 3:02 PM, Eric Friedman eric.d.fried...@gmail.com wrote: I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an exception because it finds the two columns of the same name to be ambiguous. Is there a way to specify which side of the join comes from data frame A and which comes from B? var df1 = sqlContext.sql(select * from table1) var df2 = sqlContext.sql(select * from table2) df1.join(df2, df1(column_id) === df2(column_id))
join two DataFrames, same column name
I have a couple of data frames that I pulled from SparkSQL and the primary key of one is a foreign key of the same name in the other. I'd rather not have to specify each column in the SELECT statement just so that I can rename this single column. When I try to join the data frames, I get an exception because it finds the two columns of the same name to be ambiguous. Is there a way to specify which side of the join comes from data frame A and which comes from B? var df1 = sqlContext.sql(select * from table1) var df2 = sqlContext.sql(select * from table2) df1.join(df2, df1(column_id) === df2(column_id))