[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15594989#comment-15594989 ] Ashish Shrowty commented on SPARK-17709:

[~sowen], [~smilegator] - I confirmed that this is not a problem in 2.0.1. Sorry, I forgot to come back and post my finding. Thanks for your help!

> spark 2.0 join - column resolution error
>
> Key: SPARK-17709
> URL: https://issues.apache.org/jira/browse/SPARK-17709
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Ashish Shrowty
> Priority: Critical
>
> If I try to inner-join two dataframes which originated from the same initial
> dataframe that was loaded using a spark.sql() call, it results in an error -
>
> // reading from Hive .. the data is stored in Parquet format in Amazon S3
> val d1 = spark.sql("select * from ")
> val df1 = d1.groupBy("key1","key2")
>             .agg(avg("totalprice").as("avgtotalprice"))
> val df2 = d1.groupBy("key1","key2")
>             .agg(avg("itemcount").as("avgqty"))
> df1.join(df2, Seq("key1","key2")) gives error -
> org.apache.spark.sql.AnalysisException: using columns ['key1,'key2] can
> not be resolved given input columns: [key1, key2, avgtotalprice, avgqty];
>
> If the same Dataframe is initialized via spark.read.parquet(), the above code
> works. This same code above worked with Spark 1.6.2
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578200#comment-15578200 ] Ashish Shrowty commented on SPARK-17709:

Oh .. sorry .. I misread. Will try with 2.0.1 later.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576763#comment-15576763 ] Xiao Li commented on SPARK-17709:

That is what I said above. The deduplication is not triggered. It looks weird to me. Please try 2.0.1; we fixed a lot of bugs in that release. Thanks!
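(For anyone still on 2.0.0 where the USING-style join above fails, one possible way around it is to rename the join keys on one side and join with an explicit condition. This is only a sketch, not verified against the reporter's tables, and it assumes df1/df2 are built as in the issue description.)

{code}
// Possible 2.0.0 workaround (sketch, unverified): avoid the USING-column resolution
// by renaming the keys on one leg and joining with an explicit condition.
val df2r = df2.withColumnRenamed("key1", "k1").withColumnRenamed("key2", "k2")
val joined = df1.join(df2r, df1("key1") === df2r("k1") && df1("key2") === df2r("k2"))
  .drop("k1").drop("k2")
joined.show()
{code}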
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576143#comment-15576143 ] Ashish Shrowty commented on SPARK-17709:

There is a slight difference: in my case it's companyid#121 in both relations, whereas in yours it's different. Perhaps that is causing the resolution error?
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15576141#comment-15576141 ] Ashish Shrowty commented on SPARK-17709:

There is a slight difference: in my case the generated IDs are the same, e.g. companyid#121 in both aggregates, whereas in your plan the ids are different, companyid#5 and companyid#46. This is probably causing the resolution error?
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575999#comment-15575999 ] Xiao Li commented on SPARK-17709:

Below are the statements I used to recreate the problem:

{noformat}
sql("CREATE TABLE testext2(companyid int, productid int, price int, count int) using parquet")
sql("insert into testext2 values (1, 1, 1, 1)")

val d1 = spark.sql("select * from testext2")
val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price"))
val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df1.join(df2, Seq("companyid", "productid")).show
{noformat}

Can you try it?
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575995#comment-15575995 ] Xiao Li commented on SPARK-17709:

Still works well in 2.0.1.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15575938#comment-15575938 ] Xiao Li commented on SPARK-17709:

I can get exactly the same plan in the master branch, but my job passes.

{noformat}
'Join UsingJoin(Inner,List('companyid, 'productid))
:- Aggregate [companyid#5, productid#6], [companyid#5, productid#6, sum(cast(price#7 as bigint)) AS price#30L]
:  +- Project [companyid#5, productid#6, price#7, count#8]
:     +- SubqueryAlias testext2
:        +- Relation[companyid#5,productid#6,price#7,count#8] parquet
+- Aggregate [companyid#46, productid#47], [companyid#46, productid#47, sum(cast(count#49 as bigint)) AS count#41L]
   +- Project [companyid#46, productid#47, price#48, count#49]
      +- SubqueryAlias testext2
         +- Relation[companyid#46,productid#47,price#48,count#49] parquet
{noformat}

The only difference is that yours does not trigger deduplication of expression ids. Let me try it in the 2.0.1 branch.
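(A quick way to compare the expression ids on each leg, before the join is even attempted, is to print the analyzed plan of each DataFrame; queryExecution.analyzed and explain(true) are the relevant hooks. Sketch, assuming df1/df2 from the script above.)

{code}
// Sketch: inspect each leg's analyzed plan and look at the attribute ids
// (e.g. companyid#5 vs companyid#46). If both legs show identical ids, the
// expression-id deduplication discussed above has not kicked in.
println(df1.queryExecution.analyzed)
println(df2.queryExecution.analyzed)
df1.explain(true)   // prints parsed, analyzed, optimized and physical plans
{code}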
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15573512#comment-15573512 ] Ashish Shrowty commented on SPARK-17709:

[~smilegator] I compiled with the added debug information and here is the output -

{code:java}
scala> val d1 = spark.sql("select * from testext2")
d1: org.apache.spark.sql.DataFrame = [productid: int, price: float ... 2 more fields]

scala> val df1 = d1.groupBy("companyid","productid").agg(sum("price").as("price"))
df1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 more field]

scala> val df2 = d1.groupBy("companyid","productid").agg(sum("count").as("count"))
df2: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 1 more field]

scala> df1.join(df2, Seq("companyid", "productid")).show
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] can not be resolved given input columns: [companyid, productid, price, count] ;;
'Join UsingJoin(Inner,List('companyid, 'productid))
:- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, sum(cast(price#123 as double)) AS price#166]
:  +- Project [productid#122, price#123, count#124, companyid#121]
:     +- SubqueryAlias testext2
:        +- Relation[productid#122,price#123,count#124,companyid#121] parquet
+- Aggregate [companyid#121, productid#122], [companyid#121, productid#122, sum(cast(count#124 as bigint)) AS count#177L]
   +- Project [productid#122, price#123, count#124, companyid#121]
      +- SubqueryAlias testext2
         +- Relation[productid#122,price#123,count#124,companyid#121] parquet

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided
{code}
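(Given that both legs above carry the same attribute ids, e.g. companyid#121, a possible workaround, untested here, is to build the second aggregate from a fresh scan of the table so the analyzer assigns distinct ids to that leg. Sketch, using the testext2 table from the output above.)

{code}
// Sketch of a possible workaround (unverified): re-read the table for the second
// aggregate so its leg gets fresh attribute ids instead of sharing companyid#121 etc.
import org.apache.spark.sql.functions.sum   // already in scope in spark-shell
val d2   = spark.sql("select * from testext2")
val df2b = d2.groupBy("companyid", "productid").agg(sum("count").as("count"))
df1.join(df2b, Seq("companyid", "productid")).show
{code}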
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566160#comment-15566160 ] Ashish Shrowty commented on SPARK-17709:

Cool.. thanks. Will do this in the next day or two.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566126#comment-15566126 ] Xiao Li commented on SPARK-17709:

Below is the link: http://spark.apache.org/docs/latest/building-spark.html
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15566114#comment-15566114 ] Ashish Shrowty commented on SPARK-17709:

I assume I would need to modify the Spark code and build the Spark libraries locally? Haven't done that before, but willing to try. Are there docs/links you can point me to that show the best way to go about this? Thanks
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562826#comment-15562826 ] Xiao Li commented on SPARK-17709:

It is pretty hard to reproduce in our environment. Are you able to make some source changes? https://github.com/apache/spark/pull/15316 That might help us identify the root cause. Thanks!
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15562368#comment-15562368 ] Ashish Shrowty commented on SPARK-17709:

[~dkbiswal], [~smilegator] - Hi guys, any thoughts? I exchanged notes with the AWS folks; they were able to replicate the same issue and believe it might be a Spark issue. Thanks, Ashish
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543921#comment-15543921 ] Ashish Shrowty commented on SPARK-17709:

[~dkbiswal] Sorry Dilip .. I keep making typos .. the join was on companyid and productid -

{code}
scala> df1.join(df2, Seq("companyid","productid"))
org.apache.spark.sql.AnalysisException: using columns ['companyid,'productid] can not be resolved given input columns: [companyid, productid, avgprice, avgitemcount] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 48 elided
{code}

Attached are the explain outputs for df1 and df2 -

{code}
scala> df1.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#53, productid#54], functions=[avg(price#56)])
+- Exchange hashpartitioning(companyid#53, productid#54, 200)
   +- *HashAggregate(keys=[companyid#53, productid#54], functions=[partial_avg(price#56)])
      +- *Sample 0.0, 0.5, false, 2419324063718201506
         +- *Project [companyid#53, productid#54, price#56]
            +- *BatchedScan parquet referencedata.testproduct[productid#54,price#56,companyid#53] Format: ParquetFormat, InputPaths: s3://com.birdzi.datalake.test/testtable/companyid=100, s3://com.birdzi.datalake.test/testtable/co..., PushedFilters: [], ReadSchema: struct

scala> df2.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#53, productid#54], functions=[avg(cast(itemcount#57 as bigint))])
+- Exchange hashpartitioning(companyid#53, productid#54, 200)
   +- *HashAggregate(keys=[companyid#53, productid#54], functions=[partial_avg(cast(itemcount#57 as bigint))])
      +- *Sample 0.0, 0.5, false, -7492644014085475670
         +- *Project [companyid#53, productid#54, itemcount#57]
            +- *BatchedScan parquet referencedata.testproduct[productid#54,itemcount#57,companyid#53] Format: ParquetFormat, InputPaths: s3://com.birdzi.datalake.test/testtable/companyid=100, s3://com.birdzi.datalake.test/testtable/co..., PushedFilters: [], ReadSchema: struct
{code}
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543141#comment-15543141 ] Dilip Biswal commented on SPARK-17709:

@ashrowty Hi Ashish, in your example the column loyaltycardnumber is not in the output set, and that is why we see the exception. I tried using productid instead and got the correct result.

{code}
scala> df1.join(df2, Seq("companyid","loyaltycardnumber"));
org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [productid, companyid, avgprice, avgitemcount, companyid, productid] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:132)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:57)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:61)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2651)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:679)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:652)
  ... 48 elided

scala> df1.join(df2, Seq("companyid","productid"));
res1: org.apache.spark.sql.DataFrame = [companyid: int, productid: int ... 2 more fields]

scala> df1.join(df2, Seq("companyid","productid")).show
+---------+---------+--------+------------+
|companyid|productid|avgprice|avgitemcount|
+---------+---------+--------+------------+
|      101|        3|    13.0|        12.0|
|      100|        1|    10.0|        10.0|
+---------+---------+--------+------------+
{code}
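(As a small defensive step, the public columns field can be used to check that the intended join keys actually exist on both sides before calling join; a sketch, using the same df1/df2 as above.)

{code}
// Sketch: make a missing join key an explicit error instead of an AnalysisException.
val keys = Seq("companyid", "productid")
require(keys.forall(k => df1.columns.contains(k)) && keys.forall(k => df2.columns.contains(k)),
  s"join keys ${keys.mkString(", ")} must be present in both DataFrames")
df1.join(df2, keys).show
{code}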
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15543046#comment-15543046 ] Dilip Biswal commented on SPARK-17709:

Hi Ashish, Thanks a lot.. will try and get back.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15538927#comment-15538927 ] Ashish Shrowty commented on SPARK-17709:

[~dkbiswal] I just went through the manual steps of creating the table in Hive (using EMR 5.0.0), inserting data into it, and then querying with Spark .. and got the exception. Steps I followed -

Step 1 - create the table in Hive:

{noformat}
hive> create external table referencedata.testproduct (
    >   productid int,
    >   name string,
    >   price double,
    >   itemcount int
    > ) PARTITIONED BY (companyid int)
    > STORED AS PARQUET
    > LOCATION 's3://com.birdzi.datalake.test/testtable';
{noformat}

Step 2 - insert data:

{noformat}
set hive.exec.dynamic.partition.mode=nonstrict;
insert into referencedata.testproduct partition(companyid) values(1,"p1",10.0,10,100);
insert into referencedata.testproduct partition(companyid) values(2,"p1",12.0,12,100);
insert into referencedata.testproduct partition(companyid) values(3,"p3",13.0,12,101);
{noformat}

Step 3 - query using spark-shell:

{noformat}
val d1 = spark.sql("select * from referencedata.testproduct")
val df1 = d1.sample(false,0.5).select("companyid","productid","price").groupBy("companyid","productid").agg(avg("price").as("avgprice"))
val df2 = d1.sample(false,0.5).select("companyid","productid","itemcount").groupBy("companyid","productid").agg(avg("itemcount").as("avgitemcount"))

df1.join(df2, Seq("companyid","loyaltycardnumber")) .. throws exception -

org.apache.spark.sql.AnalysisException: using columns ['companyid,'loyaltycardnumber] can not be resolved given input columns: [companyid, productid, price, avgitemcount] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
  ... 49 elided
{noformat}
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537740#comment-15537740 ] Ashish Shrowty commented on SPARK-17709:

Join keys are both companyid and loyaltycardnumber. Wonder why you are not seeing it. I tried it on a few other tables I have and it's the same behavior.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537336#comment-15537336 ] Dilip Biswal commented on SPARK-17709:

[~ashrowty] Hmmn.. and your join keys are companyid or loyaltycardnumber or both? If so, I have the exact same scenario but am not seeing the error you are seeing.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537264#comment-15537264 ] Ashish Shrowty commented on SPARK-17709:

[~dkbiswal] Attached are the explain() outputs -

{noformat}
df1.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[avg(cast(itemcount#3372 as bigint))])
+- Exchange hashpartitioning(companyid#3364, loyaltycardnumber#3370, 200)
   +- *HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[partial_avg(cast(itemcount#3372 as bigint))])
      +- *Project [loyaltycardnumber#3370, itemcount#3372, companyid#3364]
         +- *BatchedScan parquet facts.storetransaction[loyaltycardnumber#3370,itemcount#3372,year#3362,month#3363,companyid#3364] Format: ParquetFormat, InputPaths: s3://com.birdzi.datalake.test/basedatasets/facts/storetransaction/2016-09-15-2012/year=2002/month..., PushedFilters: [], ReadSchema: struct

df2.explain
== Physical Plan ==
*HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[avg(totalprice#3373)])
+- Exchange hashpartitioning(companyid#3364, loyaltycardnumber#3370, 200)
   +- *HashAggregate(keys=[companyid#3364, loyaltycardnumber#3370], functions=[partial_avg(totalprice#3373)])
      +- *Project [loyaltycardnumber#3370, totalprice#3373, companyid#3364]
         +- *BatchedScan parquet facts.storetransaction[loyaltycardnumber#3370,totalprice#3373,year#3362,month#3363,companyid#3364] Format: ParquetFormat, InputPaths: s3://com.birdzi.datalake.test/basedatasets/facts/storetransaction/2016-09-15-2012/year=2002/month..., PushedFilters: [], ReadSchema: struct
{noformat}
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15537205#comment-15537205 ] Dilip Biswal commented on SPARK-17709:

@ashrowty Hi Ashish, is it possible for you to post the explain output for both legs of the join? If we are joining two dataframes df1 and df2, can we get the output of

{code}
df1.explain(true)
df2.explain(true)
{code}

From the error, it seems like key1 and key2 are not present in one leg's output attribute set. If I were to change your test program to the following:

{code}
val df1 = d1.groupBy("key1", "key2")
            .agg(avg("totalprice").as("avgtotalprice"))
df1.explain(true)

val df2 = d1.agg(avg("itemcount").as("avgqty"))
df2.explain(true)

df1.join(df2, Seq("key1", "key2"))
{code}

I am able to see the same error you are seeing.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536694#comment-15536694 ] Ashish Shrowty commented on SPARK-17709:

Sorry ... it's not really col1, it's another column .. edited it to col8
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536464#comment-15536464 ] Dilip Biswal commented on SPARK-17709:

[~ashrowty] Ashish, you have the same column name as both a regular and a partitioning column? I thought Hive didn't allow that?
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535957#comment-15535957 ] Ashish Shrowty commented on SPARK-17709:

Sure .. the data is brought over into the EMR (5.0.0) HDFS cluster via Sqoop. Once there, I issue the following commands in Hive (2.1.0) to store it in S3 -

{noformat}
CREATE EXTERNAL TABLE  (
  col1 bigint,
  col2 int,
  col3 string,
)
PARTITIONED BY (col1 int)
STORED AS PARQUET
LOCATION 's3_table_dir'

INSERT into  SELECT col1,col2,  FROM 
{noformat}
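(Once the external table exists, printSchema and the Catalog API make it easy to confirm how Spark sees the schema, including which column is the partition column. Sketch with placeholder database/table names, not the reporter's actual ones.)

{code}
// Sketch: inspect the schema Spark resolves for the external table.
// "mydb" and "mytable" are placeholders for the real database/table names.
val d1 = spark.sql("select * from mydb.mytable")
d1.printSchema()                                            // partition column shows up as a regular field
spark.catalog.listColumns("mydb", "mytable").show(false)    // isPartition marks the partition column
{code}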
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534910#comment-15534910 ] Xiao Li commented on SPARK-17709:

[~ashrowty] Can you share the exact way you load the external table?
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534518#comment-15534518 ] Ashish Shrowty commented on SPARK-17709:

Dilip, I tried your code and it works on my end too. It's only when I try to load an external table stored as Parquet (in my case it's stored in S3). Attaching the stack trace if that helps (this time I tried it on a different table, hence the difference in column names) -

{noformat}
org.apache.spark.sql.AnalysisException: using columns ['productid] can not be resolved given input columns: [productid, name1, name2] ;
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:126)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:67)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:58)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2589)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:641)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:614)
{noformat}
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534434#comment-15534434 ] Dilip Biswal commented on SPARK-17709: -- [~smilegator] Sure.
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534429#comment-15534429 ] Xiao Li commented on SPARK-17709: - Can you try it in the latest 2.0?
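A quick way to confirm which build is actually running before retrying (here `spark` is the SparkSession created by spark-shell):
{code}
// Prints the version string of the running Spark build, e.g. "2.0.0" or "2.0.1".
println(spark.version)
{code}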
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534417#comment-15534417 ] Dilip Biswal commented on SPARK-17709: -- [~smilegator] Hi Sean, I tried it on my master branch and don't see the exception.
{code}
import org.apache.spark.sql.functions.avg

test("join issue") {
  withTable("tbl") {
    sql("CREATE TABLE tbl(key1 int, key2 int, totalprice int, itemcount int)")
    sql("insert into tbl values (1, 1, 1, 1)")

    val d1 = sql("select * from tbl")
    val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
    val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))

    df1.join(df2, Seq("key1", "key2")).show()
  }
}

Output:
+----+----+-------------+------+
|key1|key2|avgtotalprice|avgqty|
+----+----+-------------+------+
|   1|   1|          1.0|   1.0|
+----+----+-------------+------+
{code}
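Since the report involves a Parquet-backed external table rather than an in-memory one, a hedged variation of the same test that at least stores the table as Parquet (assuming a Hive-enabled test session; the external S3 storage itself is not reproduced, and the table name is illustrative) might look like:
{code}
// Assumption: Hive support is enabled for this test session; table name is illustrative.
test("join issue - parquet-backed table") {
  withTable("tbl_parquet") {
    sql("CREATE TABLE tbl_parquet(key1 int, key2 int, totalprice int, itemcount int) STORED AS PARQUET")
    sql("insert into tbl_parquet values (1, 1, 1, 1)")

    val d1 = sql("select * from tbl_parquet")
    val df1 = d1.groupBy("key1", "key2").agg(avg("totalprice").as("avgtotalprice"))
    val df2 = d1.groupBy("key1", "key2").agg(avg("itemcount").as("avgqty"))

    // Should behave like the plain-table test if the storage format is not the trigger.
    df1.join(df2, Seq("key1", "key2")).show()
  }
}
{code}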
[jira] [Commented] (SPARK-17709) spark 2.0 join - column resolution error
[ https://issues.apache.org/jira/browse/SPARK-17709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15534338#comment-15534338 ] Xiao Li commented on SPARK-17709: - Let me try to reproduce it. Thanks!