[ https://issues.apache.org/jira/browse/SPARK-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14681246#comment-14681246 ]
Herman van Hovell commented on SPARK-9813:
------------------------------------------

So I am not too sure if we want to maintain that level of Hive compatibility. It seems a bit too strict. Any kind of union should be fine as long as the data types match (IMHO). Is there a realistic use case for this?

> Incorrect UNION ALL behavior
> ----------------------------
>
>                  Key: SPARK-9813
>                  URL: https://issues.apache.org/jira/browse/SPARK-9813
>              Project: Spark
>           Issue Type: Bug
>           Components: Spark Core, SQL
>     Affects Versions: 1.4.1
>          Environment: Ubuntu on AWS
>             Reporter: Simeon Simeonov
>               Labels: sql, union
>
> According to the [Hive Language Manual|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Union] for UNION ALL:
> {quote}
> The number and names of columns returned by each select_statement have to be the same. Otherwise, a schema error is thrown.
> {quote}
> Spark SQL silently swallows an error when the tables being combined with UNION ALL have the same number of columns but different names.
>
> Reproducible example:
> {code}
> // This test is meant to run in spark-shell
> import java.io.File
> import java.io.PrintWriter
> import org.apache.spark.sql.hive.HiveContext
>
> val ctx = sqlContext.asInstanceOf[HiveContext]
> import ctx.implicits._
>
> def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
>
> def tempTable(name: String, json: String) = {
>   val path = dataPath(name)
>   new PrintWriter(path) { write(json); close() }
>   ctx.read.json("file://" + path).registerTempTable(name)
> }
>
> // Note category vs. cat names of the first column
> tempTable("test_one", """{"category" : "A", "num" : 5}""")
> tempTable("test_another", """{"cat" : "A", "num" : 5}""")
>
> // +--------+---+
> // |category|num|
> // +--------+---+
> // |       A|  5|
> // |       A|  5|
> // +--------+---+
> //
> // Instead, an error should have been raised due to the incompatible schemas
> ctx.sql("select * from test_one union all select * from test_another").show
>
> // Cleanup
> new File(dataPath("test_one")).delete()
> new File(dataPath("test_another")).delete()
> {code}
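
For comparison, this mismatch is easy to detect up front by comparing the two schemas before running the union. A minimal sketch against the same 1.4-era API as the example above; the helper name requireSameColumnNames is illustrative, not a Spark API:

{code}
// Hypothetical guard, not part of Spark: fail fast when the two sides of a
// UNION ALL have differently named columns. Assumes the spark-shell session
// from the example above, where ctx is a HiveContext.
import org.apache.spark.sql.DataFrame

def requireSameColumnNames(left: DataFrame, right: DataFrame): Unit = {
  val l = left.schema.fieldNames.toSeq
  val r = right.schema.fieldNames.toSeq
  require(l == r, s"UNION ALL schema mismatch: $l vs. $r")
}

// Throws IllegalArgumentException:
//   UNION ALL schema mismatch: List(category, num) vs. List(cat, num)
requireSameColumnNames(ctx.table("test_one"), ctx.table("test_another"))
{code}
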
> When the number of columns is different, Spark can even mix datatypes in a single output column.
>
> Reproducible example (requires a new spark-shell session):
> {code}
> // This test is meant to run in spark-shell
> import java.io.File
> import java.io.PrintWriter
> import org.apache.spark.sql.hive.HiveContext
>
> val ctx = sqlContext.asInstanceOf[HiveContext]
> import ctx.implicits._
>
> def dataPath(name: String) = sys.env("HOME") + "/" + name + ".jsonlines"
>
> def tempTable(name: String, json: String) = {
>   val path = dataPath(name)
>   new PrintWriter(path) { write(json); close() }
>   ctx.read.json("file://" + path).registerTempTable(name)
> }
>
> // Note test_another is missing the category column
> tempTable("test_one", """{"category" : "A", "num" : 5}""")
> tempTable("test_another", """{"num" : 5}""")
>
> // +--------+
> // |category|
> // +--------+
> // |       A|
> // |       5|
> // +--------+
> //
> // Instead, an error should have been raised due to the incompatible schemas
> ctx.sql("select * from test_one union all select * from test_another").show
>
> // Cleanup
> new File(dataPath("test_one")).delete()
> new File(dataPath("test_another")).delete()
> {code}
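
The arity mismatch in this second example is just as detectable before the query runs; the two temp tables do not even agree on column count. A sketch, again assuming the spark-shell session above:

{code}
// test_one has two columns (category, num); test_another has one (num), so
// the union above should have failed analysis instead of returning rows.
val one = ctx.table("test_one")
val another = ctx.table("test_another")
println(s"test_one columns: ${one.schema.fieldNames.mkString(", ")}")         // category, num
println(s"test_another columns: ${another.schema.fieldNames.mkString(", ")}") // num
assert(one.schema.length != another.schema.length)
{code}
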
> At other times, when the schemas are complex, Spark SQL produces a misleading error about an unresolved Union operator:
> {code}
> scala> ctx.sql("""select * from view_clicks
>      | union all
>      | select * from view_clicks_aug
>      | """)
> 15/08/11 02:40:25 INFO ParseDriver: Parsing command: select * from view_clicks
> union all
> select * from view_clicks_aug
> 15/08/11 02:40:25 INFO ParseDriver: Parse Completed
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO HiveMetaStore: 0: get_table : db=default tbl=view_clicks_aug
> 15/08/11 02:40:25 INFO audit: ugi=ubuntu ip=unknown-ip-addr cmd=get_table : db=default tbl=view_clicks_aug
> org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:126)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:98)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$foreachUp$1.apply(TreeNode.scala:97)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:97)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
>   at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:42)
>   at org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:931)
>   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:131)
>   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:755)
> {code}
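
Until the analyzer reports a clearer error here, a defensive workaround is to project both sides onto an explicit, shared column list before unioning, so resolution never depends on column position. A sketch against the same 1.4-era DataFrame API; the helper name unionByName and the example column names are illustrative:

{code}
// Hypothetical helper, not a Spark API: select the same named columns, in the
// same order, from both sides before UNION ALL. A missing or misnamed column
// then fails immediately in select() with a clear "cannot resolve" error.
import org.apache.spark.sql.DataFrame

def unionByName(left: DataFrame, right: DataFrame, cols: Seq[String]): DataFrame = {
  val l = left.select(cols.map(left(_)): _*)
  val r = right.select(cols.map(right(_)): _*)
  l.unionAll(r)
}

// e.g. (illustrative column names):
// unionByName(ctx.table("view_clicks"), ctx.table("view_clicks_aug"), Seq("click_id", "ts"))
{code}

Later Spark releases added a built-in Dataset.unionByName with essentially these semantics.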