[jira] [Created] (SPARK-27826) saveAsTable() on an existing table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch
fengtlyer created SPARK-27826:

Summary: saveAsTable() on an existing table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch
Key: SPARK-27826
URL: https://issues.apache.org/jira/browse/SPARK-27826
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.4.0, 2.2.0
Environment: CDH 5.13.1 - Spark version 2.2.0.cloudera2
CDH 6.1.1 - Spark version 2.4.0-cdh6.1.1
Reporter: fengtlyer

Hi Spark Dev Team,

We tested a few times and found that this bug reproduces on multiple Spark versions. We tested on CDH 5.13.1 (Spark 2.2.0.cloudera2) and CDH 6.1.1 (Spark 2.4.0-cdh6.1.1), and both show the bug:

1. If a table was created by Impala or Hive in Hue, then write.format("parquet").mode("append").saveAsTable() in Spark raises the format error below.
2. For a table created by Hive/Impala in Hue, write.format("parquet").mode("overwrite").saveAsTable() also does not work.
2.1 For a table created by Hive/Impala in Hue, running write.format("parquet").mode("overwrite").saveAsTable() first and then write.format("parquet").mode("append").saveAsTable() does work.
3. For a table created by Hive/Impala in Hue, insertInto() still works.
3.1 For a table created by Hive/Impala in Hue, after inserting some new records with insertInto(), a subsequent write.format("parquet").mode("append").saveAsTable() fails with the same format error.
4. For a Parquet table created and populated from the Hive shell, write.format("parquet").mode("append").saveAsTable() can insert data, but Spark only shows the rows inserted by Spark, and Hive only shows the rows inserted by Hive.

===
Error Log
===

spark.read.format("csv").option("sep",",").option("header","true").load("hdfs:///temp1/test_paquettest.csv").write.format("parquet").mode("append").saveAsTable("parquet_test_table")

org.apache.spark.sql.AnalysisException: The format of the existing table default.parquet_test_table is `HiveFileFormat`.
It doesn't match the specified format `ParquetFileFormat`.;
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
  at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
  at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
  at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
  at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
  at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.completeString(QueryExecution.scala:220)
  at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:203)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
  at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
  at org.apache.spark.sql.D
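For readers hitting the same error, the behavior in steps 1 and 2.1 can be sketched with a toy model. This is not Spark's actual implementation; the `metastore` dict and the function names below are hypothetical. It only illustrates the check performed by PreprocessTableCreation (visible in the stack trace): the metastore records which provider a table was created with, "append" verifies the requested format against that record, and "overwrite" re-creates the table, replacing the recorded provider.

```python
class AnalysisException(Exception):
    """Stand-in for org.apache.spark.sql.AnalysisException."""

# Toy metastore: table name -> provider recorded at creation time.
# A table created by Hive/Impala is recorded as a Hive table.
metastore = {"parquet_test_table": "HiveFileFormat"}

def save_as_table(name, fmt, mode):
    """Hypothetical model of DataFrameWriter.saveAsTable's format check."""
    existing = metastore.get(name)
    if mode == "append" and existing is not None and existing != fmt:
        raise AnalysisException(
            f"The format of the existing table {name} is `{existing}`. "
            f"It doesn't match the specified format `{fmt}`.")
    # Append with a matching format, overwrite, or a brand-new table:
    # record the provider the writer specified.
    metastore[name] = fmt
```

Under this model, the first parquet append against the Hive-created table fails, but after an overwrite the table is recorded with the Parquet provider and subsequent appends succeed, which matches observation 2.1.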
[jira] [Commented] (SPARK-27826) saveAsTable() on an existing table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch
[ https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848811#comment-16848811 ] fengtlyer commented on SPARK-27826:

Hi Hyukjin, our team thinks this is a compatibility issue. We fully understand that this line of code would work if we used format("hive"); however, all of our jobs are supposed to work with the "parquet" format. Why is the table not treated as Parquet when Impala created it as a Parquet table? When we create the table with "STORED AS PARQUET" in Impala SQL via Hue, the files on HDFS end with ".parq", yet we cannot append to the table with write.format("parquet").mode("append").saveAsTable(). We think there is a compatibility issue here.
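As a quick illustration of why format("hive") succeeds where format("parquet") fails, here is a toy model (hypothetical names, not Spark's code): the check compares the provider recorded in the metastore against the one the writer specifies, regardless of what the files on HDFS actually contain.

```python
class AnalysisException(Exception):
    """Stand-in for org.apache.spark.sql.AnalysisException."""

# What the metastore recorded when Impala/Hive created the table; the .parq
# files on HDFS are valid Parquet, but that is not what the check looks at.
existing_provider = "HiveFileFormat"

def check_append(specified_provider):
    """Hypothetical model of the append-mode format check."""
    if specified_provider != existing_provider:
        raise AnalysisException(
            f"The format of the existing table is `{existing_provider}`. "
            f"It doesn't match the specified format `{specified_provider}`.")
    return "append ok"

# format("parquet") resolves to ParquetFileFormat and fails the check;
# format("hive") resolves to HiveFileFormat and passes it.
```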
[jira] [Commented] (SPARK-27826) saveAsTable() on an existing table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch
[ https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493316#comment-17493316 ] fengtlyer commented on SPARK-27826:

Hi Bandhu, sorry for the late reply. In the end we re-created all of the tables a single way (with Spark), and we have not seen this issue since.