[jira] [Created] (SPARK-27826) saveAsTable() on a Hive/Impala-created table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch

2019-05-23 Thread fengtlyer (JIRA)
fengtlyer created SPARK-27826:
-

 Summary: saveAsTable() on a Hive/Impala-created table fails with 
"HiveFileFormat" / "ParquetFileFormat" format mismatch
 Key: SPARK-27826
 URL: https://issues.apache.org/jira/browse/SPARK-27826
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.2.0
 Environment: CDH 5.13.1 - Spark version 2.2.0.cloudera2

CDH 6.1.1 - Spark version 2.4.0-cdh6.1.1
Reporter: fengtlyer


Hi Spark Dev Team,

We tested this several times and found that the bug is reproducible across multiple Spark versions.

We tested on CDH 5.13.1 (Spark 2.2.0.cloudera2) and CDH 6.1.1 (Spark 
2.4.0-cdh6.1.1).

Both of them show this bug:
1. If a table was created by Impala or Hive in Hue, then in Spark, 
"write.format("parquet").mode("append").saveAsTable()" causes the format 
mismatch error (see the error log below; a minimal repro sketch follows this 
list).
2. For a table created by Hive/Impala in Hue, 
"write.format("parquet").mode("overwrite").saveAsTable()" does not work 
either.
 2.1 However, after 
"write.format("parquet").mode("overwrite").saveAsTable()" has been run on 
such a table, "write.format("parquet").mode("append").saveAsTable()" works.
3. For a table created by Hive/Impala in Hue, "insertInto()" still works.
 3.1 If a table was created by Hive/Impala in Hue and new records were added 
with "insertInto()", a subsequent 
"write.format("parquet").mode("append").saveAsTable()" fails with the same 
format error.
4. If a Parquet table is created and populated from the Hive shell, 
"write.format("parquet").mode("append").saveAsTable()" can insert data, but 
Spark only shows the rows inserted by Spark, and Hive only shows the rows 
inserted by Hive.
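
For reference, the failing call in step 1 can be reduced to a short 
spark-shell (Scala) sketch like the one below. The table name and CSV path 
are taken from the error log; the CREATE TABLE statement is an assumed 
example of the Hive/Impala side, not the exact DDL we used in Hue.

{code}
// Hypothetical repro sketch for step 1, assuming a spark-shell session with
// Hive support and a table created outside Spark, e.g. in Hue:
//   CREATE TABLE parquet_test_table (id INT, name STRING) STORED AS PARQUET;

val df = spark.read
  .format("csv")
  .option("sep", ",")
  .option("header", "true")
  .load("hdfs:///temp1/test_paquettest.csv")

// Fails with AnalysisException: the metastore records the table's format as
// HiveFileFormat, while the writer requests ParquetFileFormat.
df.write.format("parquet").mode("append").saveAsTable("parquet_test_table")

// Works (step 3): insertInto() resolves against the existing table
// definition instead of comparing data source formats.
df.write.mode("append").insertInto("parquet_test_table")
{code}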


=== Error Log ===

{code}
spark.read.format("csv").option("sep",",").option("header","true").load("hdfs:///temp1/test_paquettest.csv").write.format("parquet").mode("append").saveAsTable("parquet_test_table")
{code}

{code}
org.apache.spark.sql.AnalysisException: The format of the existing table default.parquet_test_table is `HiveFileFormat`. It doesn't match the specified format `ParquetFileFormat`.;
at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:115)
at org.apache.spark.sql.execution.datasources.PreprocessTableCreation$$anonfun$apply$2.applyOrElse(rules.scala:75)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:75)
at org.apache.spark.sql.execution.datasources.PreprocessTableCreation.apply(rules.scala:71)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:69)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:67)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:50)
at org.apache.spark.sql.execution.QueryExecution.withCachedData$lzycompute(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.withCachedData(QueryExecution.scala:72)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.completeString(QueryExecution.scala:220)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:203)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
at org.apache.spark.sql.D
{code}

[jira] [Commented] (SPARK-27826) saveAsTable() on a Hive/Impala-created table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch

2019-05-27 Thread fengtlyer (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848811#comment-16848811
 ] 

fengtlyer commented on SPARK-27826:
---

Hi Hyukjin,

Our team thinks this is a compatibility issue.

We fully understand that this line of code would work if we used 
format("hive"). However, all of our jobs are expected to work with the 
"parquet" format.

Why is the table not treated as Parquet format when it was created as a 
Parquet table by Impala?

If we create the table in Hue with an Impala SQL query using "stored as 
parquet", the files we see on HDFS end with ".parq", but we still cannot use 
"write.format("parquet").mode("append").saveAsTable()" to append to this 
table.

We think there is a compatibility issue here.
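
For completeness, a minimal sketch of the format("hive") workaround 
mentioned above, assuming the same CSV source and table as in the error log:

{code}
// Sketch of the suggested format("hive") workaround. Writing with the
// `hive` provider matches the HiveFileFormat entry in the metastore, so the
// format check in PreprocessTableCreation passes; the data should still be
// stored as Parquet because of the table's "STORED AS PARQUET" definition.

val df = spark.read
  .format("csv")
  .option("sep", ",")
  .option("header", "true")
  .load("hdfs:///temp1/test_paquettest.csv")

df.write.format("hive").mode("append").saveAsTable("parquet_test_table")
{code}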


[jira] [Commented] (SPARK-27826) saveAsTable() on a Hive/Impala-created table fails with "HiveFileFormat" / "ParquetFileFormat" format mismatch

2022-02-16 Thread fengtlyer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493316#comment-17493316
 ] 

fengtlyer commented on SPARK-27826:
---

Hi Bandhu, sorry for the late reply.

In the end, we decided to re-create all of the tables one way (with Spark), 
and we have not faced this issue since.
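
For anyone who lands here, the fix was along the lines of the sketch below; 
the new table name is illustrative, not our exact migration script.

{code}
// Hypothetical sketch of the "re-create everything with Spark" approach.
// Letting saveAsTable() create the table registers it in the metastore as a
// Spark data source table with provider `parquet`, so later
// format("parquet") appends match the stored format.

val old = spark.table("parquet_test_table")
old.write.format("parquet").mode("overwrite")
  .saveAsTable("parquet_test_table_spark")

// Subsequent appends from Spark now pass the format check.
val newRows = spark.read
  .format("csv")
  .option("header", "true")
  .load("hdfs:///temp1/test_paquettest.csv")
newRows.write.format("parquet").mode("append")
  .saveAsTable("parquet_test_table_spark")
{code}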
