[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true, which is the default value. Error: Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304454#comment-17304454 ]

Dongjoon Hyun commented on SPARK-34785:
----------------------------------------

FYI, each data source has different schema evolution capabilities, and Parquet 
is not the best built-in data source in this respect. We are tracking this with 
the following test suite:
- 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala
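
To make the kind of coverage that suite tracks concrete, below is a minimal 
spark-shell sketch of this issue's scenario; it is an illustration, not code 
taken from ReadSchemaTest.scala, and the /tmp/test_decimal path is 
hypothetical. Note the file in this issue was written by Hive, which stores 
the decimal as FIXED_LEN_BYTE_ARRAY; a Spark-written file may use a different 
physical type, so the exact error message can differ.

{code:scala}
// Sketch only: reproduce the "write with one decimal precision, read with a
// widened one" read-schema-evolution scenario from this issue.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DecimalType, StructType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Write one row with the original schema, decimal(18,2).
spark.sql("SELECT CAST(100 AS DECIMAL(18,2)) AS amt")
  .write.mode("overwrite").parquet("/tmp/test_decimal")  // hypothetical path

// Declare the widened read schema, decimal(19,3).
val widened = new StructType().add("amt", DecimalType(19, 3))

// Per this issue: with spark.sql.parquet.enableVectorizedReader=true (the
// default) the read fails with "Parquet column cannot be converted in file";
// with it set to false the query returns 100.000.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.schema(widened).parquet("/tmp/test_decimal").show()
{code}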

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value. Error: Parquet column cannot be converted in file.
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-34785
>                 URL: https://issues.apache.org/jira/browse/SPARK-34785
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: jobit mathew
>            Priority: Major
>
> The SPARK-34212 issue is not fixed when spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, 
> but it reduces read performance.
> In Hive:
> {code:sql}
> create table test_decimal(amt decimal(18,2)) stored as parquet;
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark:
> {code:sql}
> select * from test_decimal;
> {code}
> {code}
> +-------+
> |    amt|
> +-------+
> |100.000|
> +-------+
> {code}
> But with spark.sql.parquet.enableVectorizedReader=true, the query fails with the error below:
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4) (vm2 executor 2): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> 	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> 	at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
> 	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> 	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> 	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> 	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> 	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> 	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> 	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> 	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> 	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> 	at org.apache.spark.scheduler.Task.run(Task.scala:131)
> 	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> 	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> 	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:312)
> 	at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
> 	at org.apa

[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true, which is the default value. Error: Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304453#comment-17304453 ]

Dongjoon Hyun commented on SPARK-34785:
----------------------------------------

SPARK-34212 is complete by itself because it was designed to fix the 
correctness issue: you will not get incorrect values.
For this specific `Schema evolution` requirement in the PR description, I 
don't have the bandwidth, [~jobitmathew].
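
Until the vectorized Parquet reader supports this kind of schema evolution, 
the issue description itself implies a session-level workaround. Here is a 
minimal spark-shell sketch of it (the test_decimal table comes from the repro 
above; the performance caveat is also from the description):

{code:scala}
// Workaround from the issue description: fall back to the non-vectorized
// (parquet-mr) reader. The description notes the query then passes at the
// cost of read performance. In spark-shell, `spark` is the predefined session.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()
// +-------+
// |    amt|
// +-------+
// |100.000|
// +-------+
{code}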


[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true, which is the default value. Error: Parquet column cannot be converted in file.

2021-03-18 Thread jobit mathew (Jira)


[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303904#comment-17303904 ]

jobit mathew commented on SPARK-34785:
---------------------------------------

[~dongjoon], can you check this once?
