[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304454#comment-17304454 ]

Dongjoon Hyun commented on SPARK-34785:
---------------------------------------

FYI, each data source has different schema evolution capabilities, and Parquet is not the best built-in data source in this regard. We are tracking this with the following test suite:
- https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true
> which is default value. Error Parquet column cannot be converted in file.
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-34785
>                 URL: https://issues.apache.org/jira/browse/SPARK-34785
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.1
>            Reporter: jobit mathew
>            Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, but it reduces read performance.
>
> In Hive:
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet;
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark:
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> +---------+
> |amt      |
> +---------+
> | 100.000 |
> +---------+
> {code}
> But if spark.sql.parquet.enableVectorizedReader=true, the query fails with the error below:
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0
> (TID 4) (vm2 executor 2):
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot
> be converted in file hdfs://hacluster/user/hive/warehouse/test_decimal/00_0.
> Column: [amt], Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
>     at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>     at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>     at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>     at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
>     at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>     at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>     at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>     at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>     at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>     at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>     at org.apache.spark.scheduler.Task.run(Task.scala:131)
>     at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
>     at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:312)
>     at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:283)
>     at org.apa
> {code}
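For anyone hitting this, a minimal sketch of the session-level workaround described in the report, assuming a Hive-enabled SparkSession bound to {{spark}} (e.g. in spark-shell). It trades the vectorized reader's speed for a read path that handles the widened decimal:

{code:scala}
// Fall back to the non-vectorized Parquet reader for this session so the
// decimal(18,2) -> decimal(19,3) widening is handled; slower, as noted above.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()

// Restore the default afterwards if the rest of the workload is unaffected.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
{code}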
[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304453#comment-17304453 ]

Dongjoon Hyun commented on SPARK-34785:
---------------------------------------

SPARK-34212 is complete by itself because it was designed to fix the correctness issue: you will not get incorrect values. For this specific `Schema evolution` requirement in the PR description, I don't have the bandwidth, [~jobitmathew].
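Until the vectorized path supports this kind of evolution, another hedged alternative is to keep the vectorized reader enabled but read the Parquet files with the schema they were actually written with, then cast explicitly. This is only a sketch; the HDFS path is taken from the error message above, and it assumes reading the managed table's files directly is acceptable:

{code:scala}
// Read the files with their on-disk schema (decimal(18,2), matching the
// FIXED_LEN_BYTE_ARRAY layout in the error), then widen via an explicit cast.
// The path comes from the stack trace; adjust for your cluster.
val df = spark.read
  .schema("amt DECIMAL(18,2)")
  .parquet("hdfs://hacluster/user/hive/warehouse/test_decimal")
df.selectExpr("CAST(amt AS DECIMAL(19,3)) AS amt").show()
{code}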
[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.
[ https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303904#comment-17303904 ]

jobit mathew commented on SPARK-34785:
--------------------------------------

[~dongjoon], could you please take a look at this?