[jira] [Updated] (SPARK-40253) Data read exception in orc format

yihangqiao (Jira) Sun, 28 Aug 2022 23:43:05 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-40253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


yihangqiao updated SPARK-40253:
-------------------------------
    Description: 
Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
offset: 0 limit: 0

When running batch jobs with spark-sql using the create table xxx as select 
syntax, the select part uses a constant as a column's default value (0.00 as 
column_name) without specifying its data type. Because the data type is not 
explicitly specified, the metadata for that field in the written ORC file is 
missing, although the write itself succeeds. On read, any query that includes 
this column fails to parse the ORC file with the following error:

 
{code:java}
create table testgg as select 0.00 as gg;
select * from testgg;

Caused by: java.io.IOException: Error reading file: viewfs://bdphdp10/user/hive/warehouse/hadoop/testgg/part-00000-e7df51a1-98b9-4472-9899-3c132b97885b-c000
       at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1291)
       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:227)
       at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:109)
       at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:181)
       at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
       at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
       at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
       at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
       at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
       at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
       at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
       at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
       at org.apache.spark.scheduler.Task.run(Task.scala:121)
       at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
       at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0
       at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
       at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
       at org.apache.orc.impl.RunLengthIntegerReaderV2.nextVector(RunLengthIntegerReaderV2.java:398)
       at org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1205)
       at org.apache.orc.impl.TreeReaderFactory$DecimalTreeReader.nextVector(TreeReaderFactory.java:1279)
       at org.apache.orc.impl.TreeReaderFactory$StructTreeReader.nextBatch(TreeReaderFactory.java:2012)
       at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1284)
       ... 25 more
 {code}
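
The description attributes the failure to the literal having no explicit data 
type. If that is the trigger, a variant of the reproduction that casts the 
literal may avoid it. The following is an unverified sketch only: the table 
name testgg_typed and the type decimal(10, 2) are illustrative assumptions and 
do not come from the report, and it assumes the same environment in which 
create table ... as select writes ORC files.

```sql
-- Hypothetical variant of the reproduction: give the constant an explicit type.
-- testgg_typed and DECIMAL(10, 2) are illustrative choices, not from the report.
CREATE TABLE testgg_typed AS
SELECT CAST(0.00 AS DECIMAL(10, 2)) AS gg;

-- Read the column back; the original report fails at this step.
SELECT * FROM testgg_typed;
```

Comparing the output of describe testgg_typed with describe testgg would show 
whether the two tables actually end up with different column types.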
 

 

  was:
{code:java}
// code placeholder
{code}
When running batch jobs with spark-sql using the create table xxx as select 
syntax, the select part uses a constant as a column's default value (0.00 as 
column_name) without specifying its data type. Because the data type is not 
explicitly specified, the metadata for that field in the written ORC file is 
missing, although the write itself succeeds. On read, any query that includes 
this column fails to parse the ORC file with the following error:


Caused by: java.io.EOFException: Read past end of RLE integer from compressed 
stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 
offset: 0 limit: 0


>  Data read exception in orc format
> ----------------------------------
>
>                 Key: SPARK-40253
>                 URL: https://issues.apache.org/jira/browse/SPARK-40253
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.3
>         Environment: os centos7
> spark 2.4.3
> hive 1.2.1
> hadoop 2.7.2
>            Reporter: yihangqiao
>            Priority: Major
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>



