[jira] [Assigned] (SPARK-42715) NegativeArraySizeException when too much data is read from an ORC file
[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42715:

    Assignee: Apache Spark

> NegativeArraySizeException when too much data is read from an ORC file
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42715
>                 URL: https://issues.apache.org/jira/browse/SPARK-42715
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: XiaoLong Wu
>            Assignee: Apache Spark
>            Priority: Minor
>
> Should we provide a friendlier exception message explaining how to avoid this error? For example, when the exception is caught, tell the user that reducing spark.sql.orc.columnarReaderBatchSize can avoid it.
> In the current version, batch reads of ORC files go through OrcColumnarBatchReader.nextBatch(), which relies on [ORC|https://github.com/apache/orc] (version 1.8.2) to copy the data. The relevant ORC code is as follows:
> {code:java}
> private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
>     LongColumnVector scratchlcv,
>     BytesColumnVector result, final int batchSize) throws IOException {
>   // Read lengths
>   scratchlcv.isRepeating = result.isRepeating;
>   scratchlcv.noNulls = result.noNulls;
>   scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull vector here...
>   lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
>   int totalLength = 0;
>   if (!scratchlcv.isRepeating) {
>     for (int i = 0; i < batchSize; i++) {
>       if (!scratchlcv.isNull[i]) {
>         totalLength += (int) scratchlcv.vector[i];
>       }
>     }
>   } else {
>     if (!scratchlcv.isNull[0]) {
>       totalLength = (int) (batchSize * scratchlcv.vector[0]);
>     }
>   }
>   // Read all the strings for this batch
>   byte[] allBytes = new byte[totalLength];
>   int offset = 0;
>   int len = totalLength;
>   while (len > 0) {
>     int bytesRead = stream.read(allBytes, offset, len);
>     if (bytesRead < 0) {
>       throw new EOFException("Can't finish byte read from " + stream);
>     }
>     len -= bytesRead;
>     offset += bytesRead;
>   }
>   return allBytes;
> } {code}
> As shown above, the per-row string lengths are long values, but each is cast to int and accumulated into totalLength, an int, which then sizes the byte array for the whole batch.
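For illustration (not part of the issue text), a standalone sketch of that accumulation overflowing: 4096 rows of roughly 1 MB strings total about 4.1 GB, which wraps an int sum negative, so the subsequent array allocation throws.

{code:java}
// Standalone demo of the overflow pattern in commonReadByteArrays:
// summing int-cast lengths wraps once the total passes Integer.MAX_VALUE.
public class OverflowDemo {
  public static void main(String[] args) {
    long[] lengths = new long[4096];            // one length per row in the batch
    java.util.Arrays.fill(lengths, 1_000_000L); // 4096 * 1e6 > Integer.MAX_VALUE

    int totalLength = 0;
    for (long len : lengths) {
      totalLength += (int) len;                 // same cast as the ORC code
    }
    System.out.println(totalLength);            // prints -198967296

    byte[] allBytes = new byte[totalLength];    // java.lang.NegativeArraySizeException
  }
}
{code}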
> If the total data size exceeds Integer.MAX_VALUE, the conversion to int overflows to a negative value and the array allocation throws the following exception:
> {code:java}
> Caused by: java.lang.NegativeArraySizeException
>   at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
>   at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
>   at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
>   at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
>   at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
>   at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
>   at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
>   ... 20 more {code}
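One possible shape for the friendlier failure the reporter asks for, sketched outside the issue text (the class and method names here are invented for illustration, not the actual ORC or Spark patch): accumulate the lengths in a long and raise an actionable error before the cast can wrap.

{code:java}
import java.io.IOException;

// Illustrative sketch only: a guarded version of the length accumulation.
// Summing into a long cannot wrap for realistic batch sizes, so the
// oversized batch can be detected and reported with a clear remedy.
public final class SafeBatchAlloc {
  static byte[] allocateForBatch(long[] lengths, boolean[] isNull, int batchSize)
      throws IOException {
    long totalLength = 0;
    for (int i = 0; i < batchSize; i++) {
      if (!isNull[i]) {
        totalLength += lengths[i];  // long arithmetic: no int wraparound
      }
    }
    if (totalLength > Integer.MAX_VALUE) {
      throw new IOException("Batch requires " + totalLength + " bytes, more than a "
          + "Java array can hold; consider reducing "
          + "spark.sql.orc.columnarReaderBatchSize.");
    }
    return new byte[(int) totalLength];
  }
}
{code}

In the meantime, lowering spark.sql.orc.columnarReaderBatchSize (default 4096), e.g. spark.conf.set("spark.sql.orc.columnarReaderBatchSize", "1024") with an illustrative value, shrinks each batch so the total stays under Integer.MAX_VALUE.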
[jira] [Assigned] (SPARK-42715) NegativeArraySizeException when too much data is read from an ORC file

[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42715:

    Assignee: (was: Apache Spark)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org