[jira] [Assigned] (SPARK-42715) NegativeArraySizeException when too much data is read from an ORC file
[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42715:

    Assignee: Apache Spark

> NegativeArraySizeException when too much data is read from an ORC file
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42715
>                 URL: https://issues.apache.org/jira/browse/SPARK-42715
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.2
>            Reporter: XiaoLong Wu
>            Assignee: Apache Spark
>            Priority: Minor
>
> Should we provide a friendlier exception message explaining how to avoid this error? For example, when the exception is caught, tell the user that reducing spark.sql.orc.columnarReaderBatchSize can avoid it.
> In the current version, batch reads of ORC files go through OrcColumnarBatchReader.nextBatch(), which relies on [ORC|https://github.com/apache/orc] (version 1.8.2) to copy the data. The relevant ORC code is as follows:
> {code:java}
> private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
>     LongColumnVector scratchlcv,
>     BytesColumnVector result, final int batchSize) throws IOException {
>   // Read lengths
>   scratchlcv.isRepeating = result.isRepeating;
>   scratchlcv.noNulls = result.noNulls;
>   scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull vector here...
>   lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
>   int totalLength = 0;
>   if (!scratchlcv.isRepeating) {
>     for (int i = 0; i < batchSize; i++) {
>       if (!scratchlcv.isNull[i]) {
>         totalLength += (int) scratchlcv.vector[i];
>       }
>     }
>   } else {
>     if (!scratchlcv.isNull[0]) {
>       totalLength = (int) (batchSize * scratchlcv.vector[0]);
>     }
>   }
>   // Read all the strings for this batch
>   byte[] allBytes = new byte[totalLength];
>   int offset = 0;
>   int len = totalLength;
>   while (len > 0) {
>     int bytesRead = stream.read(allBytes, offset, len);
>     if (bytesRead < 0) {
>       throw new EOFException("Can't finish byte read from " + stream);
>     }
>     len -= bytesRead;
>     offset += bytesRead;
>   }
>   return allBytes;
> } {code}
> As shown above, the per-row string lengths are long values, but each is cast to int and accumulated into totalLength, an int, which then sizes the byte array for the whole batch.
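For illustration (not part of the issue text), a standalone sketch of that accumulation overflowing: 4096 rows of roughly 1 MB strings total about 4.1 GB, which wraps an int sum negative, so the subsequent array allocation throws.

{code:java}
// Standalone demo of the overflow pattern in commonReadByteArrays:
// summing int-cast lengths wraps once the total passes Integer.MAX_VALUE.
public class OverflowDemo {
  public static void main(String[] args) {
    long[] lengths = new long[4096];            // one length per row in the batch
    java.util.Arrays.fill(lengths, 1_000_000L); // 4096 * 1e6 > Integer.MAX_VALUE

    int totalLength = 0;
    for (long len : lengths) {
      totalLength += (int) len;                 // same cast as the ORC code
    }
    System.out.println(totalLength);            // prints -198967296

    byte[] allBytes = new byte[totalLength];    // java.lang.NegativeArraySizeException
  }
}
{code}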
> If the total data size exceeds Integer.MAX_VALUE, the conversion to int overflows to a negative value and the array allocation throws the following exception:
> {code:java}
> Caused by: java.lang.NegativeArraySizeException
>   at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
>   at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
>   at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
>   at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
>   at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
>   at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
>   at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
>   at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
>   at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
>   ... 20 more {code}
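One possible shape for the friendlier failure the reporter asks for, sketched outside the issue text (the class and method names here are invented for illustration, not the actual ORC or Spark patch): accumulate the lengths in a long and raise an actionable error before the cast can wrap.

{code:java}
import java.io.IOException;

// Illustrative sketch only: a guarded version of the length accumulation.
// Summing into a long cannot wrap for realistic batch sizes, so the
// oversized batch can be detected and reported with a clear remedy.
public final class SafeBatchAlloc {
  static byte[] allocateForBatch(long[] lengths, boolean[] isNull, int batchSize)
      throws IOException {
    long totalLength = 0;
    for (int i = 0; i < batchSize; i++) {
      if (!isNull[i]) {
        totalLength += lengths[i];  // long arithmetic: no int wraparound
      }
    }
    if (totalLength > Integer.MAX_VALUE) {
      throw new IOException("Batch requires " + totalLength + " bytes, more than a "
          + "Java array can hold; consider reducing "
          + "spark.sql.orc.columnarReaderBatchSize.");
    }
    return new byte[(int) totalLength];
  }
}
{code}

In the meantime, lowering spark.sql.orc.columnarReaderBatchSize (default 4096), e.g. spark.conf.set("spark.sql.orc.columnarReaderBatchSize", "1024") with an illustrative value, shrinks each batch so the total stays under Integer.MAX_VALUE.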
[jira] [Assigned] (SPARK-42715) NegativeArraySizeException when too much data is read from an ORC file

[ https://issues.apache.org/jira/browse/SPARK-42715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42715:

    Assignee: (was: Apache Spark)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org