XiaoLong Wu created SPARK-42715:
-----------------------------------

             Summary: NegativeArraySizeException by too many datas read from ORC file
                 Key: SPARK-42715
                 URL: https://issues.apache.org/jira/browse/SPARK-42715
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: XiaoLong Wu
Should we provide a friendlier exception message about how to avoid this exception? For example, when we catch this exception, we could tell the user to reduce the value of spark.sql.orc.columnarReaderBatchSize.

In the current version, batch reading of ORC files is done through OrcColumnarBatchReader.nextBatch(), which relies on [ORC|https://github.com/apache/orc] (version 1.8.2) to copy the data. The relevant ORC code is as follows:

{code:java}
private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
    LongColumnVector scratchlcv,
    BytesColumnVector result, final int batchSize) throws IOException {
  // Read lengths
  scratchlcv.isRepeating = result.isRepeating;
  scratchlcv.noNulls = result.noNulls;
  scratchlcv.isNull = result.isNull; // Notice we are replacing the isNull vector here...
  lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);

  int totalLength = 0;
  if (!scratchlcv.isRepeating) {
    for (int i = 0; i < batchSize; i++) {
      if (!scratchlcv.isNull[i]) {
        totalLength += (int) scratchlcv.vector[i];
      }
    }
  } else {
    if (!scratchlcv.isNull[0]) {
      totalLength = (int) (batchSize * scratchlcv.vector[0]);
    }
  }

  // Read all the strings for this batch
  byte[] allBytes = new byte[totalLength];
  int offset = 0;
  int len = totalLength;
  while (len > 0) {
    int bytesRead = stream.read(allBytes, offset, len);
    if (bytesRead < 0) {
      throw new EOFException("Can't finish byte read from " + stream);
    }
    len -= bytesRead;
    offset += bytesRead;
  }
  return allBytes;
}
{code}

As shown above, totalLength is an int that accumulates the per-row (long) lengths of the batch. If the total byte size of one batch exceeds Integer.MAX_VALUE, the accumulation overflows to a negative value and new byte[totalLength] throws the following exception:

{code:java}
Caused by: java.lang.NegativeArraySizeException
    at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
    at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
    at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
    at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
    at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
    at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
    at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
    ... 20 more
{code}
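A minimal sketch of what the friendlier message could look like, assuming the catch is added around the ORC batch read that OrcColumnarBatchReader.nextBatch() delegates to; the class, interface, and method names below are illustrative placeholders, not the actual Spark internals:

{code:java}
import java.io.IOException;

// Sketch only: wrap the ORC batch read and rethrow NegativeArraySizeException
// with a hint pointing the user at spark.sql.orc.columnarReaderBatchSize.
public class FriendlierOrcBatchRead {

  /** Hypothetical stand-in for the ORC call that may overflow. */
  interface BatchRead {
    boolean read() throws IOException;
  }

  static boolean nextBatchWithHint(BatchRead orcNextBatch) throws IOException {
    try {
      return orcNextBatch.read();
    } catch (NegativeArraySizeException e) {
      // Surface an actionable message instead of the raw overflow error.
      throw new IOException(
          "The total byte size of one ORC batch exceeded Integer.MAX_VALUE. "
              + "Consider reducing spark.sql.orc.columnarReaderBatchSize "
              + "so that fewer rows are read per batch.", e);
    }
  }
}
{code}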
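Until such a message exists, the workaround implied above is to lower the batch size by hand. A usage sketch, assuming the default of 4096 rows per batch and an example value of 1024 (the right value depends on how wide the string columns are); the path is illustrative:

{code:java}
import org.apache.spark.sql.SparkSession;

public class ReduceOrcBatchSize {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("orc-batch-size-workaround")
        .getOrCreate();

    // Default is 4096 rows per batch; with very large string values the
    // per-batch byte total can overflow the int accumulator in ORC.
    spark.conf().set("spark.sql.orc.columnarReaderBatchSize", "1024");

    spark.read().orc("/path/to/orc").show();
  }
}
{code}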