XiaoLong Wu created SPARK-42715:
-----------------------------------

             Summary: NegativeArraySizeException when too much data is read from an ORC file
                 Key: SPARK-42715
                 URL: https://issues.apache.org/jira/browse/SPARK-42715
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.2
            Reporter: XiaoLong Wu


Could we provide a friendlier exception message explaining how to avoid this error? For example, when this exception is caught, tell the user that they can reduce the value of spark.sql.orc.columnarReaderBatchSize.
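
For reference, this is the workaround the message would point users to. A minimal sketch, assuming a standard SparkSession; the value 1024 is only an illustration (the default is 4096) and the input path is a placeholder:
{code:java}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrcBatchSizeExample {
  public static void main(String[] args) {
    // Lower the ORC columnar batch size so each batch carries fewer rows,
    // keeping the total byte length of the string columns in one batch
    // below Integer.MAX_VALUE.
    SparkSession spark = SparkSession.builder()
        .appName("orc-batch-size-example")
        .config("spark.sql.orc.columnarReaderBatchSize", 1024)  // example value
        .getOrCreate();

    Dataset<Row> df = spark.read().orc("/path/to/orc");  // placeholder path
    df.show();
  }
}
{code}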

In the current version, batch reading of ORC files goes through OrcColumnarBatchReader.nextBatch(), which depends on [ORC|https://github.com/apache/orc] (version 1.8.2) to copy the data. The relevant code in ORC is as follows:
{code:java}
private static byte[] commonReadByteArrays(InStream stream, IntegerReader lengths,
    LongColumnVector scratchlcv,
    BytesColumnVector result, final int batchSize) throws IOException {
  // Read lengths
  scratchlcv.isRepeating = result.isRepeating;
  scratchlcv.noNulls = result.noNulls;
  scratchlcv.isNull = result.isNull;  // Notice we are replacing the isNull vector here...
  lengths.nextVector(scratchlcv, scratchlcv.vector, batchSize);
  int totalLength = 0;
  if (!scratchlcv.isRepeating) {
    for (int i = 0; i < batchSize; i++) {
      if (!scratchlcv.isNull[i]) {
        totalLength += (int) scratchlcv.vector[i];
      }
    }
  } else {
    if (!scratchlcv.isNull[0]) {
      totalLength = (int) (batchSize * scratchlcv.vector[0]);
    }
  }

  // Read all the strings for this batch
  byte[] allBytes = new byte[totalLength];
  int offset = 0;
  int len = totalLength;
  while (len > 0) {
    int bytesRead = stream.read(allBytes, offset, len);
    if (bytesRead < 0) {
      throw new EOFException("Can't finish byte read from " + stream);
    }
    len -= bytesRead;
    offset += bytesRead;
  }

  return allBytes;
} {code}
As shown above, totalLength is an int that accumulates the byte length of the whole batch. If the total data size exceeds Integer.MAX_VALUE, the accumulation overflows to a negative value, and allocating the byte array throws the following exception:
{code:java}
Caused by: java.lang.NegativeArraySizeException
    at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:1998)
    at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2021)
    at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2119)
    at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:1962)
    at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
    at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
    at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
    at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1371)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextBatch(OrcColumnarBatchReader.java:197)
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.nextKeyValue(OrcColumnarBatchReader.java:99)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:274)
    ... 20 more {code}
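
For illustration only (this is not ORC code, and the row count and row size are made up), the failure mode can be reproduced in isolation: summing per-row byte lengths into an int accumulator wraps to a negative value once the total passes Integer.MAX_VALUE, and the subsequent array allocation fails:
{code:java}
// Standalone illustration of the overflow, not ORC code.
public class OverflowDemo {
  public static void main(String[] args) {
    final int batchSize = 3000;          // hypothetical rows per batch
    final long bytesPerRow = 1L << 20;   // hypothetical 1 MiB string per row

    int totalLength = 0;                 // int accumulator, as in commonReadByteArrays
    for (int i = 0; i < batchSize; i++) {
      totalLength += (int) bytesPerRow;  // 3000 * 1 MiB ≈ 3 GiB, overflows int
    }
    System.out.println(totalLength);     // negative value (-1149239296)

    byte[] allBytes = new byte[totalLength];  // throws NegativeArraySizeException
  }
}
{code}
Checking the accumulated length before the allocation (or failing with a message that points at spark.sql.orc.columnarReaderBatchSize) would surface the problem much more clearly than the bare NegativeArraySizeException.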


