Repository: spark
Updated Branches:
  refs/heads/master 8dbf56c05 -> 8bd27025b

[SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors

## What changes were proposed in this pull request?

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under Integer.MAX_VALUE, a single `ColumnVector` cannot store more than 2GB of string data. Since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it is entirely possible to go past that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.

This change introduces a simple integer-overflow check in `WritableColumnVector.reserve()`, which catches the error earlier and produces a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected, as it previously encouraged users to increase rather than reduce the batch size.

## How was this patch tested?

New unit tests were added.

Author: Ala Luszczak <a...@databricks.com>

Closes #21206 from ala/overflow-reserve.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/8bd27025
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/8bd27025
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/8bd27025

Branch: refs/heads/master
Commit: 8bd27025b7cf0b44726b6f4020d294ef14dbbb7e
Parents: 8dbf56c
Author: Ala Luszczak <a...@databricks.com>
Authored: Wed May 2 12:43:19 2018 -0700
Committer: gatorsmile <gatorsm...@gmail.com>
Committed: Wed May 2 12:43:19 2018 -0700

----------------------------------------------------------------------
 .../vectorized/WritableColumnVector.java       | 21 ++++++++++++--------
 .../vectorized/ColumnarBatchSuite.scala        |  7 +++++++
 2 files changed, 20 insertions(+), 8 deletions(-)
----------------------------------------------------------------------
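
For context, the failure mode described above is plain int arithmetic: the byte-array capacity requested while appending string data to a `WritableColumnVector` is tracked as a Java int, so once the running total passes Integer.MAX_VALUE it wraps around to a negative value. A minimal Scala sketch of that arithmetic (the blob size is made up for illustration and is not taken from the patch):

object CapacityOverflowSketch {
  def main(args: Array[String]): Unit = {
    val batchSize = 4096                // default number of rows per ColumnarBatch
    val blobBytes = 600 * 1024 * 1024   // hypothetical ~600MB string blob per row
    var requestedCapacity = 0           // int, like the capacity passed to reserve()
    var rows = 0
    while (rows < batchSize && requestedCapacity >= 0) {
      requestedCapacity += blobBytes    // int addition wraps once it exceeds Int.MaxValue
      rows += 1
    }
    // Prints: "after 4 rows the requested capacity is -1778384896"
    println(s"after $rows rows the requested capacity is $requestedCapacity")
    // Before this patch, reserve(<negative>) returned without resizing and the reader
    // later failed with ArrayIndexOutOfBoundsException; with the patch, reserve() fails
    // fast with the "integer overflow" message shown in the diff below.
  }
}

The new test at the bottom of this commit exercises the same condition directly by calling reserve(-1).
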
http://git-wip-us.apache.org/repos/asf/spark/blob/8bd27025/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
----------------------------------------------------------------------
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
index 5275e4a..b0e119d 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java
@@ -81,7 +81,9 @@ public abstract class WritableColumnVector extends ColumnVector {
   }
 
   public void reserve(int requiredCapacity) {
-    if (requiredCapacity > capacity) {
+    if (requiredCapacity < 0) {
+      throwUnsupportedException(requiredCapacity, null);
+    } else if (requiredCapacity > capacity) {
       int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
       if (requiredCapacity <= newCapacity) {
         try {
@@ -96,13 +98,16 @@ public abstract class WritableColumnVector extends ColumnVector {
   }
 
   private void throwUnsupportedException(int requiredCapacity, Throwable cause) {
-    String message = "Cannot reserve additional contiguous bytes in the vectorized reader " +
-      "(requested = " + requiredCapacity + " bytes). As a workaround, you can disable the " +
-      "vectorized reader, or increase the vectorized reader batch size. For parquet file " +
-      "format, refer to " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + " and " +
-      SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().key() + "; for orc file format, refer to " +
-      SQLConf.ORC_VECTORIZED_READER_ENABLED().key() + " and " +
-      SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().key() + ".";
+    String message = "Cannot reserve additional contiguous bytes in the vectorized reader (" +
+      (requiredCapacity >= 0 ? "requested " + requiredCapacity + " bytes" : "integer overflow") +
+      "). As a workaround, you can reduce the vectorized reader batch size, or disable the " +
+      "vectorized reader. For parquet file format, refer to " +
+      SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().key() +
+      " (default " + SQLConf.PARQUET_VECTORIZED_READER_BATCH_SIZE().defaultValueString() +
+      ") and " + SQLConf.PARQUET_VECTORIZED_READER_ENABLED().key() + "; for orc file format, " +
+      "refer to " + SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().key() +
+      " (default " + SQLConf.ORC_VECTORIZED_READER_BATCH_SIZE().defaultValueString() +
+      ") and " + SQLConf.ORC_VECTORIZED_READER_ENABLED().key() + ".";
     throw new RuntimeException(message, cause);
   }
 

http://git-wip-us.apache.org/repos/asf/spark/blob/8bd27025/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
----------------------------------------------------------------------
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
index 772f687..f57f07b 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala
@@ -1333,4 +1333,11 @@ class ColumnarBatchSuite extends SparkFunSuite {
 
     column.close()
   }
+
+  testVector("WritableColumnVector.reserve(): requested capacity is negative", 1024, ByteType) {
+    column =>
+      val ex = intercept[RuntimeException] { column.reserve(-1) }
+      assert(ex.getMessage.contains(
+        "Cannot reserve additional contiguous bytes in the vectorized reader (integer overflow)"))
+  }
 }
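
The corrected message now points users toward reducing the batch size (or disabling the vectorized reader) rather than increasing it. A hedged sketch of applying that workaround in a Spark application, using the Parquet config keys the message prints via SQLConf (key names as defined in SQLConf at the time of this commit; the batch size and input path below are illustrative):

import org.apache.spark.sql.SparkSession

object BatchSizeWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-size-workaround")
      .master("local[*]")  // for a standalone local run
      .getOrCreate()

    // Default is 4096 rows per batch; with multi-megabyte string blobs the string bytes
    // of a single ColumnVector can exceed 2GB, so shrink the batch.
    spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "256")
    // Alternatively, fall back to the non-vectorized Parquet reader:
    // spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")

    val df = spark.read.parquet("/path/to/table-with-large-blobs")  // hypothetical path
    println(df.count())

    spark.stop()
  }
}
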