GitHub user ala opened a pull request: https://github.com/apache/spark/pull/21206
[SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors

## What changes were proposed in this pull request?

`ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data. But since Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s by default carry 4096 values, it's entirely possible to exceed that limit. In such cases a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently a `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.

This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()`, which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected, as it previously encouraged users to increase rather than reduce the batch size.

## How was this patch tested?

New unit tests were added.
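The failure mode and the guard can be sketched as follows. This is a minimal illustration, not Spark's actual implementation: the class name `ReserveCheck`, the field names, and the exception message are simplified assumptions; only the idea of rejecting a negative requested capacity in `reserve()` comes from the PR description above.

```java
// Hypothetical sketch of the overflow guard described in the PR. The real
// logic lives in Spark's WritableColumnVector; names here are assumptions.
public class ReserveCheck {
  private int capacity = 4096; // default batch size, for reference

  public void reserve(int requiredCapacity) {
    if (requiredCapacity < 0) {
      // The requested capacity wrapped past Integer.MAX_VALUE: fail fast
      // with a clear message instead of letting the reader hit a later
      // java.lang.ArrayIndexOutOfBoundsException.
      throw new RuntimeException(
          "Cannot reserve " + requiredCapacity + " bytes: requested capacity "
          + "overflowed int. Consider reducing the batch size.");
    } else if (requiredCapacity > capacity) {
      // Normal growth path.
      capacity = Math.max(requiredCapacity, capacity * 2);
    }
    // Without the check above, a negative requiredCapacity compares as
    // "smaller than the already allocated capacity", so the call would
    // silently succeed and the error would surface much later.
  }

  public static void main(String[] args) {
    ReserveCheck v = new ReserveCheck();
    v.reserve(8192); // ordinary resize succeeds
    try {
      // Simulate string data exceeding 2GB: the running byte count wraps
      // to a negative int (Integer.MAX_VALUE + 1 == Integer.MIN_VALUE).
      v.reserve(Integer.MAX_VALUE + 1);
    } catch (RuntimeException e) {
      System.out.println("caught: " + e.getMessage());
    }
  }
}
```

Catching the wrap inside `reserve()` matters because that is the single choke point every append goes through, so the check turns a confusing out-of-bounds error deep in the reader into an immediate, actionable message.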
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ala/spark overflow-reserve

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21206

----

commit d754175e2fb853befd807578c269aabafb311802
Author: Ala Luszczak <ala@...>
Date:   2018-05-01T09:29:31Z

    add check for negative capacity, better error msg, tests

commit 17e2d0270c3edfa9a7fcfd602283eb916b5e8f6a
Author: Ala Luszczak <ala@...>
Date:   2018-05-01T09:35:58Z

    include defaults for reference

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org