GitHub user ala opened a pull request:

    https://github.com/apache/spark/pull/21206

    [SPARK-24133][SQL] Check for integer overflows when resizing WritableColumnVectors

    ## What changes were proposed in this pull request?
    
    `ColumnVector`s store string data in one big byte array. Since the array size is capped at just under `Integer.MAX_VALUE`, a single `ColumnVector` cannot store more than 2GB of string data.
    However, Parquet files commonly contain large blobs stored as strings, and `ColumnVector`s carry 4096 values by default, so it is entirely possible to exceed that limit. In such cases the required capacity overflows and a negative capacity is requested from `WritableColumnVector.reserve()`. The call succeeds (the requested capacity is smaller than the already allocated capacity), and consequently a `java.lang.ArrayIndexOutOfBoundsException` is thrown when the reader actually attempts to put the data into the array.
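    
    The failure mode can be illustrated with a minimal, self-contained sketch. This is not Spark's code: the class `NaiveReserveDemo` and its methods are hypothetical and only mimic a capacity check that compares the requested size against the current one using plain `int` arithmetic.

```java
// Hypothetical illustration, not taken from Spark.
final class NaiveReserveDemo {

    /** Capacity needed to append `length` more bytes; wraps silently on int overflow. */
    static int requiredCapacity(int bytesUsed, int length) {
        return bytesUsed + length;
    }

    /** Naive check: grow only when the request exceeds the current capacity. */
    static boolean needsGrowth(int capacity, int requiredCapacity) {
        return requiredCapacity > capacity;
    }

    public static void main(String[] args) {
        int capacity = 2_000_000_000;   // bytes already allocated
        int bytesUsed = 2_000_000_000;  // bytes already written
        int length = 200_000_000;       // size of the next blob

        int required = requiredCapacity(bytesUsed, length);
        // The sum exceeds Integer.MAX_VALUE, so `required` is negative here.
        System.out.println("required capacity: " + required);
        // A negative request is "smaller" than the current capacity, so no growth
        // happens and the later write runs past the end of the array.
        System.out.println("needs growth: " + needsGrowth(capacity, required));
    }
}
```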
    
    This change introduces a simple check for integer overflow to `WritableColumnVector.reserve()`, which should help catch the error earlier and provide a more informative exception. Additionally, the error message in `WritableColumnVector.throwUnsupportedException()` was corrected, as it previously encouraged users to increase rather than reduce the batch size.
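    
    A rough sketch of what such a guard might look like follows. It is an assumption-laden illustration rather than the patch itself: the class `GuardedBuffer`, the `MAX_CAPACITY` value, the exception type, and the message wording are all hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration, not the actual WritableColumnVector implementation.
final class GuardedBuffer {
    // Assumed cap "just under Integer.MAX_VALUE"; the exact value is a guess.
    static final int MAX_CAPACITY = Integer.MAX_VALUE - 15;

    private byte[] data;

    GuardedBuffer(int initialCapacity) {
        data = new byte[initialCapacity];
    }

    int capacity() {
        return data.length;
    }

    /** Ensures room for `requiredCapacity` bytes, rejecting overflowed requests. */
    void reserve(int requiredCapacity) {
        if (requiredCapacity < 0) {
            // A negative request can only come from an int overflow upstream.
            throw new IllegalArgumentException(
                "Cannot reserve capacity " + requiredCapacity
                    + " (integer overflow); consider reducing the batch size.");
        }
        if (requiredCapacity > data.length) {
            // Double in long arithmetic so the growth computation cannot overflow.
            int newCapacity = (int) Math.min(MAX_CAPACITY, requiredCapacity * 2L);
            if (requiredCapacity > newCapacity) {
                throw new IllegalArgumentException(
                    "Cannot reserve capacity " + requiredCapacity
                        + " above the limit; consider reducing the batch size.");
            }
            data = Arrays.copyOf(data, newCapacity);
        }
    }
}
```

    With a guard like this, the overflowed request fails immediately at the `reserve` call instead of surfacing later as an out-of-bounds write.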
    
    ## How was this patch tested?
    
    New unit tests were added.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ala/spark overflow-reserve

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21206.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21206
    
----
commit d754175e2fb853befd807578c269aabafb311802
Author: Ala Luszczak <ala@...>
Date:   2018-05-01T09:29:31Z

    add check for negative capacity, better error msg, tests

commit 17e2d0270c3edfa9a7fcfd602283eb916b5e8f6a
Author: Ala Luszczak <ala@...>
Date:   2018-05-01T09:35:58Z

    include defaults for reference

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
