GitHub user sameeragarwal opened a pull request:

    https://github.com/apache/spark/pull/14941

    [SPARK-16334] Reusing same dictionary column for decoding consecutive row
    groups shouldn't throw an error

    ## What changes were proposed in this pull request?
    
    This patch fixes a bug in the vectorized Parquet reader that is caused by
    reusing the same dictionary column vector while reading consecutive row
    groups. Specifically, the issue manifests for certain distributions of
    dictionary-encoded and plain-encoded data while we read the underlying
    bit-packed dictionary data and populate a column-vector-based data
    structure with it.
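
The patch itself is not reproduced in this message. As a purely illustrative toy model (not Spark's code, and not necessarily the exact failure fixed here), the sketch below shows one way that reusing a dictionary-id buffer across row groups can go wrong when a later row group mixes dictionary-encoded and plain-encoded values. Every name in it (`ReusableIdBuffer`, `decode_row_group`, `GROUP_1`, `GROUP_2`) is hypothetical.

```python
class ReusableIdBuffer:
    """Hypothetical stand-in for a dictionary-id vector reused across row groups."""

    def __init__(self):
        self.ids = []

    def reserve(self, n):
        # Grows the buffer but keeps existing entries: this is the hazard.
        if len(self.ids) < n:
            self.ids.extend([None] * (n - len(self.ids)))

    def reset(self):
        # Drops all ids left over from the previous row group.
        self.ids = []


def decode_row_group(buf, dictionary, encoded, reset_first):
    """Decode one row group; `encoded` holds ('dict', id) or ('plain', value)."""
    if reset_first:
        buf.reset()
    buf.reserve(len(encoded))
    values = [None] * len(encoded)
    for i, (kind, payload) in enumerate(encoded):
        if kind == "dict":
            buf.ids[i] = payload      # remember the id, decode later
        else:
            values[i] = payload       # plain value, no id written
    # Lazy decode: every slot that holds an id is looked up in the dictionary.
    for i in range(len(encoded)):
        if buf.ids[i] is not None:
            values[i] = dictionary[buf.ids[i]]
    return values


# Assumed sample data: the second row group has a smaller dictionary
# and two plain-encoded values.
GROUP_1 = (["a", "b", "c"], [("dict", 0), ("dict", 2), ("dict", 1)])
GROUP_2 = (["x"], [("dict", 0), ("plain", "y"), ("plain", "z")])
```

In this toy model, decoding `GROUP_2` after `GROUP_1` with `reset_first=False` looks up stale ids from the first row group in the second, smaller dictionary and raises `IndexError`; with `reset_first=True` the plain-encoded slots carry no id and the row group decodes cleanly.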
    
    ## How was this patch tested?
    
    Manually tested on datasets provided by the community. Thanks to Chris
    Perluss and Keith Kraus for their invaluable help in tracking down this issue!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark parquet-exception-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14941.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14941
    
----
commit efda29864506b4a9eb716652e0fcf5cd705c9b4c
Author: Sameer Agarwal <samee...@cs.berkeley.edu>
Date:   2016-09-02T19:03:36Z

    Reusing dictionary column vectors for reading consecutive row groups
    shouldn't throw an error

----


