GitHub user sameeragarwal opened a pull request: https://github.com/apache/spark/pull/14941
[SPARK-16334] Reusing the same dictionary column for decoding consecutive row groups shouldn't throw an error

## What changes were proposed in this pull request?

This patch fixes a bug in the vectorized Parquet reader caused by reusing the same dictionary column vector while reading consecutive row groups. Specifically, the issue manifests for certain distributions of dictionary- and plain-encoded data as the underlying bit-packed dictionary data is read into a column-vector-based data structure.

## How was this patch tested?

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sameeragarwal/spark parquet-exception-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14941.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #14941

----

commit efda29864506b4a9eb716652e0fcf5cd705c9b4c
Author: Sameer Agarwal <samee...@cs.berkeley.edu>
Date: 2016-09-02T19:03:36Z

    Reusing dictionary column vectors for reading consecutive row groups shouldn't throw an error

----
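To illustrate the class of bug described above, here is a minimal, hypothetical Java sketch (not Spark's actual reader code; all class and field names are invented for illustration). A column vector is reused across row groups; if per-row dictionary state from the previous row group is not cleared before the next decode, a stale dictionary id can be looked up against the new row group's (smaller) dictionary and throw.

```java
import java.util.Arrays;

// Hypothetical, simplified model of reusing a column vector across row groups.
public class DictionaryReuseSketch {
    static class ColumnVector {
        int[] dictionaryIds;     // codes read from dictionary-encoded pages
        String[] values;         // values from plain-encoded pages
        String[] dictionary;     // dictionary for the current row group
        boolean[] isDictionary;  // per-row: was this row dictionary-encoded?

        ColumnVector(int capacity) {
            dictionaryIds = new int[capacity];
            values = new String[capacity];
            isDictionary = new boolean[capacity];
        }

        // Buggy reuse: installs the new dictionary but keeps stale per-row state.
        void resetBuggy(String[] dict) { dictionary = dict; }

        // Fixed reuse: clears stale state before decoding the next row group.
        void resetFixed(String[] dict) {
            dictionary = dict;
            Arrays.fill(isDictionary, false);
            Arrays.fill(dictionaryIds, 0);
            Arrays.fill(values, null);
        }

        String get(int row) {
            return isDictionary[row] ? dictionary[dictionaryIds[row]] : values[row];
        }
    }

    public static void main(String[] args) {
        ColumnVector vec = new ColumnVector(4);

        // Row group 1: all four rows dictionary-encoded against {"a", "b"}.
        vec.resetFixed(new String[] {"a", "b"});
        for (int i = 0; i < 4; i++) {
            vec.isDictionary[i] = true;
            vec.dictionaryIds[i] = i % 2;
        }

        // Row group 2: only two rows, plain-encoded, with a one-entry dictionary.
        // Reusing the vector without a full reset leaves rows 2-3 flagged as
        // dictionary-encoded with ids from the PREVIOUS row group.
        vec.resetBuggy(new String[] {"x"});
        vec.values[0] = "p"; vec.isDictionary[0] = false;
        vec.values[1] = "q"; vec.isDictionary[1] = false;

        try {
            vec.get(3);  // stale id 1 looked up in a dictionary of size 1
            System.out.println("no error");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("stale dictionary id caused: "
                + e.getClass().getSimpleName());
        }
    }
}
```

Calling `resetFixed` instead of `resetBuggy` before the second row group avoids the exception, which mirrors the shape of the fix: clear (or re-reserve) per-row dictionary state whenever the vector is reused for a new row group.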