GitHub user sameeragarwal opened a pull request:
https://github.com/apache/spark/pull/14941
[SPARK-16334] Reusing same dictionary column for decoding consecutive row
groups shouldn't throw an error
## What changes were proposed in this pull request?
This patch fixes a bug in the vectorized Parquet reader caused by reusing the
same dictionary column vector while reading consecutive row groups.
Specifically, the issue manifests for certain distributions of dictionary- and
plain-encoded data when the underlying bit-packed dictionary data is read and
populated into a column-vector-based data structure.
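For illustration, here is a minimal, self-contained sketch of one plausible way such a failure can arise when per-row-group dictionary state is carried over on a reused vector. The classes below (`SimpleColumnVector`, `DictionaryReuseSketch`) are hypothetical stand-ins, not Spark's actual `ColumnVector` API:

```java
// Hypothetical, simplified stand-in for a column vector with an optional
// dictionary; illustrative only, not Spark's actual classes.
class SimpleColumnVector {
    int[] dictionaryIds;  // per-row codes, meaningful only while a dictionary is attached
    int[] plainValues;    // per-row values for plain-encoded data
    int[] dictionary;     // decoded dictionary for the current row group, or null

    SimpleColumnVector(int capacity) {
        dictionaryIds = new int[capacity];
        plainValues = new int[capacity];
    }

    int get(int row) {
        // If a dictionary is attached, reads are resolved through it.
        return dictionary != null ? dictionary[dictionaryIds[row]] : plainValues[row];
    }
}

public class DictionaryReuseSketch {
    public static void main(String[] args) {
        SimpleColumnVector vector = new SimpleColumnVector(3);

        // Row group 1: dictionary-encoded column chunk.
        vector.dictionary = new int[]{10, 20};
        vector.dictionaryIds = new int[]{0, 1, 0};
        System.out.println("row group 1: "
            + vector.get(0) + " " + vector.get(1) + " " + vector.get(2));  // 10 20 10

        // Row group 2: plain-encoded column chunk decoded into the *same*
        // reused vector. If the old dictionary is not detached, reads still
        // go through the stale dictionary ids, yielding wrong values (or an
        // ArrayIndexOutOfBoundsException if an id is out of range).
        vector.plainValues = new int[]{7, 8, 9};
        System.out.println("row group 2 (stale state): "
            + vector.get(0) + " " + vector.get(1) + " " + vector.get(2));  // 10 20 10

        // Resetting per-row-group dictionary state before reuse avoids this.
        vector.dictionary = null;
        System.out.println("row group 2 (reset): "
            + vector.get(0) + " " + vector.get(1) + " " + vector.get(2));  // 7 8 9
    }
}
```

The actual patch addresses this in Spark's vectorized Parquet reader; the sketch above only illustrates the general hazard of carrying per-row-group dictionary state across a reused vector.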
## How was this patch tested?
Manually tested on datasets provided by the community. Thanks to Chris
Perluss and Keith Kraus for their invaluable help in tracking down this issue!
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sameeragarwal/spark parquet-exception-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14941.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14941
commit efda29864506b4a9eb716652e0fcf5cd705c9b4c
Author: Sameer Agarwal
Date: 2016-09-02T19:03:36Z
Reusing dictionary column vectors for reading consecutive row groups
shouldn't throw an error