Ryan Blue created PARQUET-62:
--------------------------------
Summary: DictionaryValuesWriter dictionaries are corrupted by user
changes.
Key: PARQUET-62
URL: https://issues.apache.org/jira/browse/PARQUET-62
Project: Parquet
Issue Type: Bug
Components: parquet-mr
Reporter: Ryan Blue
Assignee: Ryan Blue
Priority: Blocker
DictionaryValuesWriter passes incoming Binary objects directly to an
Object2IntMap to accumulate dictionary values. If the caller later reuses the
arrays backing those Binary objects, the stored dictionary entries change with
them, and the corrupted values are still written without an error.
Because Hadoop reuses objects passed to mappers and reducers, this can happen
easily. For example, Avro reuses the byte arrays backing Utf8 objects, and
parquet-avro wraps those arrays in a Binary object that it passes to writeBytes.
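The failure mode can be sketched in isolation. The example below does not use the parquet-mr classes: BytesKey is a hypothetical stand-in for a Binary that wraps a byte array without copying it, and a java.util.LinkedHashMap stands in for the Object2IntMap. Both are assumptions made only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the failure mode; these are not the parquet-mr classes.
class ReusedBufferDemo {

  // Wraps a byte[] without copying it; equals/hashCode follow the current contents.
  static final class BytesKey {
    final byte[] bytes;

    BytesKey(byte[] bytes) {
      this.bytes = bytes;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
      return new String(bytes, StandardCharsets.UTF_8);
    }
  }

  // Assigns the next id to a value the dictionary has not seen yet.
  // The key is stored as-is, so it still shares the caller's array.
  static void write(Map<BytesKey, Integer> dict, BytesKey value) {
    if (!dict.containsKey(value)) {
      dict.put(value, dict.size());
    }
  }

  // Writes "foo", lets the caller overwrite the same buffer with "bar",
  // writes again, and returns the dictionary contents in insertion order.
  static List<String> run() {
    Map<BytesKey, Integer> dict = new LinkedHashMap<>();
    byte[] buffer = "foo".getBytes(StandardCharsets.UTF_8);
    write(dict, new BytesKey(buffer));

    // The caller reuses the buffer for the next record.
    System.arraycopy("bar".getBytes(StandardCharsets.UTF_8), 0, buffer, 0, 3);
    write(dict, new BytesKey(buffer));

    List<String> contents = new ArrayList<>();
    for (BytesKey key : dict.keySet()) {
      contents.add(key.toString());
    }
    return contents;
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints [bar, bar]
  }
}
```

Both entries end up reading "bar": the first key was inserted when the buffer held "foo", but it still points at the caller's array, so "foo" is lost and nothing signals the problem.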
The fix is to make defensive copies of the values passed to the dictionary
writer code. I think this only affects the Binary dictionary classes because
the other value types (Strings, floats, longs, etc.) are immutable.
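A minimal sketch of the defensive-copy idea follows. It is not the actual parquet-mr patch: BytesKey, its copy() method, and the LinkedHashMap used in place of the Object2IntMap are hypothetical stand-ins chosen to keep the example self-contained.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a defensive copy; these are not the parquet-mr classes.
class DefensiveCopyDemo {

  // Wraps a byte[] without copying it; copy() returns a key with its own array.
  static final class BytesKey {
    final byte[] bytes;

    BytesKey(byte[] bytes) {
      this.bytes = bytes;
    }

    BytesKey copy() {
      return new BytesKey(Arrays.copyOf(bytes, bytes.length));
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof BytesKey && Arrays.equals(bytes, ((BytesKey) o).bytes);
    }

    @Override
    public int hashCode() {
      return Arrays.hashCode(bytes);
    }

    @Override
    public String toString() {
      return new String(bytes, StandardCharsets.UTF_8);
    }
  }

  // Looks the value up first and copies it only when a new entry is added,
  // so the stored key never shares an array with the caller.
  static void write(Map<BytesKey, Integer> dict, BytesKey value) {
    if (!dict.containsKey(value)) {
      dict.put(value.copy(), dict.size());
    }
  }

  // Overwrites the caller's buffer in place with a 3-byte value.
  static void refill(byte[] buffer, String value) {
    System.arraycopy(value.getBytes(StandardCharsets.UTF_8), 0, buffer, 0, 3);
  }

  // Writes "foo", "bar", "foo" through one reused buffer and returns the
  // dictionary contents in insertion order.
  static List<String> run() {
    Map<BytesKey, Integer> dict = new LinkedHashMap<>();
    byte[] buffer = "foo".getBytes(StandardCharsets.UTF_8);
    write(dict, new BytesKey(buffer));

    refill(buffer, "bar");
    write(dict, new BytesKey(buffer));

    refill(buffer, "foo");
    write(dict, new BytesKey(buffer)); // already present, no new entry

    List<String> contents = new ArrayList<>();
    for (BytesKey key : dict.keySet()) {
      contents.add(key.toString());
    }
    return contents;
  }

  public static void main(String[] args) {
    System.out.println(run()); // prints [foo, bar]
  }
}
```

Copying only when a value is first added keeps the allocation cost proportional to the number of distinct dictionary entries rather than to the number of values written.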