[ 
https://issues.apache.org/jira/browse/IMPALA-6054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Csaba Ringhofer resolved IMPALA-6054.
-------------------------------------
       Resolution: Done
    Fix Version/s: Impala 2.11.0

IMPALA-6054: Parquet dictionary pages should be freed on dictionary construction

During dictionary construction, most types are copied from the Parquet
dictionary page, but StringValues keep pointers into it. In this case,
the dictionary page must be kept and attached to the last row batch
that references it. For other types, it is safe to delete
the dictionary page once the dictionary has been constructed.

This patch contains two optimizations:
- dictionary pages are deleted as soon as possible for non-string types
- in the non-compressed, non-string case, an unnecessary copy is avoided
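
The difference between copied and pointer-holding dictionaries can be
illustrated with a minimal sketch. This is not Impala's code: StringRef,
DecodeInt32Dictionary and DecodeStringDictionary are hypothetical names, and
the sketch assumes a little-endian host and PLAIN-encoded pages (fixed-width
values back to back; strings as a 4-byte length followed by the bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical stand-in for a StringValue: a non-owning pointer plus a length.
struct StringRef {
  const char* ptr;
  int32_t len;
};

// Fixed-width values are copied out of the page, so the returned dictionary
// owns its data and the page may be freed right after this call.
std::vector<int32_t> DecodeInt32Dictionary(const uint8_t* page, size_t num_values) {
  assert(page != nullptr || num_values == 0);
  std::vector<int32_t> dict(num_values);
  if (num_values > 0) {
    std::memcpy(dict.data(), page, num_values * sizeof(int32_t));
  }
  return dict;
}

// String entries only record where the bytes live inside the page, so the
// page must outlive every use of the returned dictionary.
std::vector<StringRef> DecodeStringDictionary(const uint8_t* page, size_t page_len) {
  std::vector<StringRef> dict;
  size_t pos = 0;
  while (pos + sizeof(int32_t) <= page_len) {
    int32_t len;
    std::memcpy(&len, page + pos, sizeof(len));
    pos += sizeof(len);
    // Stop on a malformed length instead of reading past the page.
    if (len < 0 || pos + static_cast<size_t>(len) > page_len) break;
    dict.push_back({reinterpret_cast<const char*>(page + pos), len});
    pos += static_cast<size_t>(len);
  }
  return dict;
}
```

With these helpers, a caller could release the page buffer immediately after
DecodeInt32Dictionary() returns, while a buffer passed to
DecodeStringDictionary() has to stay alive as long as the entries are read.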

Change-Id: I4d9d5f4da1028d961155dafdac0028a1c3641004
Reviewed-on: http://gerrit.cloudera.org:8080/8436
Reviewed-by: Tim Armstrong <tarmstr...@cloudera.com>
Tested-by: Impala Public Jenkins

> Parquet dictionary pages should be freed on dictionary construction
> -------------------------------------------------------------------
>
>                 Key: IMPALA-6054
>                 URL: https://issues.apache.org/jira/browse/IMPALA-6054
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>    Affects Versions: Impala 2.10.0
>            Reporter: Joe McDonnell
>            Assignee: Csaba Ringhofer
>            Priority: Minor
>              Labels: resource-management
>             Fix For: Impala 2.11.0
>
>
> The Parquet scanner uses the dictionary_pool_ to allocate memory for the 
> dictionary page (see BaseScalarColumnReader::InitDictionary()). This 
> dictionary page is used to initialize the dictionary in 
> CreateDictionaryDecoder(). The resulting dictionary is a vector of values. 
> For some datatypes, such as strings, the resulting dictionary has an array of 
> StringValues that contain pointers into the dictionary page (see the 
> StringValue specialization in ParquetPlainEncoder::Decode()). In this case, 
> the dictionary page must be kept and attached to the last row batch that 
> references it. However, for other datatypes, the values are copied into the 
> dictionary and the dictionary page is no longer needed after the dictionary 
> is constructed.
> Currently, these dictionary pages remain in the dictionary_pool_ and are 
> attached to the last row batch to be passed to other ExecNodes (see 
> FlushRowGroupResources()). This should only pass StringValue dictionary pages 
> (or other types that point to data in the page) on the row batch. The other 
> types should be freed immediately once the dictionary has been constructed.
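
The requested ownership behavior can be sketched as follows. This is a
simplified model under stated assumptions, not the Impala implementation:
PoolSketch, ReaderSketch, OnDictionaryConstructed and FlushResourcesSketch
are hypothetical names standing in for the memory pool, the column reader and
the flush step described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified memory pool: it just owns a list of buffers.
struct PoolSketch {
  std::vector<std::unique_ptr<uint8_t[]>> buffers;

  uint8_t* Allocate(size_t size) {
    buffers.push_back(std::make_unique<uint8_t[]>(size));
    return buffers.back().get();
  }

  void FreeAll() { buffers.clear(); }

  // Hands every buffer over to dst, leaving this pool empty.
  void TransferTo(PoolSketch* dst) {
    assert(dst != nullptr);
    for (auto& b : buffers) dst->buffers.push_back(std::move(b));
    buffers.clear();
  }
};

// Hypothetical column reader holding its dictionary page in its own pool.
struct ReaderSketch {
  bool dict_points_into_page = false;  // true for string-like types
  PoolSketch dictionary_pool;

  // Called once the dictionary has been built from the page.
  void OnDictionaryConstructed() {
    // Copied dictionaries no longer need the page, so free it right away.
    if (!dict_points_into_page) dictionary_pool.FreeAll();
  }
};

// At the end of a row group, whatever is still held (only pages that the
// dictionary points into) is attached to the outgoing row batch's pool.
void FlushResourcesSketch(std::vector<ReaderSketch>& readers,
                          PoolSketch* row_batch_pool) {
  for (auto& r : readers) r.dictionary_pool.TransferTo(row_batch_pool);
}
```

In this model only the string reader's page travels with the row batch; the
other readers release their pages before any batch is produced.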



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
