[ 
https://issues.apache.org/jira/browse/PARQUET-1784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17032196#comment-17032196
 ] 

Gabor Szadovszky commented on PARQUET-1784:
-------------------------------------------

[~garawalid],

Thanks for the research and the examples.

Anyone who wants to set a Parquet-specific configuration needs to consult the 
Parquet documentation to learn which key to use and which values are allowed. 
Therefore, I don't think Parquet needs to follow the existing configuration 
conventions of other components.

What I would like to implement here is a common way of setting the 
configuration of individual columns. Consider the following example. We would 
like to set the encoding of some specific columns while leaving the encoding 
of the other columns to be selected automatically. We might configure it the 
following way using lists:
{code:java}
conf.setStrings("parquet.encoding.columns", "float_col", "double_col");
conf.setStrings("parquet.encoding", "byte_stream_split", "byte_stream_split");
{code}
Or, we use the pattern described in this jira:
{code:java}
conf.set("parquet.encoding#float_col", "byte_stream_split");
conf.set("parquet.encoding#double_col", "byte_stream_split");
{code}
I think the latter is cleaner and less error-prone. Moreover, with the list 
approach we are limited to string lists, because that is the only list type 
{{org.apache.hadoop.conf.Configuration}} supports, while in the latter case 
you can use any value type supported by 
{{org.apache.hadoop.conf.Configuration}} in a clean way.
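
To illustrate the kind of lookup I have in mind, here is a minimal sketch. It 
runs without Hadoop: {{ColumnConfigSketch}} is a hypothetical stand-in for 
{{org.apache.hadoop.conf.Configuration}}, and falling back from the suffixed 
key to the plain key is an assumption of this sketch, not necessarily what the 
PR does.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration so that the
// sketch runs without Hadoop on the classpath.
class ColumnConfigSketch {
    private final Map<String, String> props = new HashMap<>();

    void set(String key, String value) {
        props.put(key, value);
    }

    void setBoolean(String key, boolean value) {
        props.put(key, Boolean.toString(value));
    }

    // Assumption: a column-specific entry "key#columnPath" wins; otherwise
    // the plain key is used; otherwise null is returned.
    private String lookup(String key, String columnPath) {
        String value = props.get(key + "#" + columnPath);
        return value != null ? value : props.get(key);
    }

    String get(String key, String columnPath, String defaultValue) {
        String value = lookup(key, columnPath);
        return value != null ? value : defaultValue;
    }

    boolean getBoolean(String key, String columnPath, boolean defaultValue) {
        String value = lookup(key, columnPath);
        return value != null ? Boolean.parseBoolean(value) : defaultValue;
    }

    public static void main(String[] args) {
        ColumnConfigSketch conf = new ColumnConfigSketch();
        // Global default plus a per-column override of a boolean setting.
        conf.setBoolean("parquet.enable.dictionary", true);
        conf.setBoolean("parquet.enable.dictionary#float_col", false);
        // String-valued setting for one column only.
        conf.set("parquet.encoding#float_col", "byte_stream_split");

        System.out.println(conf.getBoolean("parquet.enable.dictionary", "float_col", true));
        System.out.println(conf.getBoolean("parquet.enable.dictionary", "int_col", true));
        System.out.println(conf.get("parquet.encoding", "float_col", "auto"));
        System.out.println(conf.get("parquet.encoding", "int_col", "auto"));
    }
}
{code}
Each column gets its own key, so there are no parallel lists to keep aligned, 
and every value is read with the typed getter that matches it.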

What do you think? If you have time, you may want to check my PR as well.

> Column-wise configuration
> -------------------------
>
>                 Key: PARQUET-1784
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1784
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Gabor Szadovszky
>            Assignee: Gabor Szadovszky
>            Priority: Major
>              Labels: pull-request-available
>
> After adding some new statistics and encodings into Parquet it is getting 
> very hard to be smart and choose the best configs automatically. For example 
> for which columns should we save column index and/or bloom-filters? Is it 
> worth using dictionary for a column that we know will fall back to another 
> encoding?
> The idea of this feature is to allow the library user to fine-tune the 
> configuration by setting it column-wise. To support this we extend the 
> existing configuration keys by a suffix to identify the related column. (From 
> now on we introduce new keys following the same syntax.)
>  \{key of the configuration}{{#}}\{column path in the file schema}
>  For example: {{parquet.enable.dictionary#column.path.col_1}}
> This jira covers the framework to support the column-wise configuration with 
> the implementation of some existing configs where it makes sense (e.g. 
> {{parquet.enable.dictionary}}). Implementing new configuration is not part of 
> this effort.



