[jira] [Commented] (PARQUET-1826) Document hadoop configuration options

Gabor Szadovszky (Jira) Wed, 01 Apr 2020 02:36:38 -0700


    [ 
https://issues.apache.org/jira/browse/PARQUET-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072587#comment-17072587
 ]


Gabor Szadovszky commented on PARQUET-1826:
-------------------------------------------

I was not able to find any proper documentation about the parquet writer 
version either. Let me summarize here what it means and how parquet-mr 
handles it.

PARQUET_2_0 means that data pages are written with {{DataPageHeaderV2}} 
(instead of {{DataPageHeader}}). The main difference is that _v2_ pages store 
the levels uncompressed (while _v1_ pages compress the levels together with 
the data). Also, the _v2_ page header has no field for the encoding of the 
levels, but this does not really matter because we always use (at least in 
parquet-mr) the [Run Length Encoding / Bit-Packing 
Hybrid|https://github.com/apache/parquet-format/blob/master/Encodings.md#run-length-encoding--bit-packing-hybrid-rle--3].
 See the header definitions with some more comments in 
[parquet.thrift|https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift].
 Although I did not find it documented anywhere, there are other differences 
between _v1_ and _v2_ in the parquet-mr implementation: the default encodings 
of the primitive types are different. See the differences between 
[DefaultV1ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV1ValuesWriterFactory.java]
 and 
[DefaultV2ValuesWriterFactory|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/values/factory/DefaultV2ValuesWriterFactory.java].
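
To make the page-layout difference above more concrete, here is a minimal sketch in plain Java. It is not the parquet-mr implementation: the class {{PageLayoutSketch}} and its methods are hypothetical names, DEFLATE only stands in for whatever page codec is configured, and the level and value bytes are assumed to be already encoded. It only shows which bytes go through the compressor in each case.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

// Hypothetical illustration only; none of these names exist in parquet-mr.
final class PageLayoutSketch {

    // Compress a byte array with DEFLATE (assumed stand-in for the page codec).
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[256];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Concatenate byte arrays in the given order.
    static byte[] concat(byte[]... parts) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : parts) {
            out.write(part, 0, part.length);
        }
        return out.toByteArray();
    }

    // v1-style body: levels and values are compressed together as one block.
    static byte[] v1Body(byte[] repLevels, byte[] defLevels, byte[] values) {
        return compress(concat(repLevels, defLevels, values));
    }

    // v2-style body: levels are stored as-is, only the values are compressed.
    static byte[] v2Body(byte[] repLevels, byte[] defLevels, byte[] values) {
        return concat(repLevels, defLevels, compress(values));
    }

    public static void main(String[] args) {
        byte[] rep = {0, 0, 1, 1};       // dummy, already-encoded repetition levels
        byte[] def = {1, 1, 0, 1};       // dummy, already-encoded definition levels
        byte[] values = new byte[1024];  // dummy, highly compressible values
        System.out.println("v1 body bytes: " + v1Body(rep, def, values).length);
        System.out.println("v2 body bytes: " + v2Body(rep, def, values).length);
    }
}
```

With the _v2_ layout a reader can inspect the level bytes without decompressing the page body, which is why the level bytes appear verbatim at the start of the result of {{v2Body}}.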

> Document hadoop configuration options
> -------------------------------------
>
>                 Key: PARQUET-1826
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1826
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Gabor Szadovszky
>            Assignee: Walid Gara
>            Priority: Major
>
> The currently available hadoop configuration options are not documented 
> properly. The only documentation we have is the javadoc comment and the 
> implementation of 
> [ParquetOutputFormat|https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java].
> We shall investigate all the possible options and their usage/default values 
> and document them properly in a way that is easily accessible to our users.
> I would suggest creating a `README.md` file in the sub-module 
> [parquet-hadoop|https://github.com/apache/parquet-mr/tree/master/parquet-hadoop]
>  that would describe the purpose of the module and would have a section that 
> lists the possible hadoop configuration options. (Later on we shall extend 
> this document with other descriptions about the purpose and usage of our 
> library in the hadoop ecosystem. These efforts shall be covered by other 
> jiras.)
> By adding the description to the source code it would be easy to extend it 
> with the new features we implement, so it stays up to date for every release. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
