[ https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16772080#comment-16772080 ]

Gabor Szadovszky commented on PARQUET-1381:
-------------------------------------------

The design of this feature has conceptual problems and it also works incorrectly 
(see details below). I propose reverting this change from master (and from the 
1.11.0 release). If someone would like to have this feature in a later release, 
feel free to work on it, but I think the design concept might need to be 
changed. I am happy to help with both the design and the implementation.

The current parquet-mr API is designed for writing values in a row-by-row 
manner and is therefore not suitable for this feature, which tries to copy the 
column chunks value-by-value. The current implementation misuses the 
ColumnWriteStore API by writing only one column at a time and then signaling 
the end of records. This leads to writing empty pages and other serious issues 
because the record counts calculated at the different API levels diverge. To 
make this concept work, some internal logic would have to be replicated at the 
page-writing level (closing the page, creating the dictionary, page size 
calculation, etc.), or a completely separate API would have to be created.
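To illustrate the mismatch, here is a minimal sketch of the two write patterns 
against ColumnWriteStore/ColumnWriter (the helper class, column descriptors and 
data arrays are hypothetical, and the repetition/definition levels are 
simplified to 0, as for required, non-nested columns):
{code:java}
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.column.ColumnWriteStore;
import org.apache.parquet.column.ColumnWriter;

public class WritePatterns {

  // Intended usage: every column of a record is written before endRecord(),
  // so all columns observe the same record boundaries.
  static void rowByRow(ColumnWriteStore store,
                       ColumnDescriptor a, ColumnDescriptor b,
                       int[] colA, int[] colB) {
    ColumnWriter writerA = store.getColumnWriter(a);
    ColumnWriter writerB = store.getColumnWriter(b);
    for (int i = 0; i < colA.length; i++) {
      writerA.write(colA[i], 0, 0);
      writerB.write(colB[i], 0, 0);
      store.endRecord(); // one record boundary shared by all columns
    }
  }

  // The pattern the merge feature effectively relies on: one column chunk is
  // drained completely while the other columns receive nothing, so the record
  // counts tracked at the store, page and chunk levels diverge.
  static void columnByColumn(ColumnWriteStore store,
                             ColumnDescriptor a, ColumnDescriptor b,
                             int[] colA, int[] colB) {
    ColumnWriter writerA = store.getColumnWriter(a);
    for (int v : colA) {
      writerA.write(v, 0, 0);
      store.endRecord(); // "records" containing only column a
    }
    ColumnWriter writerB = store.getColumnWriter(b);
    for (int v : colB) {
      writerB.write(v, 0, 0);
      store.endRecord(); // column b now lags a full chunk behind column a
    }
  }
}
{code}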

> Add merge blocks command to parquet-tools
> -----------------------------------------
>
>                 Key: PARQUET-1381
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1381
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>    Affects Versions: 1.10.0
>            Reporter: Ekaterina Galieva
>            Assignee: Ekaterina Galieva
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> The current implementation of the merge command in parquet-tools doesn't 
> merge row groups; it just places one after another. Add an API and a command 
> option to be able to merge small blocks into larger ones up to a specified 
> size limit.
> h6. Implementation details:
> Blocks are not reordered, so that possible initial predicate pushdown 
> optimizations are not broken.
> Blocks are not divided to fit the upper bound perfectly. 
> This is an intentional performance optimization: 
> it gives the opportunity to form new blocks by copying the full content of 
> smaller blocks column by column, not row by row (see the sketch after the 
> examples below).
> h6. Examples:
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [163 | 168 | 120]
> {code}
>  # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6] {code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6] 
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126] 
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}
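For reference, the expected sizes in the examples above are consistent with a 
simple greedy pass over the block sizes: consecutive blocks are accumulated 
until adding the next one would exceed the limit, and blocks are never split or 
reordered. A hypothetical sketch of that size arithmetic only (the real command 
must also copy the column-chunk contents, which is where the problems described 
in the comment arise):
{code:java}
import java.util.ArrayList;
import java.util.List;

public class BlockGrouping {

  // Greedily packs consecutive block sizes into groups whose total stays
  // within the limit; blocks are never split or reordered.
  static List<Long> mergedSizes(long[] blockSizes, long limit) {
    List<Long> merged = new ArrayList<>();
    long current = 0;
    for (long size : blockSizes) {
      if (current > 0 && current + size > limit) {
        merged.add(current); // close the current group
        current = 0;
      }
      current += size;
    }
    if (current > 0) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    // First example with -l 256: 128+35=163, 128+40=168, then 120 alone
    System.out.println(mergedSizes(new long[] {128, 35, 128, 40, 120}, 256));
    // prints [163, 168, 120]

    // Second example with -l 256: 128+35+40=203, then 120+6=126
    System.out.println(mergedSizes(new long[] {128, 35, 40, 120, 6}, 256));
    // prints [203, 126]
  }
}
{code}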


