[ 
https://issues.apache.org/jira/browse/ARROW-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved ARROW-3564.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 3331
[https://github.com/apache/arrow/pull/3331]

> [Python] writing version 2.0 parquet format with dictionary encoding enabled
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-3564
>                 URL: https://issues.apache.org/jira/browse/ARROW-3564
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.11.0
>            Reporter: Hatem Helal
>            Assignee: Hatem Helal
>            Priority: Major
>              Labels: parquet, pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: example_v1.0_dict_False.parquet, 
> example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, 
> example_v2.0_dict_True.parquet, pyarrow_repro.py
>
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> Using pyarrow v0.11.0, the attached script writes a simple table (lifted from 
> the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both 
> parquet format versions 1.0 and 2.0, with and without dictionary encoding 
> enabled.
> Inspecting the written files using 
> [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools]
>  appears to show that dictionary encoding is not used in either of the 
> version 2.0 files.  Both files report that the columns are encoded using 
> {{PLAIN,RLE}} and that the dictionary page offset is zero.  I was expecting 
> that the column encoding would include {{RLE_DICTIONARY}}. Attached are the 
> script with repro steps and the files that were generated by it.
> Below is the output of {{parquet-tools meta}} for the version 2.0 files:
> {panel:title=version='2.0', use_dictionary = True}
> {panel}
> |{{% parquet-tools meta example_v2.0_dict_True.parquet}}
>  {{file:              file:.../example_v2.0_dict_True.parquet}}
>  {{creator:           parquet-cpp version 1.5.1-SNAPSHOT}}
>   
>  {{file schema:       schema}}
>  
> {{--------------------------------------------------------------------------------}}
>  {{one:               OPTIONAL DOUBLE R:0 D:1}}
>  {{three:             OPTIONAL BOOLEAN R:0 D:1}}
>  {{two:               OPTIONAL BINARY R:0 D:1}}
>  {{__index_level_0__: OPTIONAL BINARY R:0 D:1}}
>   
>  {{row group 1:       RC:3 TS:211 OFFSET:4}}
>  
> {{--------------------------------------------------------------------------------}}
>  {{one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 
> ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]}}
>  {{three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 
> ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]}}
>  {{two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 
> ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]}}
>  {{__index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 
> ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]}}|
> {panel:title=version='2.0', use_dictionary = False}
> {panel}
> |{{% parquet-tools meta example_v2.0_dict_False.parquet}}
>  {{file:              file:.../example_v2.0_dict_False.parquet}}
>  {{creator:           parquet-cpp version 1.5.1-SNAPSHOT}}
>   
>  {{file schema:       schema}}
>  
> {{--------------------------------------------------------------------------------}}
>  {{one:               OPTIONAL DOUBLE R:0 D:1}}
>  {{three:             OPTIONAL BOOLEAN R:0 D:1}}
>  {{two:               OPTIONAL BINARY R:0 D:1}}
>  {{__index_level_0__: OPTIONAL BINARY R:0 D:1}}
>   
>  {{row group 1:       RC:3 TS:211 OFFSET:4}}
>  
> {{--------------------------------------------------------------------------------}}
>  {{one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 
> ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]}}
>  {{three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 
> ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]}}
>  {{two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 
> ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]}}
>  {{__index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 
> ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]}}|
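
The check described in the report (no dictionary encoding listed under ENC, and a dictionary page offset of zero) can be automated over the column-chunk lines of the parquet-tools meta output. The following is only a rough sketch and is not part of the attached pyarrow_repro.py: the helper name parse_column_line and the regular expression are assumptions made for illustration, the line layout is inferred from the output quoted above, and each column entry is assumed to have been re-joined onto a single line (the email wraps them before ENC).

```python
import re

# Layout of one column-chunk line, inferred from the quoted output, e.g.
# "two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[...]"
COLUMN_LINE = re.compile(
    r"^\s*(?P<name>\S+):\s+(?P<type>\S+)\s+(?P<codec>\S+)\s+"
    r"DO:(?P<dict_offset>\d+)\s+FPO:(?P<first_page_offset>\d+)\s+"
    r".*?ENC:(?P<encodings>[A-Z_,]+)"
)

# Parquet encodings that indicate dictionary-encoded data pages.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def parse_column_line(line):
    """Hypothetical helper: summarize one column-chunk line.

    Returns None for lines that are not column-chunk entries
    (schema lines, the row group header, separators).
    """
    match = COLUMN_LINE.match(line)
    if match is None:
        return None
    encodings = match.group("encodings").split(",")
    return {
        "name": match.group("name"),
        "dict_offset": int(match.group("dict_offset")),
        "encodings": encodings,
        "uses_dictionary": bool(DICTIONARY_ENCODINGS & set(encodings)),
    }


# Column "two" from example_v2.0_dict_True.parquet, joined onto one line.
reported = (
    "two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 "
    "ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]"
)
print(parse_column_line(reported))
```

For the lines quoted in this issue, uses_dictionary comes out false and dict_offset is 0 for every column, which matches the behaviour reported here.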



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
