[jira] [Updated] (ARROW-3564) [Python] writing version 2.0 parquet format with dictionary encoding enabled

Hatem Helal (JIRA) Fri, 19 Oct 2018 06:05:30 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hatem Helal updated ARROW-3564:
-------------------------------
    Description: 
Using pyarrow v0.11.0, the following script writes a simple table (lifted from 
the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both 
parquet format versions 1.0 and 2.0, with and without dictionary encoding 
enabled.
 
 
{code:python}
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import itertools

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
    'two': ['foo', 'bar', 'baz'],
    'three': [True, False, True]},
    index=list('abc'))

table = pa.Table.from_pandas(df)

use_dict = [True, False]
version = ['1.0', '2.0']

for tf, v in itertools.product(use_dict, version):
    filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
    pq.write_table(table, filename, use_dictionary=tf, version=v)
{code}

Inspecting the written files using 
[parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] 
appears to show that dictionary encoding is not used in either of the version 
2.0 files.  Both files report that the columns are encoded using {{PLAIN,RLE}} 
and that the dictionary page offset is zero.  I was expecting that the column 
encoding would include {{RLE_DICTIONARY}}. Attached are the script with repro 
steps and the files that were generated by it.

Below is the output of {{parquet-tools meta}} for the two version 2.0 files:
{panel:title=version='2.0', use_dictionary = True}
{panel}
{noformat}
% parquet-tools meta example_v2.0_dict_True.parquet
file:              file:.../example_v2.0_dict_True.parquet
creator:           parquet-cpp version 1.5.1-SNAPSHOT
extra:             pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema:       schema
--------------------------------------------------------------------------------
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1:       RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{noformat}
{panel:title=version='2.0', use_dictionary = False}
{panel}
{noformat}
% parquet-tools meta example_v2.0_dict_False.parquet
file:              file:.../example_v2.0_dict_False.parquet
creator:           parquet-cpp version 1.5.1-SNAPSHOT
extra:             pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema:       schema
--------------------------------------------------------------------------------
one:               OPTIONAL DOUBLE R:0 D:1
three:             OPTIONAL BOOLEAN R:0 D:1
two:               OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1:       RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one:                DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three:              BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__:  BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{noformat}
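
The check described above (does {{ENC}} include a dictionary encoding, and is the dictionary page offset {{DO}} non-zero?) can be applied mechanically to each column-chunk line of the {{parquet-tools meta}} output. The following is only a minimal sketch: the helper name {{summarize_column}} and the regular expression are illustrative assumptions, not part of parquet-tools or pyarrow, and the pattern is written against the line layout shown in the output above.

```python
import re

# Encodings that indicate a dictionary-encoded column chunk. PLAIN_DICTIONARY
# is the legacy name; RLE_DICTIONARY is the one expected in this report for
# format version 2.0.
DICT_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}

# Hypothetical pattern for one column-chunk line of `parquet-tools meta`
# output: "<name>: <type> <codec> DO:<n> ... ENC:<list> ...".
COLUMN_LINE = re.compile(
    r"^\s*(?P<name>\S+):\s+\S+\s+\S+\s+DO:(?P<dict_offset>\d+)\s+.*?"
    r"ENC:(?P<encodings>[A-Z_,]+)"
)


def summarize_column(line):
    """Return (name, dictionary page offset, encodings, uses_dictionary)."""
    match = COLUMN_LINE.match(line)
    if match is None:
        raise ValueError("not a column-chunk line: %r" % line)
    encodings = match.group("encodings").split(",")
    uses_dictionary = bool(DICT_ENCODINGS.intersection(encodings))
    return (match.group("name"), int(match.group("dict_offset")),
            encodings, uses_dictionary)


# Column "two" from the version 2.0, use_dictionary=True file above.
sample = ("two:                BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 "
          "ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]")
print(summarize_column(sample))
```

For the sample line this reports a dictionary page offset of 0 and no dictionary encoding, which matches the observation in this report.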


> [Python] writing version 2.0 parquet format with dictionary encoding enabled
> ----------------------------------------------------------------------------
>
>                 Key: ARROW-3564
>                 URL: https://issues.apache.org/jira/browse/ARROW-3564
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.11.0
>            Reporter: Hatem Helal
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.13.0
>
>         Attachments: example_v1.0_dict_False.parquet, 
> example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, 
> example_v2.0_dict_True.parquet, pyarrow_repro.py
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
