[ https://issues.apache.org/jira/browse/ARROW-3564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hatem Helal updated ARROW-3564:
-------------------------------
    Description:
Using pyarrow v0.11.0, the following script writes a simple table (lifted from the [pyarrow doc|https://arrow.apache.org/docs/python/parquet.html]) to both parquet format versions 1.0 and 2.0, with and without dictionary encoding enabled.

{code:python}
import pyarrow.parquet as pq
import numpy as np
import pandas as pd
import pyarrow as pa
import itertools

df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                   'two': ['foo', 'bar', 'baz'],
                   'three': [True, False, True]},
                  index=list('abc'))

table = pa.Table.from_pandas(df)

use_dict = [True, False]
version = ['1.0', '2.0']

for tf, v in itertools.product(use_dict, version):
    filename = 'example_v' + v + '_dict_' + str(tf) + '.parquet'
    pq.write_table(table, filename, use_dictionary=tf, version=v)
{code}

Inspecting the written files using [parquet-tools|https://github.com/apache/parquet-mr/tree/master/parquet-tools] appears to show that dictionary encoding is not used in either of the version 2.0 files. Both files report that the columns are encoded using {{PLAIN,RLE}} and that the dictionary page offset is zero. I was expecting that the column encoding would include {{RLE_DICTIONARY}}. Attached are the script with repro steps and the files that were generated by it.
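The check described above (look at each column's encodings and dictionary page offset) can be sketched as a small parser for one column line of {{parquet-tools meta}} output. This is only an illustration, not part of the attached repro script: the helper name {{parse_column_line}} is hypothetical, {{DO}} is assumed to be the dictionary page offset, and a column is assumed to be dictionary-encoded when its encoding list contains {{PLAIN_DICTIONARY}} or {{RLE_DICTIONARY}}.

```python
import re

# Assumption: these encoding names mark dictionary-encoded data pages.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}

# Matches one column line of `parquet-tools meta`, for example:
# "two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[...]"
COLUMN_LINE = re.compile(
    r"^(?P<name>\S+):\s+(?P<type>\S+)\s+(?P<codec>\S+)\s+"
    r"DO:(?P<dict_offset>\d+)\s+.*?ENC:(?P<encodings>[A-Z_,]+)"
)


def parse_column_line(line):
    """Extract the name, dictionary page offset and encodings of a column line."""
    match = COLUMN_LINE.match(line.strip())
    if match is None:
        raise ValueError("not a parquet-tools column line: %r" % line)
    encodings = match.group("encodings").split(",")
    return {
        "name": match.group("name"),
        # Assumption: DO is the dictionary page offset (0 when absent).
        "dictionary_page_offset": int(match.group("dict_offset")),
        "encodings": encodings,
        "uses_dictionary": bool(DICTIONARY_ENCODINGS & set(encodings)),
    }


if __name__ == "__main__":
    # Column line taken from the version 2.0 output in this report.
    sample = ("two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 "
              "ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]")
    print(parse_column_line(sample))
```

Applied to the column lines reported for the version 2.0 files, {{uses_dictionary}} comes out false and {{dictionary_page_offset}} is 0, which matches the observation in this report.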
Below is the output of using {{parquet-tools meta}} on the version 2.0 files:

{panel:title=version='2.0', use_dictionary = True}
{noformat}
% parquet-tools meta example_v2.0_dict_True.parquet
file: file:.../example_v2.0_dict_True.parquet
creator: parquet-cpp version 1.5.1-SNAPSHOT
extra: pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema: schema
--------------------------------------------------------------------------------
one: OPTIONAL DOUBLE R:0 D:1
three: OPTIONAL BOOLEAN R:0 D:1
two: OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1: RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one: DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three: BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{noformat}
{panel}

{panel:title=version='2.0', use_dictionary = False}
{noformat}
% parquet-tools meta example_v2.0_dict_False.parquet
file: file:.../example_v2.0_dict_False.parquet
creator: parquet-cpp version 1.5.1-SNAPSHOT
extra: pandas = {"pandas_version": "0.23.4", "index_columns": ["__index_level_0__"], "columns": [{"metadata": null, "field_name": "one", "name": "one", "numpy_type": "float64", "pandas_type": "float64"}, {"metadata": null, "field_name": "three", "name": "three", "numpy_type": "bool", "pandas_type": "bool"}, {"metadata": null, "field_name": "two", "name": "two", "numpy_type": "object", "pandas_type": "bytes"}, {"metadata": null, "field_name": "__index_level_0__", "name": null, "numpy_type": "object", "pandas_type": "bytes"}], "column_indexes": [{"metadata": null, "field_name": null, "name": null, "numpy_type": "object", "pandas_type": "bytes"}]}

file schema: schema
--------------------------------------------------------------------------------
one: OPTIONAL DOUBLE R:0 D:1
three: OPTIONAL BOOLEAN R:0 D:1
two: OPTIONAL BINARY R:0 D:1
__index_level_0__: OPTIONAL BINARY R:0 D:1

row group 1: RC:3 TS:211 OFFSET:4
--------------------------------------------------------------------------------
one: DOUBLE SNAPPY DO:0 FPO:4 SZ:65/63/0.97 VC:3 ENC:PLAIN,RLE ST:[min: -1.0, max: 2.5, num_nulls: 1]
three: BOOLEAN SNAPPY DO:0 FPO:142 SZ:36/34/0.94 VC:3 ENC:PLAIN,RLE ST:[min: false, max: true, num_nulls: 0]
two: BINARY SNAPPY DO:0 FPO:225 SZ:60/58/0.97 VC:3 ENC:PLAIN,RLE ST:[min: 0x626172, max: 0x666F6F, num_nulls: 0]
__index_level_0__: BINARY SNAPPY DO:0 FPO:328 SZ:50/48/0.96 VC:3 ENC:PLAIN,RLE ST:[min: 0x61, max: 0x63, num_nulls: 0]
{noformat}
{panel}

> [Python] writing version 2.0 parquet format with dictionary encoding enabled
> ----------------------------------------------------------------------------
>
> Key: ARROW-3564
> URL: https://issues.apache.org/jira/browse/ARROW-3564
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.11.0
> Reporter: Hatem Helal
> Priority: Major
> Labels: parquet
> Fix For: 0.13.0
>
> Attachments: example_v1.0_dict_False.parquet, example_v1.0_dict_True.parquet, example_v2.0_dict_False.parquet, example_v2.0_dict_True.parquet, pyarrow_repro.py

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)