[
https://issues.apache.org/jira/browse/ARROW-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444484#comment-17444484
]
Alenka Frim commented on ARROW-13781:
-------------------------------------
I am having some trouble understanding 'use_dictionary' vs 'col_encoding' for
the dictionary encodings "RLE_DICTIONARY" and "PLAIN_DICTIONARY". A C++
exception is triggered when using either of these encodings. I thought setting
'use_dictionary' to False would help, but it doesn't.
Running this code on my working branch:
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
arr_float = pa.array(list(map(float, range(100))))
arr_int = pa.array(list(map(int, range(100))))
mixed_table = pa.Table.from_arrays([arr_float, arr_int], names=['a', 'b'])
pq.write_table(mixed_table, '/users/alenkafrim/example_col_encoding_1.parquet',
use_dictionary=False, col_encoding={'a': "RLE_DICTIONARY"}){code}
it terminates Python and gives this error:
{code:java}
parquet::ParquetException: Can't use dictionary encoding as fallback
encoding{code}
Does anybody have any idea what I could try?
The code with the failing test can be found on my branch:
https://github.com/AlenkaF/arrow/tree/ARROW-13781
> [Python] Allow per column encoding in parquet writer
> -----------------------------------------------------
>
> Key: ARROW-13781
> URL: https://issues.apache.org/jira/browse/ARROW-13781
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Parquet, Python
> Reporter: Brian Kiefer
> Assignee: Alenka Frim
> Priority: Minor
>
> Add a new parameter to `write_table` to allow parquet encodings to be defined
> on a per column basis. This should supersede use_dictionary and
> use_byte_stream_split.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)