[ https://issues.apache.org/jira/browse/ARROW-13781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444484#comment-17444484 ]

Alenka Frim commented on ARROW-13781:
-------------------------------------

I am having some trouble understanding how 'use_dictionary' interacts with the 
dictionary-type 'col_encoding' values "RLE_DICTIONARY" and "PLAIN_DICTIONARY". A 
C++ exception is triggered when either of these encodings is used. I thought 
setting 'use_dictionary' to False would help, but it doesn't.

Running this code on my working branch:
{code:java}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

arr_float = pa.array(list(map(float, range(100))))
arr_int = pa.array(list(map(int, range(100))))
mixed_table = pa.Table.from_arrays([arr_float, arr_int], names=['a', 'b'])

pq.write_table(mixed_table, '/users/alenkafrim/example_col_encoding_1.parquet', 
use_dictionary=False, col_encoding={'a': "RLE_DICTIONARY"}){code}
terminates Python with this error: 
{code:java}
parquet::ParquetException: Can't use dictionary encoding as fallback 
encoding{code}
Does anybody have any idea what I could try?

The code with the failing test can be found on my branch:
https://github.com/AlenkaF/arrow/tree/ARROW-13781
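For comparison, here is a minimal sketch of how the two knobs ended up separated in the released pyarrow API, where the parameter landed as {{column_encoding}}: dictionary encoding is requested per column via {{use_dictionary}}, while the encoding mapping only accepts non-dictionary encodings and requires {{use_dictionary=False}} (consistent with the error above, since the per-column encoding acts as the fallback when dictionary encoding is abandoned, and a dictionary encoding cannot itself be the fallback). File names are illustrative:
{code:java}
import pyarrow as pa
import pyarrow.parquet as pq

arr_float = pa.array(list(map(float, range(100))))
arr_int = pa.array(list(map(int, range(100))))
mixed_table = pa.Table.from_arrays([arr_float, arr_int], names=['a', 'b'])

# Dictionary encoding is toggled per column via use_dictionary,
# not via the per-column encoding mapping:
pq.write_table(mixed_table, 'example_dict.parquet', use_dictionary=['a'])

# Non-dictionary encodings go through the per-column mapping,
# which requires use_dictionary=False:
pq.write_table(mixed_table, 'example_split.parquet',
               use_dictionary=False,
               column_encoding={'a': 'BYTE_STREAM_SPLIT', 'b': 'PLAIN'}){code}
Both files round-trip back to the original table on read.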

> [Python] Allow per column encoding in parquet writer 
> -----------------------------------------------------
>
>                 Key: ARROW-13781
>                 URL: https://issues.apache.org/jira/browse/ARROW-13781
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Brian Kiefer
>            Assignee: Alenka Frim
>            Priority: Minor
>
> Add a new parameter to `write_table` to allow parquet encodings to be defined 
> on a per-column basis. This should supersede use_dictionary and 
> use_byte_stream_split.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
