[ https://issues.apache.org/jira/browse/ARROW-11763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290777#comment-17290777 ]
ARF commented on ARROW-11763:
-----------------------------

[~westonpace] I will modify ARROW-11678 to focus on the coercing without warning to user angle.

I am afraid the C++ code of arrow is far beyond my skills. I looked at it a while back and found it positively daunting to even navigate the code.

> [C++] Dict index type ALWAYS gets coerced to int32 when saving to parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-11763
>                 URL: https://issues.apache.org/jira/browse/ARROW-11763
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 3.0.0
>            Reporter: ARF
>            Priority: Major
>
> On saving a pyarrow Dictionary-type column to parquet, any non-int32 index
> gets coerced to an int32 index without warning:
> {code:python}
> import pyarrow as pa
> from pyarrow import parquet as pq
> schema = pa.schema({
>     'foo': pa.dictionary(pa.int8(), pa.string(), ordered=False),
> })
> def make_trivial_dict_array(dict_type, value, size):
>     return
> table = pa.Table.from_pydict({
>     'foo': pa.DictionaryArray.from_arrays(
>         pa.nulls(1, schema.field('foo').type.index_type).fill_null(0),
>         ['bar'])
> })
> pq.write_table(table, 'test_dict_int8.parquet', version='2.0',
>                data_page_version='2.0')
> print(f"dict index type before saving to parquet: {table.schema.field('foo').type.index_type}")
> del table
> table = pq.read_table('test_dict_int8.parquet')
> print(f"dict index type after saving to parquet: {table.schema.field('foo').type.index_type}")
> {code}
> Output:
> {code:java}
> dict index type before saving to parquet: int8
> dict index type after saving to parquet: int32
> {code}
> While this is surprising for smaller index types, coercing an int64 index to
> an int32 index without warning the user seems like asking for trouble.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
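
To illustrate why narrowing an int64 index to int32 without a warning is worrying, here is a minimal sketch in plain Python (deliberately without pyarrow). It does not claim to show what Arrow actually does when an index falls outside the int32 range; it only contrasts unchecked narrowing, where an index can silently end up pointing at a different dictionary entry, with a checked narrowing that raises instead. The helper names `truncate_to_int32` and `checked_narrow_to_int32` are hypothetical and are not part of any Arrow API.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def truncate_to_int32(value):
    """Keep only the low 32 bits of a signed 64-bit integer (unchecked narrowing)."""
    low_bytes = struct.pack("<q", value)[:4]  # little-endian: low bytes come first
    return struct.unpack("<i", low_bytes)[0]


def checked_narrow_to_int32(indices):
    """Return the indices unchanged if they all fit in int32, else raise.

    None stands in for a null index and is passed through untouched.
    """
    for i in indices:
        if i is not None and not INT32_MIN <= i <= INT32_MAX:
            raise OverflowError(f"index {i} does not fit in int32")
    return list(indices)


# Every index fits, so checked narrowing is a no-op.
print(checked_narrow_to_int32([0, 1, 2, None]))

# Unchecked narrowing: index 2**32 + 5 would silently become index 5.
print(truncate_to_int32(2**32 + 5))

# Checked narrowing refuses the same index instead of changing its meaning.
try:
    checked_narrow_to_int32([2**32 + 5])
except OverflowError as exc:
    print(f"refused: {exc}")
```

A guard of this kind, or at least a warning, is the sort of behaviour the report asks for: the caller finds out that the index type changed instead of discovering it after reading the file back.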