Joris Van den Bossche created ARROW-8799:
--------------------------------------------

             Summary: [C++][Dataset] Reading list column as nested dictionary segfaults
                 Key: ARROW-8799
                 URL: https://issues.apache.org/jira/browse/ARROW-8799
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++
            Reporter: Joris Van den Bossche

Python example:

{code}
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow.tests import util

repeats = 10
nunique = 5

# One list<string> column: `nunique` distinct single-element lists,
# repeated `repeats` times so the values are dictionary-encodable.
data = [
    [[util.rands(10)] for i in range(nunique)] * repeats,
]
table = pa.table(data, names=['f0'])

pq.write_table(table, "test_dictionary.parquet")
{code}

Reading with the parquet code works:

{code}
>>> pq.read_table("test_dictionary.parquet", read_dictionary=['f0.list.item'])
pyarrow.Table
f0: list<item: dictionary<values=string, indices=int32, ordered=0>>
  child 0, item: dictionary<values=string, indices=int32, ordered=0>
{code}

but doing the same with the datasets API segfaults:

{code}
>>> import pyarrow.dataset as ds
>>> fmt = ds.ParquetFileFormat(read_options=dict(dictionary_columns=["f0.list.item"]))
>>> dataset = ds.dataset("test_dictionary.parquet", format=fmt)
>>> dataset.to_table()
Segmentation fault (core dumped)
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)