Joris Van den Bossche created ARROW-7545:
--------------------------------------------

Summary: [C++] Scanning dataset with dictionary type hangs
Key: ARROW-7545
URL: https://issues.apache.org/jira/browse/ARROW-7545
Project: Apache Arrow
Issue Type: Bug
Components: C++ - Dataset
Reporter: Joris Van den Bossche


I assume this is an issue on the C++ side of the datasets code, but the reproducer is in Python.

I create a small parquet file with a single column of dictionary type. Reading it with {{pq.read_table}} works fine, but reading it with the datasets machinery hangs when scanning:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
arrow_table = pa.Table.from_pandas(df)

filename = "test.parquet"
pq.write_table(arrow_table, filename)

from pyarrow.fs import LocalFileSystem
from pyarrow.dataset import ParquetFileFormat, Dataset, FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions

filesystem = LocalFileSystem()
format = ParquetFileFormat()
options = FileSystemDiscoveryOptions()

discovery = FileSystemDataSourceDiscovery(
    filesystem, [filename], format, options)
inspected_schema = discovery.inspect()
dataset = Dataset([discovery.finish()], inspected_schema)

# dataset.schema works fine and gives correct schema
dataset.schema

scanner_builder = dataset.new_scan()
scanner = scanner_builder.finish()
# this hangs
scanner.to_table()
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)