Joris Van den Bossche created ARROW-7545:
--------------------------------------------

             Summary: [C++] Scanning dataset with dictionary type hangs
                 Key: ARROW-7545
                 URL: https://issues.apache.org/jira/browse/ARROW-7545
             Project: Apache Arrow
          Issue Type: Bug
          Components: C++ - Dataset
            Reporter: Joris Van den Bossche


I assume this is an issue on the C++ side of the datasets code, but the 
reproducer is in Python. 

I create a small Parquet file with a single column of dictionary type. Reading 
it with {{pq.read_table}} works fine, but reading it with the datasets 
machinery hangs when scanning:

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
arrow_table = pa.Table.from_pandas(df)

filename = "test.parquet"
pq.write_table(arrow_table, filename)

from pyarrow.fs import LocalFileSystem
from pyarrow.dataset import (
    ParquetFileFormat, Dataset,
    FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions)

filesystem = LocalFileSystem()
format = ParquetFileFormat()
options = FileSystemDiscoveryOptions()

discovery = FileSystemDataSourceDiscovery(
        filesystem, [filename], format, options)
inspected_schema = discovery.inspect()
dataset = Dataset([discovery.finish()], inspected_schema)

# dataset.schema works fine and gives correct schema
dataset.schema

scanner_builder = dataset.new_scan()
scanner = scanner_builder.finish()
# this hangs
scanner.to_table()
{code}
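
To check for the hang without blocking an interactive session, the reproducer can be run in a child process with a timeout. This is only a minimal sketch using the standard library: the helper name {{finishes_within}} and the script name {{repro.py}} are hypothetical and not part of pyarrow.

```python
import subprocess
import sys


def finishes_within(args, timeout_s):
    """Run a fresh Python interpreter with `args`.

    Returns True if the child exits (with any status) within `timeout_s`
    seconds, and False if it is still running when the timeout expires.
    """
    try:
        subprocess.run([sys.executable, *args], timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before re-raising, so nothing
        # is left running in the background.
        return False
    return True


if __name__ == "__main__":
    # Hypothetical usage: the reproducer above saved as repro.py.
    # A False result means the scan did not return in time.
    print(finishes_within(["repro.py"], timeout_s=60))
```

A timeout only shows that the scan did not return within the chosen limit; it does not by itself distinguish a deadlock from a very slow scan, although for a 20-row file a generous limit should make that distinction clear.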



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
