[ https://issues.apache.org/jira/browse/ARROW-7545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015062#comment-17015062 ]
Francois Saint-Jacques commented on ARROW-7545:
-----------------------------------------------

It looks like this is a Parquet issue unrelated to Dataset. It is triggered by Dataset because this code path is not used in the Python/R "parquet_to_table" conversion path.

{code:c++}
diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index 69b9bedf6..9af4a25d0 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -1349,7 +1349,10 @@ class ByteArrayDictionaryRecordReader : public TypedRecordReader<ByteArrayType>,
   std::shared_ptr<::arrow::ChunkedArray> GetResult() override {
     FlushBuilder();
-    return std::make_shared<::arrow::ChunkedArray>(result_chunks_, builder_.type());
+    std::vector<std::shared_ptr<::arrow::Array>> result;
+    std::swap(result, result_chunks_);
+    return std::make_shared<::arrow::ChunkedArray>(std::move(result),
+                                                   builder_.type());
   }
{code}

> [C++] [Dataset] Scanning dataset with dictionary type hangs
> -----------------------------------------------------------
>
>                 Key: ARROW-7545
>                 URL: https://issues.apache.org/jira/browse/ARROW-7545
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Francois Saint-Jacques
>            Priority: Critical
>              Labels: dataset
>             Fix For: 0.16.0
>
>
> I assume it is an issue on the C++ side of the datasets code, but the
> reproducer is in Python.
> I create a small Parquet file with a single column of dictionary type.
> Reading it with {{pq.read_table}} works fine, but reading it with the datasets
> machinery hangs when scanning:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> # Write a small Parquet file with a single dictionary-typed column.
> df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
> arrow_table = pa.Table.from_pandas(df)
> filename = "test.parquet"
> pq.write_table(arrow_table, filename)
>
> from pyarrow.fs import LocalFileSystem
> from pyarrow.dataset import (
>     ParquetFileFormat, Dataset,
>     FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions)
>
> filesystem = LocalFileSystem()
> format = ParquetFileFormat()
> options = FileSystemDiscoveryOptions()
> discovery = FileSystemDataSourceDiscovery(
>     filesystem, [filename], format, options)
> inspected_schema = discovery.inspect()
> dataset = Dataset([discovery.finish()], inspected_schema)
> # dataset.schema works fine and gives correct schema
> dataset.schema
> scanner_builder = dataset.new_scan()
> scanner = scanner_builder.finish()
> # this hangs
> scanner.to_table()
> {code}
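
The patch in the comment above changes GetResult() so that it swaps result_chunks_ into a local vector and returns that, leaving the member empty for the next call. A minimal Python model of that hand-off pattern is sketched below. It is not the Arrow implementation: the class, method, and helper names are hypothetical, and the comment does not spell out why the retained chunks lead to a hang, so the sketch only shows how the two variants differ across repeated calls.

```python
class ChunkAccumulator:
    """Hypothetical stand-in for a record reader that collects chunks."""

    def __init__(self):
        self._chunks = []

    def append(self, chunk):
        self._chunks.append(chunk)

    def get_result_copy(self):
        # Models the pre-patch line: the chunks are returned but also kept,
        # so every later call sees the earlier chunks again.
        return list(self._chunks)

    def get_result_handoff(self):
        # Models the patched lines: swap the internal list with an empty one
        # (the role std::swap(result, result_chunks_) plays in the diff).
        result, self._chunks = self._chunks, []
        return result


def collect(reader, getter, batches):
    """Append one batch at a time and record how many chunks each call returns."""
    sizes = []
    for batch in batches:
        reader.append(batch)
        sizes.append(len(getter()))
    return sizes


batches = ["chunk-0", "chunk-1", "chunk-2"]

keeping = ChunkAccumulator()
copy_sizes = collect(keeping, keeping.get_result_copy, batches)

handing = ChunkAccumulator()
handoff_sizes = collect(handing, handing.get_result_handoff, batches)

print(copy_sizes)     # [1, 2, 3]
print(handoff_sizes)  # [1, 1, 1]
```

In this model the copying variant returns a growing result on each call, while the hand-off variant returns only the chunks gathered since the previous call.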