[ https://issues.apache.org/jira/browse/ARROW-7545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17015062#comment-17015062 ]

Francois Saint-Jacques commented on ARROW-7545:
-----------------------------------------------

It looks like this is a Parquet issue unrelated to Dataset. It is only triggered
through Dataset because this code path is not exercised by the Python/R
"parquet_to_table" conversion path.


{code:c++}
diff --git a/cpp/src/parquet/column_reader.cc b/cpp/src/parquet/column_reader.cc
index 69b9bedf6..9af4a25d0 100644
--- a/cpp/src/parquet/column_reader.cc
+++ b/cpp/src/parquet/column_reader.cc
@@ -1349,7 +1349,10 @@ class ByteArrayDictionaryRecordReader : public TypedRecordReader<ByteArrayType>,
 
   std::shared_ptr<::arrow::ChunkedArray> GetResult() override {
     FlushBuilder();
-    return std::make_shared<::arrow::ChunkedArray>(result_chunks_, builder_.type());
+    std::vector<std::shared_ptr<::arrow::Array>> result;
+    std::swap(result, result_chunks_);
+    return std::make_shared<::arrow::ChunkedArray>(std::move(result),
+                                                   builder_.type());
   }
{code}
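
For context, a minimal self-contained sketch of the pattern the patch applies, using illustrative names (ChunkAccumulator, Drain) rather than the actual parquet internals: a reader that is polled repeatedly during a scan should hand off its accumulated chunks and leave the member vector empty, otherwise each call returns the previously delivered chunks again and a caller that loops until it gets an empty result never terminates.

{code:c++}
// Sketch only: ChunkAccumulator/Drain are illustrative stand-ins, not Arrow APIs.
#include <memory>
#include <string>
#include <utility>
#include <vector>

struct ChunkAccumulator {
  // Stand-in for the accumulated Array chunks (result_chunks_ in the patch).
  std::vector<std::shared_ptr<std::string>> chunks_;

  // Buggy variant: returns a copy but leaves the member populated, so a
  // caller looping "while (!Drain().empty())" would never see an empty result.
  std::vector<std::shared_ptr<std::string>> DrainBuggy() { return chunks_; }

  // Fixed variant, mirroring the patch: swap the accumulated chunks into a
  // local vector so the member is left empty for the next round.
  std::vector<std::shared_ptr<std::string>> Drain() {
    std::vector<std::shared_ptr<std::string>> result;
    std::swap(result, chunks_);
    return result;
  }
};

int main() {
  ChunkAccumulator acc;
  acc.chunks_.push_back(std::make_shared<std::string>("chunk-0"));
  auto first = acc.Drain();   // hands off the chunk (size 1)
  auto second = acc.Drain();  // empty, so a scan-style loop can terminate
  return first.size() == 1 && second.empty() ? 0 : 1;
}
{code}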


> [C++] [Dataset] Scanning dataset with dictionary type hangs
> -----------------------------------------------------------
>
>                 Key: ARROW-7545
>                 URL: https://issues.apache.org/jira/browse/ARROW-7545
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Assignee: Francois Saint-Jacques
>            Priority: Critical
>              Labels: dataset
>             Fix For: 0.16.0
>
>
> I assume the issue is on the C++ side of the datasets code, but the
> reproducer is in Python.
> I create a small Parquet file with a single column of dictionary type.
> Reading it with {{pq.read_table}} works fine, but reading it with the
> datasets machinery hangs when scanning:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({'a': pd.Categorical(['a', 'b']*10)})
> arrow_table = pa.Table.from_pandas(df)
> filename = "test.parquet"
> pq.write_table(arrow_table, filename)
> from pyarrow.fs import LocalFileSystem
> from pyarrow.dataset import (ParquetFileFormat, Dataset,
>                              FileSystemDataSourceDiscovery, FileSystemDiscoveryOptions)
> filesystem = LocalFileSystem()
> format = ParquetFileFormat()
> options = FileSystemDiscoveryOptions()
> discovery = FileSystemDataSourceDiscovery(
>         filesystem, [filename], format, options)
> inspected_schema = discovery.inspect()
> dataset = Dataset([discovery.finish()], inspected_schema)
> # dataset.schema works fine and gives correct schema
> dataset.schema
> scanner_builder = dataset.new_scan()
> scanner = scanner_builder.finish()
> # this hangs
> scanner.to_table()
> {code}


