[ https://issues.apache.org/jira/browse/ARROW-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17324790#comment-17324790 ]
Uwe Korn commented on ARROW-12420:
----------------------------------

Thanks [~kszucs]!

> [C++/Dataset] Reading null columns as dictionary not longer possible
> --------------------------------------------------------------------
>
>                 Key: ARROW-12420
>                 URL: https://issues.apache.org/jira/browse/ARROW-12420
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 4.0.0
>            Reporter: Uwe Korn
>            Assignee: Krisztian Szucs
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Reading a dataset with a dictionary column where some of the files don't
> contain any data for that column (and thus are typed as null) broke with
> https://github.com/apache/arrow/pull/9532. It worked with the 3.0 release
> though and thus I would consider this a regression.
> This can be reproduced using the following Python snippet:
> {code}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> table = pa.table({"a": [None, None]})
> pq.write_table(table, "test.parquet")
> schema = pa.schema([pa.field("a", pa.dictionary(pa.int32(), pa.string()))])
> fsds = ds.FileSystemDataset.from_paths(
>     paths=["test.parquet"],
>     schema=schema,
>     format=pa.dataset.ParquetFileFormat(),
>     filesystem=pa.fs.LocalFileSystem(),
> )
> fsds.to_table()
> {code}
> The exception on master is currently:
> {code}
> ---------------------------------------------------------------------------
> ArrowNotImplementedError                  Traceback (most recent call last)
> <ipython-input-14-5f0bc602f16b> in <module>
>       6     filesystem=pa.fs.LocalFileSystem(),
>       7 )
> ----> 8 fsds.to_table()
> ~/Development/arrow/python/pyarrow/_dataset.pyx in
> pyarrow._dataset.Dataset.to_table()
>     456         table : Table instance
>     457         """
> --> 458         return self._scanner(**kwargs).to_table()
>     459
>     460     def head(self, int num_rows, **kwargs):
> ~/Development/arrow/python/pyarrow/_dataset.pyx in
> pyarrow._dataset.Scanner.to_table()
>    2887             result = self.scanner.ToTable()
>    2888
> -> 2889         return pyarrow_wrap_table(GetResultValue(result))
>    2890
>    2891     def take(self, object indices):
> ~/Development/arrow/python/pyarrow/error.pxi in
> pyarrow.lib.pyarrow_internal_check_status()
>     139 cdef api int pyarrow_internal_check_status(const CStatus& status) \
>     140         nogil except -1:
> --> 141     return check_status(status)
>     142
>     143
> ~/Development/arrow/python/pyarrow/error.pxi in pyarrow.lib.check_status()
>     116             raise ArrowKeyError(message)
>     117         elif status.IsNotImplemented():
> --> 118             raise ArrowNotImplementedError(message)
>     119         elif status.IsTypeError():
>     120             raise ArrowTypeError(message)
> ArrowNotImplementedError: Unsupported cast from null to
> dictionary<values=string, indices=int32, ordered=0> (no available cast
> function for target type)
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)