Joris Van den Bossche created ARROW-5310:
--------------------------------------------
Summary: [Python] better error message on creating ParquetDataset
from empty directory
Key: ARROW-5310
URL: https://issues.apache.org/jira/browse/ARROW-5310
Project: Apache Arrow
Issue Type: Bug
Components: Python
Reporter: Joris Van den Bossche
Currently, when {{path}} is an existing but empty directory, you get:
{code:python}
>>> dataset = pq.ParquetDataset(path)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-16-346f72ae525e> in <module>
----> 1 dataset = pq.ParquetDataset(path)
~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths,
filesystem, schema, metadata, split_row_groups, validate_schema, filters,
metadata_nthreads, memory_map)
989
990 if validate_schema:
--> 991 self.validate_schemas()
992
993 if filters is not None:
~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self)
1025 self.schema = self.common_metadata.schema
1026 else:
-> 1027 self.schema = self.pieces[0].get_metadata().schema
1028 elif self.schema is None:
1029 self.schema = self.metadata.schema
IndexError: list index out of range
{code}
This should raise a clearer error message instead of a bare IndexError.
Or do we actually want to allow this? I am not sure there are good use cases
for supporting empty directories, since we cannot get any schema or metadata
information from an empty directory.
It only fails when validating the schemas, so with
{{validate_schema=False}} it actually returns a ParquetDataset object, just
with an empty list for {{pieces}} and no schema. So it would also be easy to
skip the error during schema validation for this empty-directory case.
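To illustrate the first option (a nicer error), here is a minimal standalone sketch of the kind of check I have in mind. It does not use pyarrow at all, and {{find_data_files}} and {{check_not_empty}} are hypothetical names made up for this example, not existing pyarrow functions. The file-skipping rule (names starting with "_" or ".") is also an assumption for illustration.

```python
import os
import tempfile


def find_data_files(path):
    """Hypothetical helper: collect candidate data files under ``path``."""
    found = []
    for root, _dirs, files in os.walk(path):
        for name in sorted(files):
            # Assumption: skip hidden files and metadata files such as
            # _metadata / _common_metadata.
            if name.startswith(("_", ".")):
                continue
            found.append(os.path.join(root, name))
    return found


def check_not_empty(path):
    """Raise a descriptive ValueError instead of an IndexError on ``pieces[0]``."""
    pieces = find_data_files(path)
    if not pieces:
        raise ValueError(
            "No Parquet files found in directory {!r}; "
            "cannot infer a schema".format(path)
        )
    return pieces


if __name__ == "__main__":
    # An existing but empty directory now produces a readable message.
    with tempfile.TemporaryDirectory() as empty_dir:
        try:
            check_not_empty(empty_dir)
        except ValueError as exc:
            print(exc)
```

In the real code, the equivalent check would presumably go right before {{self.pieces[0]}} is accessed in {{validate_schemas}}, but the exact place and wording are open for discussion.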