Joris Van den Bossche created ARROW-5310:
--------------------------------------------

             Summary: [Python] better error message on creating ParquetDataset from empty directory
                 Key: ARROW-5310
                 URL: https://issues.apache.org/jira/browse/ARROW-5310
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: Joris Van den Bossche


Currently, when {{path}} is an existing but empty directory, you get the following error:

{code:python}
>>> dataset = pq.ParquetDataset(path)
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-16-346f72ae525e> in <module>
----> 1 dataset = pq.ParquetDataset(path)

~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map)
    989
    990         if validate_schema:
--> 991             self.validate_schemas()
    992
    993         if filters is not None:

~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self)
   1025                 self.schema = self.common_metadata.schema
   1026             else:
-> 1027                 self.schema = self.pieces[0].get_metadata().schema
   1028         elif self.schema is None:
   1029             self.schema = self.metadata.schema

IndexError: list index out of range
{code}

That could be a nicer error message. Unless we actually want to allow this? (Although I am not sure there are good use cases for supporting empty directories, because we cannot get any schema or metadata information from an empty directory.)

It only fails when validating the schemas, so with {{validate_schema=False}} it actually returns a ParquetDataset object, just with an empty list for {{pieces}} and no schema. So it would also be easy to skip the error during schema validation for this empty-directory case.
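
As a rough illustration of the "nicer error message" option, here is a minimal standalone sketch of a pre-check that raises a descriptive {{ValueError}} instead of the bare {{IndexError}}. The helper name {{ensure_directory_has_files}} and the wording of the message are hypothetical and only meant to show the idea; this is not existing pyarrow code and it does not touch {{ParquetDataset}} itself.

```python
import os
import tempfile


def ensure_directory_has_files(path):
    """Hypothetical pre-check: fail early, with a clear message, on an empty directory."""
    if os.path.isdir(path):
        # Walk the whole tree so that files in nested (partition) directories count too.
        has_files = any(files for _, _, files in os.walk(path))
        if not has_files:
            raise ValueError(
                "Cannot create a ParquetDataset from {!r}: the directory "
                "contains no files".format(path)
            )
    return path


# Reproduce the empty-directory case from the report with a temporary directory.
with tempfile.TemporaryDirectory() as empty_dir:
    try:
        ensure_directory_has_files(empty_dir)
    except ValueError as exc:
        message = str(exc)
        print(message)
```

A check along these lines could run before the schemas are validated. If we decide instead that empty directories should be allowed, the alternative would be to skip the {{self.pieces[0]}} lookup when {{pieces}} is empty.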