George Sakkis created ARROW-4076:
------------------------------------

             Summary: [Python] schema validation and filters
                 Key: ARROW-4076
                 URL: https://issues.apache.org/jira/browse/ARROW-4076
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
            Reporter: George Sakkis

Currently, [schema validation|https://github.com/apache/arrow/blob/758bd557584107cb336cbc3422744dacd93978af/python/pyarrow/parquet.py#L900] of {{ParquetDataset}} takes place before filtering. This may raise a {{ValueError}} if the schema differs in some dataset pieces, even if those pieces would subsequently be filtered out. I think validation should happen after filtering to prevent such spurious errors:

{noformat}
--- a/pyarrow/parquet.py
+++ b/pyarrow/parquet.py
@@ -878,13 +878,13 @@
         if split_row_groups:
             raise NotImplementedError("split_row_groups not yet implemented")
 
-        if validate_schema:
-            self.validate_schemas()
-
         if filters is not None:
             filters = _check_filters(filters)
             self._filter(filters)
 
+        if validate_schema:
+            self.validate_schemas()
+
     def validate_schemas(self):
         open_file = self._get_open_file_func()
{noformat}
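
To illustrate why the order matters, here is a minimal toy model in plain Python. It does not use pyarrow: {{Piece}}, {{keep}}, and {{load}} are hypothetical names invented for this sketch, and the schema check is a simple equality comparison, which is only an assumption about what {{validate_schemas}} does in spirit.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    """Hypothetical stand-in for one dataset piece (not a pyarrow class)."""
    path: str
    partition: dict  # partition key/value pairs, e.g. {"year": 2018}
    schema: tuple    # (column name, type name) pairs


def validate_schemas(pieces):
    """Raise ValueError if any piece's schema differs from the first piece's."""
    if not pieces:
        return
    expected = pieces[0].schema
    for piece in pieces[1:]:
        if piece.schema != expected:
            raise ValueError(
                f"Schema in {piece.path} differs from {pieces[0].path}"
            )


def keep(pieces, key, value):
    """Toy equality filter on a single partition key."""
    return [p for p in pieces if p.partition.get(key) == value]


def load(pieces, key, value, validate_first):
    """Apply validation and filtering in either order."""
    if validate_first:
        # Current order: every piece is validated, even ones dropped later.
        validate_schemas(pieces)
        return keep(pieces, key, value)
    # Proposed order: only the pieces that survive the filter are validated.
    selected = keep(pieces, key, value)
    validate_schemas(selected)
    return selected


# The 2017 piece has a different schema but is excluded by the filter.
pieces = [
    Piece("year=2017/a.parquet", {"year": 2017}, (("id", "int32"),)),
    Piece("year=2018/b.parquet", {"year": 2018}, (("id", "int64"),)),
    Piece("year=2018/c.parquet", {"year": 2018}, (("id", "int64"),)),
]
```

In this sketch, {{load(pieces, "year", 2018, validate_first=True)}} raises {{ValueError}} because of the 2017 piece, while {{validate_first=False}} returns the two 2018 pieces without error.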