[ https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354594#comment-16354594 ]
Uwe L. Korn commented on ARROW-2079:
------------------------------------

1) Yes, this should be {{common_metadata_path}}, but we could also use {{metadata_path}} here if {{common_metadata_path}} is None.

2) Yes, this also looks fine.

{{_common_metadata}} is simply a Parquet file that contains the schema of all files that are part of a dataset. {{_metadata}} is the same, but it additionally contains the metadata of each RowGroup that is part of the dataset, i.e. {{_metadata}} is a superset of {{_common_metadata}}.

> Possibly use `_common_metadata` for schema if `_metadata` isn't available
> -------------------------------------------------------------------------
>
>                 Key: ARROW-2079
>                 URL: https://issues.apache.org/jira/browse/ARROW-2079
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Minor
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not
> `_metadata`. From what I understand these are intended to contain the dataset
> schema but not any row group information.
>
> A few (possibly naive) questions:
>
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:java}
> if self.metadata_path is not None:
>     with self.fs.open(self.metadata_path) as f:
>         self.common_metadata = ParquetFile(f).metadata
> else:
>     self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`,
> as the latter is never written by `pyarrow`, and is given by the `_metadata`
> file instead of `_common_metadata` (as seemingly intended?).
>
> 2. In `validate_schemas` I believe an option should exist for using the
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently
> only writes the former, and as far as I can tell `_common_metadata` does
> include all the schema information needed.
>
> Perhaps the logic in `validate_schemas` could be ported over to:
>
> {code:java}
> if self.schema is not None:
>     pass  # schema explicitly provided
> elif self.metadata is not None:
>     self.schema = self.metadata.schema
> elif self.common_metadata is not None:
>     self.schema = self.common_metadata.schema
> else:
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> {code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear
> to me the difference between `_common_metadata` and `_metadata`, but I
> believe the schema in both should be the same. Figured I'd open this for
> discussion.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
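
For reference, the two fallbacks discussed in this thread can be sketched in plain Python. This is only a minimal sketch under stated assumptions, not pyarrow's implementation: `pick_common_metadata_path` and `resolve_schema` are hypothetical helper names, and the schemas are passed in as plain values instead of being read from Parquet files.

```python
from typing import Any, Callable, Optional


def pick_common_metadata_path(common_metadata_path: Optional[str],
                              metadata_path: Optional[str]) -> Optional[str]:
    """Hypothetical helper: choose the file to read the shared schema from.

    Prefers `_common_metadata`; falls back to `_metadata`, which per the
    comment above is a superset and therefore also carries the schema.
    """
    if common_metadata_path is not None:
        return common_metadata_path
    return metadata_path


def resolve_schema(explicit_schema: Any,
                   metadata_schema: Any,
                   common_metadata_schema: Any,
                   read_first_piece_schema: Callable[[], Any]) -> Any:
    """Hypothetical helper mirroring the if/elif chain proposed in the issue.

    Order: explicit schema, then `_metadata`, then `_common_metadata`,
    then the schema of the first dataset piece.
    """
    for candidate in (explicit_schema, metadata_schema, common_metadata_schema):
        if candidate is not None:
            return candidate
    # Only touch a data file when no metadata source is available.
    return read_first_piece_schema()


# Usage with stand-in values (strings in place of real schema objects).
path = pick_common_metadata_path(None, "dataset/_metadata")
schema = resolve_schema(None, None, "schema-from-_common_metadata",
                        lambda: "schema-from-first-piece")
```

Passing the first-piece lookup as a callable keeps it lazy, so a data file is opened only when neither metadata file nor an explicit schema is available.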