[ 
https://issues.apache.org/jira/browse/ARROW-2079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16354594#comment-16354594
 ] 

Uwe L. Korn commented on ARROW-2079:
------------------------------------

1) Yes, this should be {{common_metadata_path}}, but we could also fall back to 
{{metadata_path}} here if {{common_metadata_path}} is None.

2) Yes, this also looks fine.
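
As a rough sketch of the fallback described in 1): the helper below is 
hypothetical (it is not an existing pyarrow function) and only illustrates the 
order of preference, assuming both paths are either a string or None.

```python
from typing import Optional


def pick_common_metadata_path(common_metadata_path: Optional[str],
                              metadata_path: Optional[str]) -> Optional[str]:
    """Hypothetical helper: choose which file to read the common metadata from.

    Prefer the _common_metadata file; if it is absent, fall back to the
    _metadata file, which carries the same schema. Returns None when neither
    file exists.
    """
    if common_metadata_path is not None:
        return common_metadata_path
    return metadata_path
```

With something like this, {{ParquetDataset.__init__}} would only set 
{{common_metadata}} to None when neither file is present.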

 

{{_common_metadata}} is simply a Parquet file that contains the schema of all 
files that are part of a dataset. {{_metadata}} is the same, but it additionally 
contains the metadata of each RowGroup that is part of the dataset, i.e. 
{{_metadata}} is a superset of {{_common_metadata}}.
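
To illustrate the relationship, here is a toy model in plain Python. The class 
and field names are made up for illustration and do not mirror the real Parquet 
footer layout or any pyarrow class; the only point is that dropping the row 
group information from {{_metadata}} leaves you with {{_common_metadata}}.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy schema representation: a tuple of (column name, type name) pairs.
Schema = Tuple[Tuple[str, str], ...]


@dataclass(frozen=True)
class CommonMetadata:
    """Stand-in for _common_metadata: the dataset schema only."""
    schema: Schema


@dataclass(frozen=True)
class RowGroupInfo:
    """Stand-in for the per-RowGroup metadata stored in _metadata."""
    file_path: str
    num_rows: int


@dataclass(frozen=True)
class FullMetadata:
    """Stand-in for _metadata: the schema plus every RowGroup's metadata."""
    schema: Schema
    row_groups: Tuple[RowGroupInfo, ...] = ()

    def to_common(self) -> CommonMetadata:
        # Dropping the row group information yields the common metadata.
        return CommonMetadata(schema=self.schema)


schema: Schema = (("id", "int64"), ("name", "string"))
full = FullMetadata(
    schema=schema,
    row_groups=(
        RowGroupInfo("part-0.parquet", 100),
        RowGroupInfo("part-1.parquet", 50),
    ),
)
common = full.to_common()
```

This is also why either file can be used to obtain the dataset schema.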

> Possibly use `_common_metadata` for schema if `_metadata` isn't available
> -------------------------------------------------------------------------
>
>                 Key: ARROW-2079
>                 URL: https://issues.apache.org/jira/browse/ARROW-2079
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Jim Crist
>            Priority: Minor
>
> Currently pyarrow's parquet writer only writes `_common_metadata` and not 
> `_metadata`. From what I understand these are intended to contain the dataset 
> schema but not any row group information.
>  
> A few (possibly naive) questions:
>  
> 1. In the `__init__` for `ParquetDataset`, the following lines exist:
> {code:python}
> if self.metadata_path is not None:
>     with self.fs.open(self.metadata_path) as f:
>         self.common_metadata = ParquetFile(f).metadata
> else:
>     self.common_metadata = None
> {code}
> I believe this should use `common_metadata_path` instead of `metadata_path`, 
> as the latter is never written by `pyarrow`, and is given by the `_metadata` 
> file instead of `_common_metadata` (as seemingly intended?).
>  
> 2. In `validate_schemas` I believe an option should exist for using the 
> schema from `_common_metadata` instead of `_metadata`, as pyarrow currently 
> only writes the former, and as far as I can tell `_common_metadata` does 
> include all the schema information needed.
>  
> Perhaps the logic in `validate_schemas` could be ported over to:
>  
> {code:python}
> if self.schema is not None:
>     pass  # schema explicitly provided
> elif self.metadata is not None:
>     self.schema = self.metadata.schema
> elif self.common_metadata is not None:
>     self.schema = self.common_metadata.schema
> else:
>     self.schema = self.pieces[0].get_metadata(open_file).schema
> {code}
> If these changes are valid, I'd be happy to submit a PR. It's not 100% clear 
> to me the difference between `_common_metadata` and `_metadata`, but I 
> believe the schema in both should be the same. Figured I'd open this for 
> discussion.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
