[ 
https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16820090#comment-16820090
 ] 

Martin Durant commented on ARROW-1983:
--------------------------------------

> If readers would be able to use metadata from a separate file (not sure if 
> parquet format would allow it), duplicating metadata storage in both 
> approaches could be avoided.

Yes, this is exactly what should happen: the metadata file contains all of the 
metadata (with paths pointing to the actual data files), and each data file 
contains metadata for only its own row groups. Each row-group definition thus 
appears in two places, and the schema is repeated in every file. This 
duplication is desirable because it allows viewing the data as a complete set 
or reading each file independently.

If reading via the separate metadata file, the reader does not need to touch 
the footers of the data files at all, since it already has everything it needs.
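
To make the layout concrete, here is a toy model in plain Python. It does not 
use any Parquet library; the classes RowGroup and Footer, the column "x", and 
the file names are all hypothetical stand-ins chosen for illustration. It only 
sketches the idea that the combined footer repeats each row-group definition 
with a file path attached, so row groups can be selected from statistics alone.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class RowGroup:
    """Hypothetical stand-in for one row-group definition."""
    file_path: Optional[str]  # None inside a data file's own footer
    num_rows: int
    min_x: int  # summary statistics for an assumed column "x"
    max_x: int


@dataclass
class Footer:
    """Hypothetical stand-in for a footer: a schema plus row groups."""
    schema: Tuple[str, ...]
    row_groups: List[RowGroup] = field(default_factory=list)


def build_metadata_file(footers: Dict[str, Footer]) -> Footer:
    """Combine per-file footers into one footer whose row groups carry paths."""
    schemas = {f.schema for f in footers.values()}
    if len(schemas) != 1:
        raise ValueError("all data files must share one schema")
    combined = Footer(schema=schemas.pop())
    for path, footer in footers.items():
        for rg in footer.row_groups:
            # Same definition as in the data file, with the path filled in.
            combined.row_groups.append(replace(rg, file_path=path))
    return combined


def select_row_groups(metadata: Footer, lo: int, hi: int) -> List[Tuple[str, int]]:
    """Return (path, num_rows) for row groups whose x-range overlaps [lo, hi]."""
    return [
        (rg.file_path, rg.num_rows)
        for rg in metadata.row_groups
        if rg.max_x >= lo and rg.min_x <= hi
    ]


schema = ("x",)
footers = {
    "part.0.parquet": Footer(schema, [RowGroup(None, 100, 0, 99),
                                      RowGroup(None, 100, 100, 199)]),
    "part.1.parquet": Footer(schema, [RowGroup(None, 50, 200, 249)]),
}
metadata = build_metadata_file(footers)

# Filtering consults only the combined footer, never the per-file ones.
hits = select_row_groups(metadata, 150, 220)
```

In this sketch the per-file footers stay untouched, so each file remains 
readable on its own, while the combined footer is enough to plan a read.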

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet
>             Fix For: 0.14.0
>
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then pass these objects as a list 
> to a new function, which hands them on as C++ objects to {{parquet-cpp}} to 
> generate the respective {{_metadata}} file.


