[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841417#comment-16841417 ]

Joris Van den Bossche commented on ARROW-1983:
----------------------------------------------

Copying here the questions that [~pearu] mentioned above / asked on GitHub:

{quote}2. After collecting all file metadata of dataset pieces (possibly from 
different dask processes), what would be the desired interface for writing and 
reading the dataset metadata instances? Here's an initial proposal for Python:

{code:python}
def write_dataset_metadata(metadata_list, where):
    ...

def read_dataset_metadata(where, memory_map=False):
    ...
    return metadata_list
{code}

3. Clearly, each file metadata instance in `metadata_list` contains some 
information (such as the schema) that is the same across all metadata 
instances. Would it make sense, and be useful, to introduce a new 
`DatasetMetadata` class (in C++, also exposed to Python) that would collect 
the metadata information of the dataset pieces in a more compact way and 
would also provide I/O methods? (This also addresses question 2.){quote}
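
For illustration, here is a minimal sketch of how the proposed interface could 
be implemented on top of pyarrow's existing {{FileMetaData}} object. It assumes 
a pyarrow version in which {{FileMetaData.append_row_groups}} and 
{{FileMetaData.write_metadata_file}} are available; the two function names come 
from the proposal above and are hypothetical, not an existing pyarrow API.

{code:python}
import pyarrow.parquet as pq

def write_dataset_metadata(metadata_list, where):
    # Merge the per-piece FileMetaData objects by appending their row groups
    # (note: this mutates the first element of metadata_list), then write the
    # combined footer out as a `_metadata` file.
    combined = metadata_list[0]
    for piece_metadata in metadata_list[1:]:
        combined.append_row_groups(piece_metadata)
    combined.write_metadata_file(where)

def read_dataset_metadata(where, memory_map=False):
    # Read the `_metadata` file back. This returns a single FileMetaData
    # object holding the row groups of all pieces, rather than the original
    # list of per-piece metadata objects.
    return pq.read_metadata(where, memory_map=memory_map)
{code}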



> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then pass these objects as a list 
> to a new function, which would hand them on as C++ objects to {{parquet-cpp}} 
> to generate the respective {{_metadata}} file.
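
As a hedged sketch of the read side described in the issue: once a 
{{_metadata}} file exists, row groups can be filtered on their statistics 
without opening the individual data files. This assumes a pyarrow version that 
exposes per-row-group statistics on {{ColumnChunkMetaData}}; the dataset path, 
column index, and predicate below are illustrative only.

{code:python}
import pyarrow.parquet as pq

# Read the combined footer written alongside the dataset pieces.
dataset_metadata = pq.read_metadata("dataset/_metadata")

# Keep only the files that have at least one row group possibly matching the
# (illustrative) predicate "value of the first column > 100".
selected_files = set()
for i in range(dataset_metadata.num_row_groups):
    column_chunk = dataset_metadata.row_group(i).column(0)
    stats = column_chunk.statistics
    if stats is not None and stats.has_min_max and stats.max > 100:
        selected_files.add(column_chunk.file_path)

tables = [pq.read_table("dataset/" + path) for path in sorted(selected_files)]
{code}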


