[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826178#comment-16826178 ]

Martin Durant commented on ARROW-1983:
--------------------------------------

I don't know about it being deprecated, and I wouldn't hold my breath for a 
Parquet 2.0... The metadata file is very useful in the context of Dask: it 
lets us know which files exist in the dataset without doing a glob, and it 
lets us do simple filtering on max/min values without having to touch every 
file, which can be very slow on remote storage.
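To illustrate that kind of min/max filtering with pyarrow, here is a minimal 
sketch; the dataset path, column index and value range are hypothetical, and a 
{{_metadata}} file is assumed to already exist:

{code:python}
import pyarrow.parquet as pq

# Read the summary footer once, instead of touching every data file.
meta = pq.read_metadata("dataset/_metadata")  # hypothetical path

# Keep only row groups whose min/max range for column 0 overlaps [10, 20].
keep = []
for i in range(meta.num_row_groups):
    stats = meta.row_group(i).column(0).statistics
    if stats is None or not stats.has_min_max or (
            stats.min <= 20 and stats.max >= 10):
        keep.append(i)  # can't rule it out, or it overlaps: read it
print("row groups to read:", keep)
{code}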

The spec certainly allows for a metadata file, even if it could be seen as 
more of a convention inherited from Hive: a column chunk allows a file_path 
attribute, indicating that the data itself lives in another file (which, 
strictly speaking, doesn't need to be a proper parquet file, but in practice 
always is).
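That file_path field is visible from Python as well; a quick sketch (the path 
is hypothetical):

{code:python}
import pyarrow.parquet as pq

md = pq.read_metadata("dataset/_metadata")  # hypothetical path
cc = md.row_group(0).column(0)
# A non-empty file_path means this chunk's data lives in a separate file.
print(cc.file_path)  # e.g. "part.0.parquet"
{code}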

You could indeed construct the metadata by repeatedly appending to a file, 
since you will know that everything in the file is the thrift footer block 
except for the PAR1 bytes. However, you will not preserve row-group order in 
general, which might be important, and writes had better be atomic. 
Furthermore, remote storage will usually *not allow* small appends from 
multiple processes in this way.
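A safer variant is to merge the per-file footers in a single process and write 
the result exactly once, controlling row-group order explicitly. A sketch 
using pyarrow's FileMetaData methods (the part file names are hypothetical):

{code:python}
import pyarrow.parquet as pq

# Sort to fix the row-group order deterministically.
paths = sorted(["part.0.parquet", "part.1.parquet"])  # hypothetical parts

merged = None
for p in paths:
    md = pq.read_metadata(p)
    md.set_file_path(p)  # point each column chunk at its data file
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)

# One write from one process, rather than many small appends.
merged.write_metadata_file("_metadata")
{code}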

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
>
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then pass these objects as a list 
> to a new function, which hands them on as C++ objects to {{parquet-cpp}} to 
> generate the respective {{_metadata}} file.
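
For reference, roughly the shape this took via the metadata_collector hooks in 
pyarrow (a sketch, not the definitive API surface; the table contents and root 
path are hypothetical):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3]})  # hypothetical data
collector = []

# Collect per-file metadata (with file paths filled in) while writing.
pq.write_to_dataset(table, "dataset", metadata_collector=collector)

# Schema only, no row-group statistics.
pq.write_metadata(table.schema, "dataset/_common_metadata")

# Row-group statistics for every written file.
pq.write_metadata(table.schema, "dataset/_metadata",
                  metadata_collector=collector)
{code}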


