[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826178#comment-16826178 ]
Martin Durant commented on ARROW-1983:
--------------------------------------

I don't know about deprecated, and I wouldn't hold my breath over a Parquet 2.0... the metadata file is very useful in the context of Dask, so that we can know what files exist in the dataset without doing a glob, and we can do simple filtering on max/min values without having to touch every file, which can be very slow on remote storage.

The spec certainly allows for a metadata file even if it could be seen as more of a convention from Hive: a column chunk allows for a file_path attribute, indicating that the data itself is in another file (which, actually, doesn't need to be a proper parquet file, but in practice always is).

You could indeed construct the metadata by repeatedly appending to a file, since you will know that everything in the file is the thrift footer block except for the PAR1 bytes. However, you will not preserve row-group order in general, which might be important, and writes had better be atomic. Furthermore, remote storage will usually *not allow* small appends from multiple processes in this way.

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file
> (mostly just schema information). It would be useful to add the ability to
> write a {{_metadata}} file as well. This should include information about
> each row group in the dataset, including summary statistics. Having this
> summary file would allow filtering of row groups without needing to access
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of
> a {{pyarrow.parquet.write_table}} call and then give these objects as a list
> to a new function that then passes them on as C++ objects to {{parquet-cpp}}
> that generates the respective {{_metadata}} file.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
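
The max/min filtering mentioned in the comment can be illustrated with a minimal, library-free sketch. It assumes the per-row-group statistics have already been read out of a `_metadata` file. The `RowGroupStats` class, the `prune_row_groups` helper, and the file names and values are all hypothetical and are not part of pyarrow or the Parquet spec; only the idea of a `file_path` per column chunk comes from the discussion above.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for one row-group entry read from `_metadata`."""
    file_path: str  # data file that holds this row group
    col_min: int    # minimum value of the filtered column in this row group
    col_max: int    # maximum value of the filtered column in this row group


def prune_row_groups(stats, lower, upper):
    """Keep only row groups whose [col_min, col_max] overlaps [lower, upper]."""
    return [rg for rg in stats if rg.col_max >= lower and rg.col_min <= upper]


# Made-up statistics for three data files; a real reader would take these
# from the footer of the `_metadata` file instead of hard-coding them.
row_groups = [
    RowGroupStats("part.0.parquet", 0, 99),
    RowGroupStats("part.1.parquet", 100, 199),
    RowGroupStats("part.2.parquet", 200, 299),
]

# Select files that may contain values in the range 150..220.
selected = prune_row_groups(row_groups, lower=150, upper=220)
files_to_read = sorted({rg.file_path for rg in selected})
print(files_to_read)
```

With these made-up values, only `part.1.parquet` and `part.2.parquet` need to be opened; `part.0.parquet` is skipped without touching it, which is where the saving on remote storage would come from.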