[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826171#comment-16826171 ]

Martin Durant commented on ARROW-1983:
--------------------------------------

I don't know about "deprecated" (or whether a Parquet 2.0 is really a thing), 
but...
It is definitely useful for the Dask case of wanting to know which files exist 
and the per-column max/min values for basic filtering, without having to touch 
all the files before making that decision.
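
For that filtering case, a minimal sketch of the reading side (assuming a 
dataset whose {{_metadata}} file already exists; {{read_metadata}} and the 
statistics accessors are current pyarrow API, the paths are made up):

{code:python}
import pyarrow.parquet as pq

# Read only the summary footer; none of the data files are touched.
meta = pq.read_metadata("dataset/_metadata")

for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        stats = col.statistics
        if stats is not None and stats.has_min_max:
            # file_path says which data file holds this row group;
            # min/max allow filtering before any data file is opened.
            print(col.file_path, col.path_in_schema, stats.min, stats.max)
{code}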

The metadata file was indeed adopted by Hive, and could be seen as a 
convention, but it is certainly allowed by the spec, since the column_chunk 
structure allows a file path, and so a row group can have its data in another 
file. You can indeed just append to such a file as you go: when the file 
contains no actual data, you know exactly how much space is taken up before 
the start of the thrift block (4 bytes, the PAR1 magic). However, the writes 
must be atomic, and this will in general *not work* for remote storage, where 
small in-place appends don't behave as you might assume.
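
To make the atomicity point concrete, a sketch of the safer merge-then-write-once 
pattern (assuming pyarrow's {{FileMetaData.set_file_path}}, {{append_row_groups}} 
and {{write_metadata_file}} methods; the file layout is made up for illustration):

{code:python}
import pyarrow.parquet as pq

# Gather each data file's footer, record where its row groups live,
# then write one _metadata file in a single shot instead of appending.
parts = ["part-0.parquet", "part-1.parquet"]

merged = None
for part in parts:
    md = pq.read_metadata(f"dataset/{part}")
    md.set_file_path(part)  # stored in each column_chunk, as above
    if merged is None:
        merged = md
    else:
        merged.append_row_groups(md)  # schemas must match

merged.write_metadata_file("dataset/_metadata")
{code}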

> [Python] Add ability to write parquet `_metadata` file
> ------------------------------------------------------
>
>                 Key: ARROW-1983
>                 URL: https://issues.apache.org/jira/browse/ARROW-1983
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Python
>            Reporter: Jim Crist
>            Priority: Major
>              Labels: beginner, parquet, pull-request-available
>             Fix For: 0.14.0
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file 
> (mostly just schema information). It would be useful to add the ability to 
> write a {{_metadata}} file as well. This should include information about 
> each row group in the dataset, including summary statistics. Having this 
> summary file would allow filtering of row groups without needing to access 
> each file beforehand.
> This would require that the user is able to get the written RowGroups out of 
> a {{pyarrow.parquet.write_table}} call and then pass these objects as a list 
> to a new function, which would hand them off as C++ objects to 
> {{parquet-cpp}} to generate the respective {{_metadata}} file.
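
For reference, a minimal sketch of how this looks with the 
{{metadata_collector}} keyword in released pyarrow (the issue's fix version is 
0.14.0, so treat the exact version and spelling as approximate):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": [1, 2, 3], "y": ["a", "b", "c"]})

# Collect one FileMetaData per written file, with file paths filled in ...
collector = []
pq.write_to_dataset(table, root_path="dataset", metadata_collector=collector)

# ... then combine the collected row groups into the _metadata summary file.
pq.write_metadata(table.schema, "dataset/_metadata", metadata_collector=collector)
{code}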


