[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853433#comment-16853433 ] Wes McKinney commented on ARROW-1983:

Yes. I don't think it is necessary to resolve all of this in a single patch, so we can open a follow-up JIRA to implement the optimization to read a row group given a {{_metadata}} file. There is some other complexity there, such as how to open the file path (you need a FileSystem handle -- see the filesystem API work that is in progress).

> [Python] Add ability to write parquet `_metadata` file
>
> Key: ARROW-1983
> URL: https://issues.apache.org/jira/browse/ARROW-1983
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Reporter: Jim Crist
> Priority: Major
> Labels: beginner, parquet, pull-request-available
> Fix For: 0.14.0
> Time Spent: 6.5h
> Remaining Estimate: 0h
>
> Currently {{pyarrow.parquet}} can only write the {{_common_metadata}} file (mostly just schema information). It would be useful to add the ability to write a {{_metadata}} file as well. This should include information about each row group in the dataset, including summary statistics. Having this summary file would allow filtering of row groups without needing to access each file beforehand.
> This would require that the user is able to get the written RowGroups out of a {{pyarrow.parquet.write_table}} call and then give these objects as a list to a new function that then passes them on as C++ objects to {{parquet-cpp}} that generates the respective {{_metadata}} file.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853430#comment-16853430 ] Rick Zamora commented on ARROW-1983:

Right, I see what you are saying. You can pass a list of files to {{pq.ParquetDataset}} (obtained by calling {{read_metadata}} on the metadata file), but the footer metadata will be unnecessarily parsed a second time. For Dask, this is probably not much of an issue, because each worker will only be dealing with a subset of the global dataset. In many other cases this is clearly undesirable.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853383#comment-16853383 ] Wes McKinney commented on ARROW-1983:

Well, one issue is how to use the {{_metadata}} file to read data from the files it lists without having to parse those files' respective metadata again. I think this may require a little bit of refactoring in the Parquet C++ library.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853378#comment-16853378 ] Rick Zamora commented on ARROW-1983:

I submitted a PR to perform the metadata aggregation and metadata-only file write ([https://github.com/apache/arrow/pull/4405]). I just synchronized with the master branch, so hopefully I can address any suggestions/concerns people have relatively quickly. Are there any additional features that we need for "utilizing" the metadata file within {{pyarrow.parquet}} itself? I believe the existing {{read_metadata}} function should be sufficient for the needs of Dask.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849187#comment-16849187 ] Joris Van den Bossche commented on ARROW-1983:

I think so, yes (at least when reading, it returns a single FileMetaData instance with all row groups). Besides the "append" operation, we also need a "write" method for such a FileMetaData instance (I suppose this only needs some work on the Python/Cython side, since this is just writing a parquet file without actual data, although I didn't check C++). There is currently a {{write_metadata}}, but that requires an *arrow* schema, not a *parquet* schema. Regarding the public API, I suppose we can modify {{write_metadata}} to also accept a parquet schema, to avoid adding an extra function. That will need some changes under the hood in {{ParquetWriter}} to be able to accept a given FileMetaData object.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849176#comment-16849176 ] Wes McKinney commented on ARROW-1983:

I see, that was not clear to me. So what we actually need is a {{FileMetaData::AppendRowGroups}} method (that merges one metadata object into another), is that correct?
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849118#comment-16849118 ] Joris Van den Bossche commented on ARROW-1983:

{quote}Correspondingly, please also write a function that parses a multiple-metadata file like{quote}
We already have {{read_metadata}} (on the Python side it actually uses the normal parquet file reading), which does read such {{_metadata}} files. Are you still looking for something else? To my understanding, it is not really a "multiple-metadata file", but a file with a single FileMetaData where the row groups of all metadata objects are combined.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849106#comment-16849106 ] Wes McKinney commented on ARROW-1983:

Probably the most flexible thing for writing would be a function that appends a metadata to an OutputStream

{code:java}
Status AppendFileMetaData(const FileMetaData& metadata,
                          arrow::io::OutputStream* out);
{code}

Correspondingly, please also write a function that parses a multiple-metadata file like

{code:java}
Status ParseMetaDataFile(arrow::io::InputStream* input,
                         std::vector<std::shared_ptr<FileMetaData>>* out);
{code}

Lastly, AFAIK we are not able to instantiate {{ParquetFileReader}} given previously-read {{FileMetaData}} -- we can probably do that in a separate JIRA.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841668#comment-16841668 ] Joris Van den Bossche commented on ARROW-1983:

{quote}We also need to make sure that the file path is being set in the metadata, otherwise the {{_metadata}} file is useless{quote}
Indeed, I opened ARROW-5349 for that earlier today. I assume this is separate from the {{WriteMultipleMetadata}} you mention above, as it is the user (e.g. Dask) who knows where the files that correspond to the different metadata objects are put?
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841625#comment-16841625 ] Wes McKinney commented on ARROW-1983:

Right. This is a relatively straightforward C++ function to write -- Pearu actually already partially implemented it in one of the patch iterations. The API would be something like

{code:java}
Status WriteMultipleMetadata(const std::vector<std::shared_ptr<FileMetaData>>& metadatas,
                             arrow::io::OutputStream* out);
{code}

Does someone want to write it? (I mean, I can do it, but it would be good for other people to get some experience with the Parquet codebase.) We also need to make sure that the file path is being set in the metadata, otherwise the {{_metadata}} file is useless.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841424#comment-16841424 ] Joris Van den Bossche commented on ARROW-1983:

{quote}what would be the desired interface for writing and reading the dataset metadata instances?{quote}
For reading, I think the existing {{parquet.read_metadata}} is sufficient? (It already works with such metadata files generated by other libraries.) For writing, the current {{parquet.write_metadata}} expects a pyarrow schema, so for that we indeed need a new method, or to adapt the existing one to also accept a parquet FileMetaData object. But for writing, I think the main part that is missing is a way to combine the list of metadata objects into a single FileMetaData object?
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841417#comment-16841417 ] Joris Van den Bossche commented on ARROW-1983:

Copying the questions that [~pearu] mentioned above / asked on GitHub here:
{quote}2. After collecting all file metadata of dataset pieces (possibly from different dask processes), what would be the desired interface for writing and reading the dataset metadata instances? Here's an initial proposal for Python:
{code:python}
def write_dataset_metadata(metadata_list, where):
    ...

def read_dataset_metadata(where, memory_map=False):
    ...
    return metadata_list
{code}
3. Clearly, each file metadata instance in `metadata_list` contains the same information as all the other metadata instances. Would it make sense and be useful to introduce a new `DatasetMetadata` class (in C++, also exposed to Python) that would collect the metadata information of dataset pieces in a more compact way, as well as provide I/O methods? (This also addresses question 2.){quote}
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833141#comment-16833141 ] Pearu Peterson commented on ARROW-1983:

For this issue, the questions raised in https://github.com/apache/arrow/pull/4236#issuecomment-489017226 need to be addressed.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833140#comment-16833140 ] Pearu Peterson commented on ARROW-1983:

ARROW-5258 provides a way to collect the file metadata objects created by the {{write_to_dataset}} function.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831195#comment-16831195 ] Pearu Peterson commented on ARROW-1983:

Arrow [PR 4236|https://github.com/apache/arrow/pull/4236] implements another approach for collecting the file metadata instances of dataset pieces: {{write_to_dataset}} can take a {{metadata_collector}} keyword argument whose list value will be filled with the file metadata instances.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826179#comment-16826179 ] Martin Durant commented on ARROW-1983:

I don't know about deprecated, and I wouldn't hold my breath over a Parquet 2.0... The metadata file is very useful in the context of Dask, so that we can know what files exist in the dataset without doing a glob, and we can do simple filtering on max/min values without having to touch every file, which can be very slow on remote storage. The spec certainly allows for a metadata file, even if it could be seen as more of a convention from Hive: a column chunk allows for a file_path attribute, indicating that the data itself is in another file (which, actually, doesn't need to be a proper parquet file, but in practice always is). You could indeed construct the metadata by repeatedly appending to a file, since you know that everything in the file is the thrift footer block except for the PAR1 bytes. However, you will not preserve row-group order in general, which might be important, and the writes had better be atomic. Furthermore, remote storage will usually *not allow* small appends from multiple processes in this way.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826171#comment-16826171 ] Martin Durant commented on ARROW-1983: --
I don't know about "deprecated" (or whether a Parquet 2.0 is really a thing), but... It is definitely useful for the Dask case of wanting to know what files exist, and the max/min values for basic filtering, without having to touch all the files before making that decision. The metadata file was indeed taken on by Hive, and could be seen as a convention, but it is certainly allowed by the spec, since the column_chunk structure allows for a path, and so a row-group can have its data in another file. You can indeed just append to a file as you go, since in the case that there is no actual data, you know exactly how much space is taken up before the start of the thrift block (4 bytes: PAR1). However, the writes must be atomic, and will in general *not work* for remote storage, where small appends don't happen as you might assume.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16823112#comment-16823112 ] Wes McKinney commented on ARROW-1983: -
I'm just returning from vacation and catching up on e-mail. This effort is a priority for me, so I will review the discussion and PR and give feedback as soon as I can.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820504#comment-16820504 ] Pearu Peterson commented on ARROW-1983: ---
Arrow [PR 4166|https://github.com/apache/arrow/pull/4166] implements approach 1 above.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820090#comment-16820090 ] Martin Durant commented on ARROW-1983: --
> If readers would be able to use metadata from a separate file (not sure if parquet format would allow it), duplicating metadata storage in both approaches could be avoided.

Yes, this is exactly what should happen: the metadata file contains all of the metadata (with paths pointing to the actual data files), and the data files contain metadata with only their own row-groups specified. Each row-group definition thus appears in two places, and the schema is repeated many times. This duplication is desirable because it allows viewing the data as a complete set, or each file independently. If reading via the separate metadata file, the reader does not need to touch the footers of the data files at all, since it already has everything it needs.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16820084#comment-16820084 ] Pearu Peterson commented on ARROW-1983: ---
There seem to be two options for writing a separate metadata file from Arrow:
# Following Wes's comment "On the C++ side we would expose an API to append row group metadata into a common file.", introduce a second sink argument to ParquetFileWriter::Open that would be used for collecting FileMetaData content during the write. [~wesmckinn], can you confirm that this would be the right approach?
# Introduce a flag to ParquetFileWriter that, when enabled, skips all data writes and writes only the FileMetaData content. As a result, one would need to call the dataset write twice: once for writing data (and metadata) as currently, and a second time for writing metadata only (the writes would be collected into a single file).

Comparing the two, approach 2 is simpler but suboptimal, since the writing process is executed twice. In both cases the metadata would be stored twice (in the data files as currently, and in the separate metadata file). If readers were able to use metadata from a separate file (not sure if the parquet format would allow it), duplicating metadata storage could be avoided in both approaches.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818370#comment-16818370 ] Martin Durant commented on ARROW-1983: --
> Note that the Parquet format has three different metadata structures

No, this is incorrect; unfortunately the term "metadata" is used with multiple meanings here.
- All parquet files contain FileMetaData in the file footer, which may include one or more key-value pairs, and includes other important things like the schema.
- If the file contains any row-groups, or references to row-groups in other files, it will also contain ColumnMetaData (and possibly more key-value pairs); this is all *within* the FileMetaData structure.
- The special file `_metadata` may exist, which contains *only* FileMetaData; its row-groups hold only links to other files, with no data within the file itself.
- The special file `_common_metadata` may exist, which also contains only a FileMetaData structure, but has no row-group components at all.
- Ordinary data files should have the same common metadata (schema, key-values), so you can load any one of them, but they contain only the row-groups of that one file and no links to any others.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818343#comment-16818343 ] Pearu Peterson commented on ARROW-1983: ---
Note that the Parquet format has three different metadata structures, see [https://github.com/apache/parquet-format#metadata]. The "_metadata" corresponds to `FileMetaData.key_value_metadata` (in the parquet-format specification), while the "statistics" (which are what interest Dask, if I understand correctly) correspond to `ColumnMetadata.key_value_metadata`. Yes, Arrow can read all this information and more. My basic questions are:
# What information needs to be collected? Some information is internal to parquet files and would never be needed; collecting it would just waste space, especially when datasets become huge (as would be expected in Dask applications).
# Where should this information be gathered for easy and efficient access?
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818334#comment-16818334 ] Martin Durant commented on ARROW-1983: --
A convention, yes, but not in the parquet standard as such. I believe Spark might have started this, although producing the files is now optional. You typically have `_metadata`, containing the schema, references to all the row-groups, and information about them such as column statistics, and `_common_metadata`, which contains only the schema (and is therefore much smaller).
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16818319#comment-16818319 ] Matthew Rocklin commented on ARROW-1983:
My understanding is that there is already a standard around using a "_metadata" file, which presumably is expected to have certain data laid out in a certain way. [~mdurant] may be able to provide a nice reference for those expectations. It also looks like PyArrow has a nice reader for this information: if I open up a Parquet dataset that has a `_metadata` file, the resulting object has all of the right information, so that might also be a good place to look.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16817308#comment-16817308 ] Pearu Peterson commented on ARROW-1983: ---
Currently, ParquetDataset metadata has the following approximate data structure (type-specs are shown only for the relevant attributes):
{noformat}
ParquetDataset:
  list pieces
  list paths
  fs
  common_metadata, common_metadata_path
  metadata, metadata_path
ParquetDatasetPiece:
  str path
  get_metadata() -> FileMetaData
  partition_keys
FileMetaData:
  list row_groups
  ParquetSchema schema
  dict metadata = {b'pandas': }
  int num_rows, num_columns
  str format_version, created_by
RowGroupMetaData:
  list columns
  int num_rows, total_byte_size
ColumnChunkMetaData:
  str physical_type, encodings, path_in_schema, compression
  int num_values, total_uncompressed_size, total_compressed_size,
      data_page_offset, index_page_offset, dictionary_page_offset
  RowGroupStatistics statistics
RowGroupStatistics:
  bool has_min_max
  int min, max, null_count, distinct_count, num_values
  str physical_type
{noformat}
If only the data in RowGroupStatistics is relevant for this issue (please confirm), then the statistics data could be collected into a single Parquet file, say `_statistics`, containing the following columns:
{noformat}
 , , , 
{noformat}
[~mrocklin], would the information in `_statistics` be sufficient for Dask's needs?
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792681#comment-16792681 ] Wes McKinney commented on ARROW-1983: -
This will have to get done for 0.14. We're basically out of time for 0.13 and are only fixing bugs now.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16787367#comment-16787367 ] Wes McKinney commented on ARROW-1983: -
No timeline. I'm not sure who is going to do the work; I will not be able to in time for 0.13.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782163#comment-16782163 ] Matthew Rocklin commented on ARROW-1983:
Hi all, thought I would check in here. I'll likely start planning work around Dask's Parquet reader/writer functionality soon, and am curious whether there is any timeline on this issue. "Nope" is a totally fine answer; I'm just looking for information for planning purposes.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16751500#comment-16751500 ] Matthew Rocklin commented on ARROW-1983:
In https://github.com/dask/dask/issues/4410 we learned that metadata information can grow large when there are many columns and many partitions. There is some value in ensuring that the metadata results are somewhat compact in memory, though I also wouldn't spend a ton of effort optimizing here.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731472#comment-16731472 ] Wes McKinney commented on ARROW-1983: -
Velocity on these things should pick up in 2019 since the Ursa Labs team is growing. The "Arrow Dataset" project extends beyond Parquet (where Parquet is one storage format among several); ideally that work will happen in Q1 2019. Handling the "_metadata" file is lower-hanging fruit, so it can likely get done a lot sooner.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731467#comment-16731467 ] Matthew Rocklin commented on ARROW-1983:
> I'm planning to move more of the multifile dataset handling into C++ because we also need it in Ruby and R, so would make sense to maintain one implementation for the 3 languages

Makes sense to me. No pressure, but is there a particular time when you're planning to do this? That would help me with planning on the Dask side. I'm also happy to help with things on the Python Arrow side near-term if they come up. For context see https://github.com/dask/dask/pull/4336#issuecomment-450686100
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731464#comment-16731464 ] Wes McKinney commented on ARROW-1983: -
Mechanically this isn't a huge change. On the C++ side we would expose an API to append row group metadata into a common file, which could then be used from the Python side. I'm planning to move more of the multi-file dataset handling into C++, because we also need it in Ruby and R, so it would make sense to maintain one implementation for the three languages.
[jira] [Commented] (ARROW-1983) [Python] Add ability to write parquet `_metadata` file
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730844#comment-16730844 ] Matthew Rocklin commented on ARROW-1983:
> If I understand correctly, we need to combine all of the row group metadata for all files in a directory.

Yes. Ideally, when writing a row group we would get back some metadata object in memory; we would then collect all of those objects and hand them to some `write_metadata` function afterwards.

> When a new file is written, does this file have to be updated?

Yes, or it can be removed/invalidated.

As a side note, this is probably one of a small number of issues that stop Dask DataFrame from using PyArrow by default. Metadata files with full row-group information are especially valuable for us, particularly with remote/cloud storage. (I'm going through Dask's parquet handling now.)
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16687186#comment-16687186 ] Wes McKinney commented on ARROW-1983:

If I understand correctly, we need to combine all of the row group metadata for all files in a directory. When a new file is written, does this file have to be updated?
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16633762#comment-16633762 ] Wes McKinney commented on ARROW-1983:

More work is needed here, it seems. Moving to 0.12.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541803#comment-16541803 ] Robert Gruener commented on ARROW-1983:

[~xhochy] I created the dependent task PARQUET-1348 for this.
[ https://issues.apache.org/jira/browse/ARROW-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16540231#comment-16540231 ] Robert Gruener commented on ARROW-1983:

This looks like it would need changes in parquet-cpp, as the [arrow writer only takes a Schema|https://github.com/apache/parquet-cpp/blob/master/src/parquet/arrow/writer.h#L116] rather than the FileMetaData object, which contains the row group information.