[ https://issues.apache.org/jira/browse/ARROW-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neal Richardson updated ARROW-11465:
------------------------------------
    Summary: [C++] Parquet file writer snapshot API and proper ColumnChunk.file_path utilization  (was: Parquet file writer snapshot API and proper ColumnChunk.file_path utilization)

> [C++] Parquet file writer snapshot API and proper ColumnChunk.file_path
> utilization
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-11465
>                 URL: https://issues.apache.org/jira/browse/ARROW-11465
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 3.0.0
>            Reporter: Radu Teodorescu
>            Assignee: Radu Teodorescu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to the thread:
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccdd00783-0ffc-4934-aa24-529fb2a44...@yahoo.com%3e]
>
> The specific use case I am targeting is the ability to partially read a
> parquet file while it is still being written to. This is relevant for any
> process that records events over a long period of time and writes them to
> parquet (tracing data, logging events, or any other live time series).
>
> The solution relies on the fact that the parquet specification allows
> column chunk metadata to point explicitly to its location in a file, which
> can theoretically be different from the file containing the metadata (as
> covered in other threads, this behavior is not fully supported by major
> parquet implementations).
>
> My solution is centered around adding a method,
>
> {{void ParquetFileWriter::Snapshot(const std::string& data_path,
>                                    std::shared_ptr<::arrow::io::OutputStream>& sink)}}
>
> that writes the metadata for the RowGroups written so far to the {{sink}}
> stream and updates all ColumnChunk metadata {{file_path}} fields to point
> to {{data_path}}.
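A minimal sketch of how the proposed writer-side workflow could look, assuming the `Snapshot()` signature described above (the `Snapshot()` method is the proposal in this issue, not a released Arrow API; the helper name `WriteWithSnapshots` and the file names are hypothetical):

```cpp
// Sketch only: Snapshot() is the API proposed in this issue and does not
// exist in released Arrow/Parquet C++; file names are illustrative.
#include <memory>
#include <string>

#include <arrow/io/file.h>
#include <parquet/file_writer.h>

// Appends row groups to "data.parquet" and, after each row group, emits a
// metadata-only snapshot file whose ColumnChunk.file_path entries point back
// at "data.parquet", so a reader can consume a consistent prefix of the data
// while the writer keeps appending.
void WriteWithSnapshots(std::shared_ptr<parquet::schema::GroupNode> schema) {
  auto data_sink =
      arrow::io::FileOutputStream::Open("data.parquet").ValueOrDie();
  auto writer = parquet::ParquetFileWriter::Open(data_sink, schema);

  for (int i = 0; i < 3; ++i) {
    parquet::RowGroupWriter* rg = writer->AppendRowGroup();
    // ... write column data for this row group via rg ... (omitted)

    // After the row group is complete, snapshot the metadata accumulated so
    // far into a small standalone footer file.
    auto meta_sink = arrow::io::FileOutputStream::Open(
                         "snapshot_" + std::to_string(i) + ".parquet")
                         .ValueOrDie();
    writer->Snapshot("data.parquet", meta_sink);  // proposed API
  }
  writer->Close();
}
```

Each `snapshot_N.parquet` would then be independently readable (via the multi-file reader support described below), while `data.parquet` remains open for further row groups.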
> This was intended as a minimalist change to {{ParquetFileWriter}}.
>
> On the reading side I implemented full support for ColumnChunk.file_path
> by introducing {{ArrowMultiInputFile}} as an alternative to
> {{ArrowInputFile}} in the {{ParquetFileReader}} implementation stack. In
> the PR implementation one can default to the current behavior by using the
> {{SingleFile}} class, get full read support for multi-file parquet in line
> with the specification by using the {{MultiReadableFile}} implementation
> (which captures the metafile's base directory and resolves
> ColumnChunk.file_path relative to it), or provide a separate
> implementation for non-POSIX file system storage.
>
> For an example, see the {{write_parquet_file_with_snapshot}} function in
> reader-writer.cc, which illustrates the snapshotting write; the
> {{read_whole_file}} function has been modified to read one of the
> snapshots (I will roll back that change and provide a separate example
> before the merge).

--
This message was sent by Atlassian Jira
(v8.3.4#803005)