[ 
https://issues.apache.org/jira/browse/ARROW-11465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neal Richardson updated ARROW-11465:
------------------------------------
    Summary: [C++] Parquet file writer snapshot API and proper 
ColumnChunk.file_path utilization  (was: Parquet file writer snapshot API and 
proper ColumnChunk.file_path utilization)

> [C++] Parquet file writer snapshot API and proper ColumnChunk.file_path 
> utilization
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-11465
>                 URL: https://issues.apache.org/jira/browse/ARROW-11465
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 3.0.0
>            Reporter: Radu Teodorescu
>            Assignee: Radu Teodorescu
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This is a follow-up to the thread:
> [https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3ccdd00783-0ffc-4934-aa24-529fb2a44...@yahoo.com%3e]
> The specific use case I am targeting is the ability to partially read a 
> Parquet file while it is still being written to.
> This is relevant for any process that records events over a long period 
> of time and writes them to Parquet (tracing data, logging events, or any 
> other live time series).
> The solution relies on the fact that the Parquet specification allows 
> column chunk metadata to point explicitly to the chunk's location in a 
> file that can, in principle, be different from the file containing the 
> metadata (as covered in other threads, this behavior is not fully 
> supported by the major Parquet implementations).
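> For reference, the current C++ metadata API already exposes this field; 
> here is a minimal sketch that prints each chunk's {{file_path}} (the 
> helper name is mine; an empty string means the chunk is stored in the 
> same file as the metadata, which is the common case today):
> {code:cpp}
> #include <iostream>
> #include <parquet/file_reader.h>
> #include <parquet/metadata.h>
>
> void PrintChunkPaths(const std::string& path) {
>   std::unique_ptr<parquet::ParquetFileReader> reader =
>       parquet::ParquetFileReader::OpenFile(path);
>   std::shared_ptr<parquet::FileMetaData> md = reader->metadata();
>   for (int rg = 0; rg < md->num_row_groups(); ++rg) {
>     for (int col = 0; col < md->num_columns(); ++col) {
>       // file_path is per column chunk, i.e. per (row group, column).
>       std::cout << md->RowGroup(rg)->ColumnChunk(col)->file_path() << "\n";
>     }
>   }
> }
> {code}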
> My solution is centered around adding a method,
> {{void ParquetFileWriter::Snapshot(const std::string& data_path, std::shared_ptr<::arrow::io::OutputStream>& sink)}},
> that writes the metadata for the RowGroups written so far to the 
> {{sink}} stream and updates every ColumnChunk's {{file_path}} metadata to 
> point to {{data_path}}. This was intended as a minimalist change to 
> {{ParquetFileWriter}}.
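> A minimal usage sketch, assuming the proposed {{Snapshot}} signature 
> above; the snapshot file name, the data file name, and the surrounding 
> helper are hypothetical:
> {code:cpp}
> #include <arrow/io/file.h>
> #include <parquet/exception.h>
> #include <parquet/file_writer.h>
>
> // Hypothetical helper: the writer keeps appending RowGroups to
> // data.parquet; each call publishes a consistent footer that readers
> // can open while writing continues.
> void TakeSnapshot(parquet::ParquetFileWriter* writer) {
>   std::shared_ptr<::arrow::io::OutputStream> sink;
>   PARQUET_ASSIGN_OR_THROW(
>       sink, ::arrow::io::FileOutputStream::Open("snapshot-0001.parquet"));
>   // Writes the metadata for the RowGroups written so far to sink and
>   // rewrites every ColumnChunk.file_path to point at data.parquet.
>   writer->Snapshot("data.parquet", sink);
>   PARQUET_THROW_NOT_OK(sink->Close());
> }
> {code}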
> On the reading side I implemented full support for ColumnChunk.file_path by 
> introducing {{ArrowMultiInputFile}} as an alternative to {{ArrowInputFile}} 
> in the {{ParquetFileReader}} implementation stack. In the PR implementation 
> one can default to the current behavior by using the {{SingleFile}} class, 
> get full read support for multi-file Parquet in line with the specification 
> by using the {{MultiReadableFile}} implementation (which captures the 
> metadata file's base directory and resolves ColumnChunk.file_path against 
> it), or provide a separate implementation for non-POSIX file system 
> storage.
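> A sketch of the reading side under this proposal; how the reader is 
> wired up with {{MultiReadableFile}} is defined in the PR and only 
> hinted at in the comments here:
> {code:cpp}
> #include <parquet/file_reader.h>
>
> void ReadFirstRowGroup(const std::string& snapshot_path) {
>   // Opens the snapshot footer. With the stock single-file behavior,
>   // ColumnChunk.file_path entries pointing to another file are not
>   // followed; with MultiReadableFile they are resolved against the
>   // footer's base directory.
>   std::unique_ptr<parquet::ParquetFileReader> reader =
>       parquet::ParquetFileReader::OpenFile(snapshot_path);
>   std::shared_ptr<parquet::RowGroupReader> rg = reader->RowGroup(0);
>   std::shared_ptr<parquet::ColumnReader> col = rg->Column(0);
>   // ... decode values with the appropriate typed column reader ...
> }
> {code}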
> For an example, see the {{write_parquet_file_with_snapshot}} function in 
> reader-writer.cc, which illustrates the snapshotting write, while the 
> {{read_whole_file}} function has been modified to read one of the 
> snapshots (I will roll back that change and provide a separate example 
> before the merge).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
