[ https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818783#comment-16818783 ]
Gabor Szadovszky commented on PARQUET-1559:
-------------------------------------------

A simple flush would not make the Parquet file readable. The footer needs to be written and the file closed before the data is available for reading. You may close the current file by closing the ParquetWriter at any time and then open a new one. It is worth mentioning, though, that the Parquet format can only benefit from its statistics and encodings if you write a large amount of data into one file (at least one HDFS block). If you cannot write more data into one file, I would suggest using Avro instead. Avro can be used the way you've described. Then, you might do a cleanup from time to time by re-reading the data from the small Avro files and writing it into large Parquet files. (Sketches of both approaches follow the quoted description below.)

> Add way to manually commit already written data to disk
> -------------------------------------------------------
>
>                 Key: PARQUET-1559
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1559
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Victor
>            Priority: Major
>
> I'm not exactly sure this is compliant with the way Parquet works, but I have the following need:
> * I'm using parquet-avro to write to a Parquet file during a long-running process
> * I would like to be able, from time to time, to access the already written data
>
> So I was expecting to be able to manually flush the file to ensure the data is on disk, and then copy the file for preliminary analysis.
> If this is contrary to the way Parquet works (for example, because the metadata sits in the footer of the file), what would the alternative be?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr, maybe? It would then write multiple files in that case.
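A minimal sketch of the roll-over approach described in the comment, using the parquet-avro builder API as it exists in the 1.10.x line. The RollingParquetWriter class, the part-NNNNN file naming, and the Snappy codec choice are illustrative assumptions, not anything prescribed by parquet-mr:

{code:java}
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

/** Illustrative roll-over writer: roll() closes the current file (which
 *  writes its footer and makes it readable) and continues in a new one. */
public class RollingParquetWriter implements AutoCloseable {
  private final String directory; // assumed output directory
  private final Schema schema;
  private ParquetWriter<GenericRecord> writer;
  private int fileIndex = 0;

  public RollingParquetWriter(String directory, Schema schema) throws IOException {
    this.directory = directory;
    this.schema = schema;
    this.writer = open();
  }

  private ParquetWriter<GenericRecord> open() throws IOException {
    // Hypothetical naming scheme: part-00000.parquet, part-00001.parquet, ...
    Path path = new Path(String.format("%s/part-%05d.parquet", directory, fileIndex++));
    return AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
  }

  public void write(GenericRecord record) throws IOException {
    writer.write(record);
  }

  /** Finalize the current file and start the next one. */
  public void roll() throws IOException {
    writer.close(); // the footer is written here; the file is now readable
    writer = open();
  }

  @Override
  public void close() throws IOException {
    writer.close();
  }
}
{code}

Each call to roll() acts as the "commit" asked for in the description: the footer is written on close(), so the finished file is immediately readable while writing continues in the next one.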
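And a sketch of the Avro alternative: append records to a small Avro buffer file (Avro stores no footer, so flushed data is readable without closing the file), then compact the accumulated small files into one large Parquet file from time to time. The class name and the openBuffer()/compact() helpers are hypothetical, introduced only for illustration:

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroBufferCompaction {

  /** Open a small Avro buffer file. After appending records, calling
   *  DataFileWriter.flush() puts the data on disk in a readable state. */
  public static DataFileWriter<GenericRecord> openBuffer(File file, Schema schema)
      throws IOException {
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    return writer.create(schema, file);
  }

  /** Cleanup step: re-read the small Avro files and write their records
   *  into one large Parquet file. */
  public static void compact(File[] avroFiles, Path parquetFile, Schema schema)
      throws IOException {
    try (ParquetWriter<GenericRecord> out = AvroParquetWriter
        .<GenericRecord>builder(parquetFile)
        .withSchema(schema)
        .build()) {
      for (File avroFile : avroFiles) {
        try (DataFileReader<GenericRecord> in = new DataFileReader<>(
            avroFile, new GenericDatumReader<GenericRecord>(schema))) {
          for (GenericRecord record : in) {
            out.write(record);
          }
        }
      }
    }
  }
}
{code}

The trade-off is the one the comment names: the Avro buffers are readable at any time, while the compacted Parquet file is large enough for its statistics and encodings to pay off.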