[ https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16818783#comment-16818783 ]
Gabor Szadovszky commented on PARQUET-1559:
-------------------------------------------

A simple flush would not make the Parquet file readable. The footer needs to be written and the file closed before the data is available for reading. You may close the current file by closing the ParquetWriter at any time and then open a new one. It is worth mentioning, though, that the Parquet format can only benefit from its statistics and encodings if you write a large amount of data into one file (at least one HDFS block). If you cannot write more data into one file, I would suggest using Avro instead. Avro can be used the way you've described. Then, you might do a cleanup from time to time by re-reading the data from the small Avro files and writing it into large Parquet files. (Sketches of both approaches follow the quoted description below.)

> Add way to manually commit already written data to disk
> -------------------------------------------------------
>
>                 Key: PARQUET-1559
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1559
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Victor
>            Priority: Major
>
> I'm not exactly sure this is compliant with the way Parquet works, but I have the following need:
> * I'm using parquet-avro to write to a Parquet file during a long-running process
> * I would like to be able, from time to time, to access the already written data
>
> So I was expecting to be able to manually flush the file to ensure the data is on disk, and then copy the file for preliminary analysis.
> If this is contrary to the way Parquet works (for example, because the metadata sits in the footer of the file), what would the alternative be?
> Closing the file and opening a new one to continue writing?
> Could this be supported directly by parquet-mr, maybe? It would then write multiple files in that case.
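A minimal sketch of the roll-over approach described in the comment, using the parquet-avro builder API as it exists in the 1.10.x line. The RollingParquetWriter class, the part-NNNNN file naming, and the Snappy codec choice are illustrative assumptions, not anything prescribed by parquet-mr:

{code:java}
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

/** Illustrative roll-over writer: roll() closes the current file (which
 *  writes its footer and makes it readable) and continues in a new one. */
public class RollingParquetWriter implements AutoCloseable {
  private final String directory; // assumed output directory
  private final Schema schema;
  private ParquetWriter<GenericRecord> writer;
  private int fileIndex = 0;

  public RollingParquetWriter(String directory, Schema schema) throws IOException {
    this.directory = directory;
    this.schema = schema;
    this.writer = open();
  }

  private ParquetWriter<GenericRecord> open() throws IOException {
    // Hypothetical naming scheme: part-00000.parquet, part-00001.parquet, ...
    Path path = new Path(String.format("%s/part-%05d.parquet", directory, fileIndex++));
    return AvroParquetWriter.<GenericRecord>builder(path)
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build();
  }

  public void write(GenericRecord record) throws IOException {
    writer.write(record);
  }

  /** Finalize the current file and start the next one. */
  public void roll() throws IOException {
    writer.close(); // the footer is written here; the file is now readable
    writer = open();
  }

  @Override
  public void close() throws IOException {
    writer.close();
  }
}
{code}

Each call to roll() acts as the "commit" asked for in the description: the footer is written on close(), so the finished file is immediately readable while writing continues in the next one.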
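And a sketch of the Avro alternative: append records to a small Avro buffer file (Avro stores no footer, so flushed data is readable without closing the file), then compact the accumulated small files into one large Parquet file from time to time. The class name and the openBuffer()/compact() helpers are hypothetical, introduced only for illustration:

{code:java}
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class AvroBufferCompaction {

  /** Open a small Avro buffer file. After appending records, calling
   *  DataFileWriter.flush() puts the data on disk in a readable state. */
  public static DataFileWriter<GenericRecord> openBuffer(File file, Schema schema)
      throws IOException {
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
    return writer.create(schema, file);
  }

  /** Cleanup step: re-read the small Avro files and write their records
   *  into one large Parquet file. */
  public static void compact(File[] avroFiles, Path parquetFile, Schema schema)
      throws IOException {
    try (ParquetWriter<GenericRecord> out = AvroParquetWriter
        .<GenericRecord>builder(parquetFile)
        .withSchema(schema)
        .build()) {
      for (File avroFile : avroFiles) {
        try (DataFileReader<GenericRecord> in = new DataFileReader<>(
            avroFile, new GenericDatumReader<GenericRecord>(schema))) {
          for (GenericRecord record : in) {
            out.write(record);
          }
        }
      }
    }
  }
}
{code}

The trade-off is the one the comment names: the Avro buffers are readable at any time, while the compacted Parquet file is large enough for its statistics and encodings to pay off.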