[ https://issues.apache.org/jira/browse/PARQUET-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17171442#comment-17171442 ]

Gabor Szadovszky commented on PARQUET-1559:
-------------------------------------------

[~wxmimperio], as I've said, Parquet needs to keep all of a row group's data in 
memory before writing the whole row group at once. You cannot do it any other 
way because of the structure of the Parquet file. You can configure parquet-mr 
to write smaller row groups, but that can hurt read performance. The other 
option is to use another file format, such as Avro, where you can even flush 
record by record if needed. 
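For reference, a minimal sketch of the "smaller row groups" option mentioned 
above, assuming parquet-avro 1.10.x; the file path, Avro schema, record count 
and the 8 MB size are placeholders, not recommendations. Smaller row groups are 
flushed to the output file sooner and buffer less data in memory, but the file 
is still only readable once the footer is written on close, and many small row 
groups hurt read performance.

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class SmallRowGroupExample {
  public static void main(String[] args) throws Exception {
    // Placeholder schema for illustration only.
    Schema schema = SchemaBuilder.record("Event").fields()
        .requiredLong("id").requiredString("payload").endRecord();

    // 8 MB row groups instead of the 128 MB default: each row group is
    // written out sooner and less data is kept in memory at once.
    try (ParquetWriter<GenericRecord> writer =
             AvroParquetWriter.<GenericRecord>builder(new Path("/tmp/events.parquet"))
                 .withSchema(schema)
                 .withRowGroupSize(8 * 1024 * 1024)
                 .withCompressionCodec(CompressionCodecName.SNAPPY)
                 .build()) {
      for (long i = 0; i < 1_000_000; i++) {
        GenericRecord record = new GenericData.Record(schema);
        record.put("id", i);
        record.put("payload", "value-" + i);
        writer.write(record);
      }
    } // the footer is only written here, when the writer is closed
  }
}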

> Add way to manually commit already written data to disk
> -------------------------------------------------------
>
>                 Key: PARQUET-1559
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1559
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Victor
>            Priority: Major
>
> I'm not exactly sure this is compatible with the way Parquet works, but I have 
> the following need:
>  * I'm using parquet-avro to write to a Parquet file during a long-running 
> process
>  * I would like to be able to access the already written data from time to 
> time
> So I was expecting to be able to flush the file manually to ensure the data 
> is on disk, and then copy the file for preliminary analysis.
> If this contradicts the way Parquet works (for example, there is something 
> about metadata being at the footer of the file), what would the alternative 
> be?
> Closing the file and opening a new one to continue writing?
> Could this maybe be supported directly by parquet-mr? It would then write 
> multiple files in that case.
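A minimal sketch of the close-and-reopen alternative asked about in the 
description, assuming parquet-avro; the RollingParquetWriter class, the 
newWriter helper, the part-file naming, and the 100,000-record threshold are 
made up for illustration and are not part of parquet-mr:

import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

public class RollingParquetWriter {
  private static final long RECORDS_PER_FILE = 100_000;  // illustrative threshold

  // Hypothetical helper: one part file per roll-over.
  private static ParquetWriter<GenericRecord> newWriter(String dir, int part, Schema schema)
      throws IOException {
    return AvroParquetWriter.<GenericRecord>builder(new Path(dir, "part-" + part + ".parquet"))
        .withSchema(schema)
        .build();
  }

  public static void writeAll(Iterable<GenericRecord> records, String dir, Schema schema)
      throws IOException {
    int part = 0;
    long written = 0;
    ParquetWriter<GenericRecord> writer = newWriter(dir, part, schema);
    try {
      for (GenericRecord record : records) {
        writer.write(record);
        if (++written >= RECORDS_PER_FILE) {
          // Closing writes the footer, so part-N.parquet becomes a complete,
          // readable Parquet file; writing then continues in a new file.
          writer.close();
          writer = newWriter(dir, ++part, schema);
          written = 0;
        }
      }
    } finally {
      writer.close();
    }
  }
}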


