[ 
https://issues.apache.org/jira/browse/PARQUET-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reza Shiftehfar updated PARQUET-209:
------------------------------------
    Description: 
While using ParquetWriter, there is no way to check or estimate the size of the 
output file before closing the writer and flushing the content to disk. Such an 
estimate is useful when we want to close files and upload them once they reach 
a minimum size threshold. Since ParquetWriter keeps everything in memory and 
only writes it out to disk at the very end, when the writer is closed, it is 
not possible to estimate the output file size beforehand.

According to the Parquet documentation, the data is written into an in-memory 
object in its final format, meaning that the size of the object in memory is 
very close to the final size on disk. It would be great if the current 
in-memory size of the ParquetWriter object could be exposed. Such a size will 
differ from the final output size, because the schema and other metadata are 
appended at the end of the file, but it still gives a close estimate of the 
output file size, which is very useful when reading and writing streams.
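The usage pattern this wish enables can be sketched as follows. This is a 
minimal illustration in plain Java, not the Parquet API: SizeAwareWriter, 
getInMemorySize(), and the rolling logic are hypothetical stand-ins for 
ParquetWriter and the requested size accessor.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical writer that buffers records in memory and exposes the current
// buffered size, so a caller can close and roll files once a minimum size
// threshold is reached. Not part of the Parquet API.
class SizeAwareWriter {
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    void write(String record) throws IOException {
        buffer.write(record.getBytes(StandardCharsets.UTF_8));
    }

    // The estimate the issue asks for: bytes currently held in memory.
    // The real on-disk size would also include the footer metadata (schema
    // etc.), so this is a close lower bound, not the exact final size.
    long getInMemorySize() {
        return buffer.size();
    }

    byte[] close() {
        return buffer.toByteArray(); // stands in for flushing to disk
    }
}

public class RollingExample {
    static final long MIN_FILE_SIZE = 16; // threshold in bytes (tiny, for demo)

    public static void main(String[] args) throws IOException {
        SizeAwareWriter writer = new SizeAwareWriter();
        int filesClosed = 0;
        for (String record : new String[] {"aaaa", "bbbb", "cccc", "dddd", "eeee"}) {
            writer.write(record);
            // Close and "upload" once the buffered size crosses the threshold.
            if (writer.getInMemorySize() >= MIN_FILE_SIZE) {
                writer.close();
                writer = new SizeAwareWriter();
                filesClosed++;
            }
        }
        System.out.println("files closed: " + filesClosed);
    }
}
```

With a real size accessor on ParquetWriter, the rolling check would work the 
same way, modulo the footer metadata added at close time.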

  was:
While using ParquetWriter, there is no way to check or estimate the size of the 
output file before closing the writer and flushing the content to disk. This is 
useful when we want to close files and upload them based on a size threshold. 
Since ParquetWriter keeps everything in memory and only writes it out to disk 
at the end, when the writer is closed, it is not possible to estimate the file 
size before closing the writer.

According to the Parquet documentation, the data is written into an in-memory 
object in its final format, meaning that the size of the object in memory is 
the same as the final size on disk. It would be great if the current size could 
be exposed. 

It is true that such a size will differ from the final output size, because 
the schema and other metadata are appended at the end of the file, but it 
still gives a close estimate of the output file size, which will be very 
useful when reading and writing streams.


> Enhance ParquetWriter with exposing in-memory size of writer object
> -------------------------------------------------------------------
>
>                 Key: PARQUET-209
>                 URL: https://issues.apache.org/jira/browse/PARQUET-209
>             Project: Parquet
>          Issue Type: Wish
>            Reporter: Reza Shiftehfar
>              Labels: parquetWriter
>
> While using ParquetWriter, there is no way to check or estimate the size of 
> the output file before closing the writer and flushing the content to disk. 
> Such an estimate is useful when we want to close files and upload them once 
> they reach a minimum size threshold. Since ParquetWriter keeps everything in 
> memory and only writes it out to disk at the very end, when the writer is 
> closed, it is not possible to estimate the output file size beforehand.
> According to the Parquet documentation, the data is written into an in-memory 
> object in its final format, meaning that the size of the object in memory is 
> very close to the final size on disk. It would be great if the current 
> in-memory size of the ParquetWriter object could be exposed. Such a size will 
> differ from the final output size, because the schema and other metadata are 
> appended at the end of the file, but it still gives a close estimate of the 
> output file size, which is very useful when reading and writing streams.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
