[
https://issues.apache.org/jira/browse/PARQUET-209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Reza Shiftehfar updated PARQUET-209:
------------------------------------
Description:
While using ParquetWriter, there is no way to check or estimate the size of
the output file before closing the writer to write the content out to disk.
This is useful when we want to close files and upload them based on a minimum
size threshold. Since ParquetWriter keeps everything in memory and only
writes it out to disk at the very end, when the writer is closed, it is not
possible to estimate the output file size before closing the writer.
Based on the Parquet documentation, the data is written into the in-memory
object in its final format, meaning that the size of the object in memory is
very close to the final size on disk. It would be great if you could expose
the current size of the ParquetWriter object in memory. Such a size will
differ from the final output size because the schema and other metadata are
appended at the end of the file, but it still gives a close estimate of the
output file size, which would be very useful when reading/writing streams.
was:
While using ParquetWriter, there is no way to check or estimate the size of
the output file before closing the writer to write the content out to disk.
This is useful when we want to close files and upload them based on a size
threshold.
Since ParquetWriter keeps everything in memory and only writes it out to disk
at the end, when the writer is closed, it is not possible to estimate the
file size before closing the writer.
Based on the Parquet documentation, the data is written into the in-memory
object in its final format, meaning that the size of the object in memory is
the same as the final size on disk. It would be great if you could expose the
current size.
Such a size will differ from the final output size because the schema and
other metadata are appended at the end of the file, but it still gives a
close estimate of the output file size, which would be very useful when
reading/writing streams.
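The request above amounts to a size-based file-roll policy: check the
writer's buffered size after each record and close/rotate once a threshold is
reached. A minimal sketch of that usage, assuming the accessor this issue
asks for exists; `SizeTrackingWriter`, `getDataSize()`, and the threshold
value here are hypothetical stand-ins, not the real ParquetWriter API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a ParquetWriter that exposes its
// in-memory size, as requested in this issue.
class SizeTrackingWriter {
    private long bytesBuffered = 0;

    void write(byte[] record) {
        // A real writer encodes records into in-memory column chunks;
        // here we only track the raw record size.
        bytesBuffered += record.length;
    }

    // The requested accessor: an estimate of the eventual file size,
    // excluding the schema/footer metadata written on close().
    long getDataSize() {
        return bytesBuffered;
    }

    void close() {
        bytesBuffered = 0; // a real writer would flush to disk here
    }
}

public class RollBySize {
    static final long THRESHOLD = 1024; // roll files at ~1 KiB (example value)

    // Writes all records, rolling to a new writer whenever the
    // in-memory size crosses THRESHOLD; returns the file count.
    public static int writeAll(List<byte[]> records) {
        int files = 0;
        SizeTrackingWriter writer = new SizeTrackingWriter();
        for (byte[] r : records) {
            writer.write(r);
            if (writer.getDataSize() >= THRESHOLD) {
                writer.close(); // close and upload here
                files++;
                writer = new SizeTrackingWriter();
            }
        }
        if (writer.getDataSize() > 0) {
            writer.close(); // flush the final partial file
            files++;
        }
        return files;
    }

    public static void main(String[] args) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) records.add(new byte[300]);
        // 10 * 300 = 3000 bytes at a 1024-byte threshold -> 3 files
        System.out.println(writeAll(records));
    }
}
```

Without such an accessor, the only alternatives are counting records and
guessing, or closing the file and checking its size after the fact, both of
which are less accurate than reading the writer's own buffered size.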
> Enhance ParquetWriter with exposing in-memory size of writer object
> -------------------------------------------------------------------
>
> Key: PARQUET-209
> URL: https://issues.apache.org/jira/browse/PARQUET-209
> Project: Parquet
> Issue Type: Wish
> Reporter: Reza Shiftehfar
> Labels: parquetWriter
>
> While using ParquetWriter, there is no way to check or estimate the size of
> the output file before closing the writer to write the content out to disk.
> This is useful when we want to close files and upload them based on a
> minimum size threshold. Since ParquetWriter keeps everything in memory and
> only writes it out to disk at the very end, when the writer is closed, it
> is not possible to estimate the output file size before closing the writer.
> Based on the Parquet documentation, the data is written into the in-memory
> object in its final format, meaning that the size of the object in memory
> is very close to the final size on disk. It would be great if you could
> expose the current size of the ParquetWriter object in memory. Such a size
> will differ from the final output size because the schema and other
> metadata are appended at the end of the file, but it still gives a close
> estimate of the output file size, which would be very useful when
> reading/writing streams.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)