[
https://issues.apache.org/jira/browse/PARQUET-156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Manish Agarwal updated PARQUET-156:
-----------------------------------
Summary: parquet document recommendations for block size and page size
given an expected number of writers (was: parquet should have automatic memory
management to avoid out of memory error. )
> parquet document recommendations for block size and page size given an
> expected number of writers
> -------------------------------------------------------------------------------------------------
>
> Key: PARQUET-156
> URL: https://issues.apache.org/jira/browse/PARQUET-156
> Project: Parquet
> Issue Type: Improvement
> Reporter: Manish Agarwal
>
> I sent a mail to the dev list, but I seem to have a problem with email there,
> so I am opening an issue here.
> I am on a multithreaded system with M threads, each thread creating an
> independent Parquet writer and writing to HDFS in its own independent file.
> I have a finite amount of RAM, say R.
> When I created the Parquet writers with the default block and page sizes, I
> got heap errors (out of memory) on my setup. So I reduced the block size and
> page size to very low values, and my system stopped giving these
> out-of-memory errors and started writing the files correctly. I am able to
> read these files correctly as well.
> I should not have to lower these values manually; Parquet should
> automatically make sure I do not hit these errors.
> But in case I do have to manage memory myself, my question is as follows.
> Keeping these values very low is not a recommended practice, as I would lose
> performance. I am particularly concerned about write performance.
> What formula do you recommend for choosing the blockSize and pageSize passed
> to the Parquet constructor to get the right write performance? That is, how
> can I decide the right blockSize and pageSize for a Parquet writer, given
> that I have M threads and the total RAM available is R? I don't understand
> the purpose of dictionaryPageSize; if I need to consider it as well, please
> let me know, but I have set the enableDictionary flag to false.
> I am using the constructor below:
> public ParquetWriter(
>     Path file,
>     WriteSupport<T> writeSupport,
>     CompressionCodecName compressionCodecName,
>     int blockSize,
>     int pageSize,
>     int dictionaryPageSize,
>     boolean enableDictionary,
>     boolean validating,
>     WriterVersion writerVersion,
>     Configuration conf) throws IOException {
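As a rough back-of-the-envelope sketch of the sizing question above: each open Parquet writer buffers roughly one row group (blockSize bytes) of data in memory before flushing, so M concurrent writers need on the order of M * blockSize of heap. The helper below is a hypothetical illustration of that rule of thumb, not a Parquet API; the 0.5 safety factor (leaving half the heap for everything else) is an assumption, not a documented recommendation.

```java
// Hypothetical sizing helper (not part of the Parquet library).
// Rule of thumb assumed here: M writers each buffer ~one row group
// (blockSize bytes), so total writer memory is roughly M * blockSize.
public class ParquetSizing {
    // Assumed fraction of the heap reserved for non-writer usage.
    static final double SAFETY_FACTOR = 0.5;

    /** Largest blockSize such that M writers fit in the reserved heap share. */
    static long recommendedBlockSize(long totalHeapBytes, int numWriters) {
        return (long) (totalHeapBytes * SAFETY_FACTOR) / numWriters;
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024; // R = 4 GiB of heap
        int writers = 16;                    // M = 16 threads
        // 4 GiB * 0.5 / 16 = 134217728 bytes (128 MiB per row group)
        System.out.println(recommendedBlockSize(heap, writers));
    }
}
```

With this sketch, shrinking blockSize (as the reporter did) directly lowers the M * blockSize memory footprint, which is consistent with the out-of-memory errors disappearing; the trade-off is smaller row groups and thus reduced write/scan efficiency.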
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)