[ 
https://issues.apache.org/jira/browse/PARQUET-156?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manish Agarwal updated PARQUET-156:
-----------------------------------
    Summary: Document recommendations for block size and page size given an 
expected number of writers  (was: parquet document recommendations for block 
size and page size given an expected number of writers)

> Document recommendations for block size and page size given an expected 
> number of writers
> -----------------------------------------------------------------------------------------
>
>                 Key: PARQUET-156
>                 URL: https://issues.apache.org/jira/browse/PARQUET-156
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Manish Agarwal
>
> I sent a mail to the dev list, but I seem to have a problem with email 
> delivery there, so I am opening an issue here. 
> I am on a multithreaded system where there are M threads, each thread 
> creating an independent Parquet writer and writing its own independent 
> files to HDFS. I have a finite amount of RAM, say R. 
> When I created the Parquet writers using the default block and page sizes, 
> I got heap errors (out of memory) on my setup. After I reduced the block 
> size and page size to very low values, the out-of-memory errors stopped 
> and the files were written correctly. I am able to read these files 
> correctly as well. 
> I should not have to shrink these sizes manually; Parquet should 
> automatically make sure I do not get these errors. 
> But in case I do have to keep track of the memory myself, my question is 
> as follows. 
> Keeping these values very low is not a recommended practice, since I would 
> lose performance. I am particularly concerned about write performance. 
> What formula do you recommend for choosing the blockSize and pageSize to 
> pass to the Parquet constructor to get the right WRITE performance? That 
> is, how can I decide the right blockSize and pageSize for a Parquet writer 
> given that I have M threads and total available RAM R? I also don't 
> understand what dictionaryPageSize is needed for; if I need to worry about 
> it as well, please let me know, but I have kept the enableDictionary flag 
> set to false. 
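> For concreteness, here is the back-of-envelope sizing I have been assuming 
> so far (the names totalRamBytes, writerThreads, and the 0.5 headroom 
> factor are my own placeholders, not anything from Parquet); please correct 
> me if the model is wrong: 
>
> // Rough model: each open Parquet writer buffers about one row group
> // (blockSize bytes) on the heap, so M concurrent writers need roughly
> // M * blockSize bytes, plus headroom for everything else in the JVM.
> long totalRamBytes = 8L * 1024 * 1024 * 1024;  // R: available heap (assumed 8 GiB here)
> int writerThreads = 16;                        // M: number of concurrent writers
> double headroom = 0.5;                         // fraction of heap left for non-Parquet use
>
> long perWriterBudget =
>     (long) (totalRamBytes * (1.0 - headroom)) / writerThreads;
> // The constructor takes an int, so cap the budget at Integer.MAX_VALUE.
> int blockSize = (int) Math.min(perWriterBudget, (long) Integer.MAX_VALUE);
> int pageSize = 1024 * 1024;                    // 1 MiB pages; must be <= blockSize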
> I am using the constructor below: 
> public ParquetWriter(
>     Path file,
>     WriteSupport<T> writeSupport,
>     CompressionCodecName compressionCodecName,
>     int blockSize,
>     int pageSize,
>     int dictionaryPageSize,
>     boolean enableDictionary,
>     boolean validating,
>     WriterVersion writerVersion,
>     Configuration conf) throws IOException {
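> And this is roughly how I invoke it with the values computed above 
> (writeSupport, conf, and the output path stand in for my actual setup, and 
> SNAPPY / PARQUET_1_0 are just example choices): 
>
> ParquetWriter<T> writer = new ParquetWriter<T>(
>     new Path("hdfs:///data/part-0.parquet"), // hypothetical per-thread file
>     writeSupport,
>     CompressionCodecName.SNAPPY,
>     blockSize,
>     pageSize,
>     pageSize,          // dictionaryPageSize: unused here since dictionary is off
>     false,             // enableDictionary
>     false,             // validating
>     WriterVersion.PARQUET_1_0,
>     conf);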



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
