[ https://issues.apache.org/jira/browse/PARQUET-156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261258#comment-14261258 ]

Ryan Blue commented on PARQUET-156:
-----------------------------------

I'm interested in what you find out on tuning, Manish. For this issue, I think 
Brock is right that PARQUET-108 covers the "automatic memory management to 
avoid OOM". Maybe you could update this to "document recommendations for block 
size and page size given an expected number of writers"? I'm not sure that is 
very valuable though. Really, you want your block size above a certain minimum, 
which means you ideally wouldn't use the memory manager at all. While a memory 
manager keeps you from hitting OOM, it will degrade read performance if it makes 
the Parquet row group size (the Parquet block size) too small.

Here are some general ideas to follow:
1. Row group size should always be smaller than the HDFS block size.
2. A whole-number multiple of the row group size should be approximately (maybe 
a little less than) the HDFS block size. For example, 2 row groups might fit in 
a single HDFS block.
3. Remember that the row group size is an indicator of the memory footprint of 
each open file, for both reading and writing. Reading will ideally use less, 
but only by a constant factor determined by the columns you project.
4. Keep the expected number of open writers (M in your case) times the expected 
per-writer consumption below your memory threshold (lower than the total heap), 
and avoid leaving this to the memory manager. A rough sizing sketch follows 
this list.
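
To make point 4 concrete, here is a rough sizing sketch. It is only an 
illustration, not anything that ships with parquet-mr: M, R, the 128 MB HDFS 
block size, the Snappy codec, and the write support passed in are placeholder 
assumptions, and the imports assume a 1.6.x-era release (package names change 
after the org.apache.parquet rename).

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import parquet.column.ParquetProperties.WriterVersion;
import parquet.hadoop.ParquetWriter;
import parquet.hadoop.api.WriteSupport;
import parquet.hadoop.metadata.CompressionCodecName;

public class WriterSizing {

  // Placeholder figures for this sketch: M concurrent writers sharing heap budget R.
  static final int M = 16;                            // expected number of open writers
  static final long R = 8L * 1024 * 1024 * 1024;      // usable heap budget in bytes
  static final long HDFS_BLOCK = 128L * 1024 * 1024;  // dfs.blocksize on the cluster

  public static <T> ParquetWriter<T> openWriter(Path file,
                                                WriteSupport<T> writeSupport,
                                                Configuration conf) throws IOException {
    // Point 4: keep M * rowGroupSize comfortably below the heap budget;
    // half of R is reserved here as headroom for everything else on the heap.
    long perWriterBudget = (R / 2) / M;

    // Points 1 and 2: never exceed the HDFS block size, and pick the largest
    // row group size such that a whole number of row groups fills one block.
    long groupsPerBlock = Math.max(1L, (HDFS_BLOCK + perWriterBudget - 1) / perWriterBudget);
    int rowGroupSize = (int) (HDFS_BLOCK / groupsPerBlock);

    int pageSize = 1024 * 1024; // 1 MB pages (the ParquetWriter default)

    return new ParquetWriter<T>(
        file,
        writeSupport,
        CompressionCodecName.SNAPPY,
        rowGroupSize,   // "blockSize" = Parquet row group size
        pageSize,
        pageSize,       // dictionaryPageSize (unused here, dictionary is off)
        false,          // enableDictionary, matching the reporter's setup
        false,          // validating
        WriterVersion.PARQUET_1_0,
        conf);
  }
}

With those placeholder numbers the sketch lands on 128 MB row groups (well 
under the 256 MB per-writer budget); with 64 writers it would drop to 64 MB so 
that two row groups still fill one HDFS block.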

> parquet should have automatic memory management to avoid out of memory error. 
> ------------------------------------------------------------------------------
>
>                 Key: PARQUET-156
>                 URL: https://issues.apache.org/jira/browse/PARQUET-156
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Manish Agarwal
>
> I sent a mail to the dev list, but I seem to have a problem with email on 
> the dev list, so I am opening a bug here.
> I am on a multithreaded system with M threads, each thread creating an 
> independent Parquet writer and writing its own independent files to HDFS. 
> I have a finite amount of RAM, say R.
> When I created the Parquet writers using the default block and page sizes, 
> I got heap errors (out of memory) on my setup. So I reduced my block size 
> and page size to very low values, and my system stopped giving me these 
> out-of-memory errors and started writing the files correctly. I am able to 
> read these files correctly as well.
> I should not have to lower these sizes myself; Parquet should automatically 
> make sure I do not get these errors.
> But in case I have to keep track of the memory myself, my question is as 
> follows. Keeping these values very low is not a recommended practice, as I 
> would lose performance; I am particularly concerned about write performance. 
> What formula do you recommend I use to find the correct blockSize and 
> pageSize to pass to the Parquet constructor to get the right WRITE 
> performance? That is, how can I decide the right blockSize and pageSize for 
> a Parquet writer given that I have M threads and total available RAM of R? 
> I also don't understand what dictionaryPageSize is needed for; I have kept 
> the enableDictionary flag as false, but please let me know if I need to 
> worry about it as well.
> I am using the constructor below.
> public ParquetWriter(
>     Path file,
>     WriteSupport<T> writeSupport,
>     CompressionCodecName compressionCodecName,
>     int blockSize,
>     int pageSize,
>     int dictionaryPageSize,
>     boolean enableDictionary,
>     boolean validating,
>     WriterVersion writerVersion,
>     Configuration conf) throws IOException {


