[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize

2020-09-13 Thread GitBox


bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-691739964


   @RajasekarSribalan: Please reopen if you still have any questions.
   
   Thanks,
   Balaji.V







[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize

2020-08-21 Thread GitBox


bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-678220122


   Sorry for the delay in responding. Here is the default storage level config I am seeing:
   
   private static final String WRITE_STATUS_STORAGE_LEVEL = "hoodie.write.status.storage.level";
   private static final String DEFAULT_WRITE_STATUS_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
   
   From the code, I can see that Hudi uses Spark's persist API to manage this cache, so this does not look like the problem.
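   
   For reference, here is a minimal sketch of passing that storage level through the Spark datasource writer. It assumes a plain append/upsert; the table path, table name, and the id/ts field names are placeholders, not values from this thread.
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class WriteStatusStorageLevelSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder()
           .appName("hudi-write-status-storage-level")
           .getOrCreate();
   
       // Placeholder source; replace with the DataFrame you are upserting.
       Dataset<Row> df = spark.read().parquet("/tmp/source_data");
   
       df.write()
           .format("org.apache.hudi")
           // MEMORY_AND_DISK_SER is already the default; set explicitly here only to show the knob.
           .option("hoodie.write.status.storage.level", "MEMORY_AND_DISK_SER")
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.precombine.field", "ts")
           .mode(SaveMode.Append)
           .save("/tmp/hudi/my_table");
     }
   }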







[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize

2020-08-10 Thread GitBox


bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671690639


   To understand better: are you using bulk insert for the initial load and upsert for subsequent operations?
   For records with LOBs, it is important to tune hoodie.copyonwrite.record.size.estimate during the initial bootstrap to get the file sizing right.
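   
   As a rough illustration only (not from this thread), an initial-load sketch that sets the record size estimate and the target file size up front; the ~8 KB estimate, paths, table name, and field names are assumptions that should be replaced with values measured from your data.
   
   import org.apache.spark.sql.Dataset;
   import org.apache.spark.sql.Row;
   import org.apache.spark.sql.SaveMode;
   import org.apache.spark.sql.SparkSession;
   
   public class InitialLoadSizingSketch {
     public static void main(String[] args) {
       SparkSession spark = SparkSession.builder().appName("hudi-initial-load").getOrCreate();
   
       // Placeholder for the initial bootstrap dataset.
       Dataset<Row> df = spark.read().parquet("/tmp/initial_load");
   
       df.write()
           .format("org.apache.hudi")
           // Assumed ~8 KB average record size for LOB-heavy rows; measure and adjust.
           .option("hoodie.copyonwrite.record.size.estimate", "8192")
           // Target parquet file size in bytes (the limitFileSize being discussed), ~128 MB here.
           .option("hoodie.parquet.max.file.size", String.valueOf(128L * 1024 * 1024))
           .option("hoodie.table.name", "my_table")
           .option("hoodie.datasource.write.recordkey.field", "id")
           .option("hoodie.datasource.write.precombine.field", "ts")
           .mode(SaveMode.Overwrite)
           .save("/tmp/hudi/my_table");
     }
   }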







[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize

2020-08-09 Thread GitBox


bvaradar commented on issue #1939:
URL: https://github.com/apache/hudi/issues/1939#issuecomment-671079032


   Regarding the OOM errors, please check which Spark stage is causing the failure. You might need to tune parallelism for it. The size of the parquet files should not be the issue.
   
   Regarding file sizing: how did you create the initial dataset? Did you change the limitFileSize parameter between commits? What is your average record size? During the initial commit, Hudi relies on hoodie.copyonwrite.record.size.estimate to estimate the average record size used for file sizing. For subsequent commits, it auto-tunes based on the previous commit metadata. Maybe your record size is really large and you need to tune this parameter the first time you write to the dataset.
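   
   To make that arithmetic concrete, here is a back-of-the-envelope sketch. The 120 MB file limit and 1 KB record size estimate are the commonly documented defaults, and the 16 KB "actual" record size is an assumed LOB-heavy value, not something reported in this issue.
   
   public class FileSizingArithmeticSketch {
     public static void main(String[] args) {
       long maxFileSizeBytes = 120L * 1024 * 1024; // hoodie.parquet.max.file.size (limitFileSize)
       long estimatedRecordSize = 1024;            // hoodie.copyonwrite.record.size.estimate (assumed default)
       long actualRecordSize = 16 * 1024;          // assumed real average for LOB-heavy records
   
       // On the first commit there is no commit metadata, so the planner packs
       // roughly maxFileSize / estimate records into each new file.
       long plannedRecordsPerFile = maxFileSizeBytes / estimatedRecordSize;
   
       // If the estimate is 16x too small, the written file is roughly 16x the target
       // (ignoring parquet encoding and compression, which change the exact numbers).
       long approxWrittenBytes = plannedRecordsPerFile * actualRecordSize;
   
       System.out.println("planned records per file: " + plannedRecordsPerFile);
       System.out.println("approx resulting file size (MB): " + approxWrittenBytes / (1024 * 1024));
     }
   }
   
   On the next commit the average record size is recomputed from the previous commit's metadata, so the sizing self-corrects; the estimate only matters for the very first write to the dataset.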


