[GitHub] [hudi] bvaradar commented on issue #1939: [SUPPORT] Hudi creating parquet with huge size and not in sink with limitFileSize
bvaradar commented on issue #1939: URL: https://github.com/apache/hudi/issues/1939#issuecomment-691739964

@RajasekarSribalan : Please reopen if you still have any questions. Thanks, Balaji.V
bvaradar commented on issue #1939: URL: https://github.com/apache/hudi/issues/1939#issuecomment-678220122

Sorry for the delay in responding. Here is the default storage-level config I am seeing:

    private static final String WRITE_STATUS_STORAGE_LEVEL = "hoodie.write.status.storage.level";
    private static final String DEFAULT_WRITE_STATUS_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";

From the code, I can see that Hudi uses Spark's persist API to manage this cache, so this does not look like the problem.
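For context, a minimal sketch of what that persist pattern looks like on the Spark side; the RDD below is a stand-in for illustration, not Hudi's actual write-status RDD:

    import java.util.Arrays;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.storage.StorageLevel;

    public class PersistSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("persist-sketch").master("local[*]").getOrCreate();
            JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

            // Stand-in for the write-status RDD that Hudi caches between stages.
            JavaRDD<String> writeStatuses = jsc.parallelize(Arrays.asList("status-1", "status-2"));

            // The configured level ("MEMORY_AND_DISK_SER" by default, per the
            // constants above) is parsed and handed to persist(); partitions
            // that do not fit in memory spill to disk instead of causing OOM.
            writeStatuses.persist(StorageLevel.fromString("MEMORY_AND_DISK_SER"));
            System.out.println(writeStatuses.count());

            spark.stop();
        }
    }

Because MEMORY_AND_DISK_SER spills to disk rather than failing, the cache itself is unlikely to be the source of an OOM.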
bvaradar commented on issue #1939: URL: https://github.com/apache/hudi/issues/1939#issuecomment-671690639

To understand: are you using bulk insert for the initial load and upsert for subsequent operations? For records with LOBs, it is important to tune hoodie.copyonwrite.record.size.estimate during the initial bootstrap to get the file sizing right.
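As a hedged illustration (not from this issue), a minimal sketch of passing that estimate on the first write to a table; the table name, record key and precombine fields, base path, and the 8-KB estimate are all hypothetical:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class InitialLoadSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("initial-load").getOrCreate();
            Dataset<Row> sourceDf = spark.table("staged_lob_records"); // hypothetical source view

            sourceDf.write()
                .format("hudi")
                .option("hoodie.table.name", "lob_table")                 // hypothetical
                .option("hoodie.datasource.write.recordkey.field", "id")  // hypothetical
                .option("hoodie.datasource.write.precombine.field", "ts") // hypothetical
                // With LOB-sized records, raise the estimate (in bytes) so
                // the initial commit packs the right number of records per file.
                .option("hoodie.copyonwrite.record.size.estimate", "8192")
                .mode(SaveMode.Overwrite)
                .save("/data/hudi/lob_table");                            // hypothetical base path
        }
    }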
bvaradar commented on issue #1939: URL: https://github.com/apache/hudi/issues/1939#issuecomment-671079032

Regarding the OOM errors, please check which Spark stage is causing the failure; you might need to tune the parallelism for that stage. The size of the parquet files should not be the issue.

Regarding file sizing: how did you create the initial dataset? Did you change the limitFileSize parameter between commits? What is your average record size? During the initial commit, Hudi relies on hoodie.copyonwrite.record.size.estimate to estimate the average record size needed for file sizing. For subsequent commits, it auto-tunes based on the previous commit metadata. It may be that your record size is really large and you need to tune this parameter the first time you write to the dataset.
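To make both knobs concrete, a sketch of a subsequent upsert that tunes the upsert stage parallelism and the file size limit (the setting behind the limitFileSize builder method); the parallelism of 1500, the 128-MB limit, and all names are assumptions, not values from this issue:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SaveMode;
    import org.apache.spark.sql.SparkSession;

    public class UpsertTuningSketch {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("upsert-tuning").getOrCreate();
            Dataset<Row> upserts = spark.table("staged_upserts"); // hypothetical source view

            upserts.write()
                .format("hudi")
                .option("hoodie.table.name", "lob_table")                 // hypothetical
                .option("hoodie.datasource.write.operation", "upsert")
                .option("hoodie.datasource.write.recordkey.field", "id")  // hypothetical
                .option("hoodie.datasource.write.precombine.field", "ts") // hypothetical
                // Higher shuffle parallelism means smaller tasks in the
                // upsert stage, which is the usual fix for stage-level OOMs.
                .option("hoodie.upsert.shuffle.parallelism", "1500")
                // Target max size of each parquet file, in bytes.
                .option("hoodie.parquet.max.file.size", String.valueOf(128 * 1024 * 1024))
                .mode(SaveMode.Append)
                .save("/data/hudi/lob_table");                            // hypothetical base path
        }
    }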