xushiyan commented on issue #3933: URL: https://github.com/apache/hudi/issues/3933#issuecomment-974875070
@Limess a few questions:

- For this dataset, do you want to run bulk insert or upsert? If it's an append-only dataset, then bulk insert should be the mode.
- Do smaller parquet files matter a lot in this case? `GLOBAL_SORT` is expected to use more disk space, as it does a shuffling sort to line up records for bulk writing. If you change to the `NONE` sort mode, you'd get more small files but faster writes and less disk spill. Small files can be mitigated by clustering. There is a trade-off to consider based on your needs.
- A parallelism of 10036 may be a bit too high; maybe try some number around 3600 = # executors x # cores x 2?

@nsivabalan @yihua any other tuning tips that could help here?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at: us...@infra.apache.org
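
The tuning suggestions in the comment above could be sketched roughly as follows. This is only an illustration, not a recommended configuration: the helper names and the cluster size (450 executors with 4 cores each) are hypothetical, chosen so that the rule of thumb lands on 3600. The option keys are the usual Hudi write configs for bulk insert and inline clustering, and are worth checking against the Hudi version in use.

```python
def suggested_parallelism(num_executors: int, cores_per_executor: int, factor: int = 2) -> int:
    """Rule of thumb from the comment: # executors x # cores x 2."""
    return num_executors * cores_per_executor * factor


def bulk_insert_options(parallelism: int, sort_mode: str = "NONE",
                        inline_clustering: bool = True) -> dict:
    """Build Hudi write options for an append-only bulk insert (hypothetical helper)."""
    options = {
        "hoodie.datasource.write.operation": "bulk_insert",
        # NONE skips the global shuffle sort: faster writes and less disk
        # spill, at the cost of more small files.
        "hoodie.bulkinsert.sort.mode": sort_mode,
        "hoodie.bulkinsert.shuffle.parallelism": str(parallelism),
    }
    if inline_clustering:
        # Clustering can compact the small files produced by the NONE sort mode.
        options["hoodie.clustering.inline"] = "true"
    return options


# Assumed cluster size, for illustration only.
parallelism = suggested_parallelism(num_executors=450, cores_per_executor=4)
options = bulk_insert_options(parallelism)
```

With a Spark DataFrame `df`, these options would typically be passed as `df.write.format("hudi").options(**options)`, together with the table name, record key, and other required settings, which are omitted here.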