Hi,
We have a requirement to ingest 30M records into S3 backed by Hudi. While working out the partition strategy I am ending up with a very large number of partitions, roughly 25M (primary partition) --> 2.5M (secondary partition) --> 2.5M (third partition), and each Parquet file would end up holding fewer than 10 rows of data.
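
Concretely, in Hudi write options the three-level layout I am describing would look roughly like the sketch below (Spark/Scala; col_a, col_b and col_c are placeholder names for my real partition columns):

// Sketch of the current high-cardinality idea: three partition
// columns chained into one partition path. Column names are placeholders.
val threeLevelPartitionOpts = Map(
  "hoodie.datasource.write.partitionpath.field"     -> "col_a,col_b,col_c",
  "hoodie.datasource.write.keygenerator.class"      -> "org.apache.hudi.keygen.ComplexKeyGenerator",
  "hoodie.datasource.write.hive_style_partitioning" -> "true"
)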

The dataset will be ingested in full once, and after that it will receive incremental daily updates of fewer than 1k records, so the workload is read-heavy rather than write-heavy.

What would you suggest for Hudi performance: go ahead with the partition strategy above, or reduce the number of partitions and increase the number of rows in each Parquet file (roughly the alternative sketched below)?
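
For reference, this is the coarser alternative I am weighing, as a minimal self-contained sketch (Spark/Scala; the table name, S3 path, column names and file-size values are placeholders, not our real settings):

import org.apache.spark.sql.{SaveMode, SparkSession}

object HudiPartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("hudi-partition-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy rows standing in for the real 30M-record dataset;
    // record_id / updated_at / region are placeholder column names.
    val df = Seq(
      ("id-1", "2024-01-01", "region-a"),
      ("id-2", "2024-01-01", "region-b")
    ).toDF("record_id", "updated_at", "region")

    val hudiOptions = Map(
      "hoodie.table.name"                           -> "my_table",
      "hoodie.datasource.write.recordkey.field"     -> "record_id",
      "hoodie.datasource.write.precombine.field"    -> "updated_at",
      // One coarse, low-cardinality partition column instead of
      // three high-cardinality levels.
      "hoodie.datasource.write.partitionpath.field" -> "region",
      // One-time full load; the daily <1k updates would use "upsert".
      "hoodie.datasource.write.operation"           -> "bulk_insert",
      // Target well-sized Parquet files so each file carries many rows,
      // not fewer than 10.
      "hoodie.parquet.max.file.size"                -> (120 * 1024 * 1024).toString,
      "hoodie.parquet.small.file.limit"             -> (100 * 1024 * 1024).toString
    )

    df.write
      .format("hudi")
      .options(hudiOptions)
      .mode(SaveMode.Overwrite)
      .save("s3://my-bucket/hudi/my_table") // placeholder base path

    spark.stop()
  }
}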
