Hi Tanuj,

For good query performance, it's recommended to write optimally sized files, and Hudi already ensures that.
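As a rough sketch of the knobs behind that (assuming the Spark datasource writer; the sizes below are ballpark illustrations, not tuned recommendations):

  // write options that drive Hudi's file sizing; values are illustrative
  val fileSizingOpts = Map(
    "hoodie.parquet.max.file.size"    -> String.valueOf(120 * 1024 * 1024), // target size per parquet file
    "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024)  // files below this get new inserts packed into them
  )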
Generally speaking, if you have too many partitions, then you also have too many files. Most people limit their datasets to thousands of partitions, since queries typically crunch data based on time or a business domain (e.g., city for Uber). Partitioning too granularly, say based on user_id, is not very useful unless your queries only crunch data per user. Also, if you are using the Hive metastore, then 25M partitions means 25M rows in your backing MySQL metastore DB as well, which is not very scalable.

What I am trying to say is: even outside of Hudi, if analytics is your use case, it might be worth partitioning at a lower granularity and increasing the number of rows per Parquet file (there is a rough sketch at the end of this mail).

Thanks,
Vinoth

On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:

> Hi,
> We have a requirement to ingest 30M records into S3 backed by Hudi. I am
> figuring out the partitioning strategy and ending up with a lot of
> partitions, like 25M partitions (primary partition) --> 2.5M (secondary
> partition) --> 2.5M (third partition), and each Parquet file will have
> less than 10 rows of data.
>
> Our dataset will be ingested in full at once and then it will be
> incremental daily with less than 1k updates, so it's more read-heavy than
> write-heavy.
>
> So what would be the suggestion in terms of Hudi performance: go ahead
> with the above partition strategy, or shall I reduce my partitions and
> increase the number of rows in each Parquet file?
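P.S. To make the suggestion concrete, a write with a coarse, date-based partition field could look roughly like the sketch below. The table name, columns, and S3 path are all made up for illustration; adapt them to your schema.

  import org.apache.spark.sql.{SaveMode, SparkSession}

  // assumes spark-shell or spark-submit provides the cluster/master config
  val spark = SparkSession.builder().appName("hudi-partitioning-sketch").getOrCreate()
  import spark.implicits._

  // toy input: record key, event date, payload
  val df = Seq(
    ("id-1", "2020-06-01", "a"),
    ("id-2", "2020-06-01", "b")
  ).toDF("record_id", "event_date", "payload")

  df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "record_id")       // keep user/record ids as keys, not partitions
    .option("hoodie.datasource.write.precombine.field", "event_date")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")  // coarse, date-based partitioning
    .option("hoodie.parquet.max.file.size", String.valueOf(120 * 1024 * 1024))   // file sizing, as mentioned above
    .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events")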
