Hi Tanuj,

For good query performance, it's recommended to write optimally sized files, and Hudi already ensures that.
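As a rough sketch of the knobs behind that (assuming the Spark datasource writer; the sizes below are ballpark illustrations, not tuned recommendations):

  // write options that drive Hudi's file sizing; values are illustrative
  val fileSizingOpts = Map(
    "hoodie.parquet.max.file.size"    -> String.valueOf(120 * 1024 * 1024), // target size per parquet file
    "hoodie.parquet.small.file.limit" -> String.valueOf(100 * 1024 * 1024)  // files below this get new inserts packed into them
  )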
Generally speaking, if you have too many partitions, then you also have too many files. Most people limit their datasets to thousands of partitions, since queries typically crunch data based on time or a business domain (e.g., city for Uber). Partitioning too granularly, say based on user_id, is not very useful unless your queries only crunch data per user. Also, if you are using the Hive metastore, then 25M partitions means 25M rows in your backing MySQL metastore DB as well, which is not very scalable.

What I am trying to say is: even outside of Hudi, if analytics is your use case, it might be worth partitioning at a lower granularity and increasing the number of rows per Parquet file (there is a rough sketch at the end of this mail).

Thanks,
Vinoth

On Tue, Jun 2, 2020 at 3:18 AM Tanuj <[email protected]> wrote:

> Hi,
> We have a requirement to ingest 30M records into S3 backed by Hudi. I am
> figuring out the partitioning strategy and ending up with a lot of
> partitions, like 25M partitions (primary partition) --> 2.5M (secondary
> partition) --> 2.5M (third partition), and each Parquet file will have
> less than 10 rows of data.
>
> Our dataset will be ingested in full at once and then it will be
> incremental daily with less than 1k updates, so it's more read-heavy than
> write-heavy.
>
> So what would be the suggestion in terms of Hudi performance: go ahead
> with the above partition strategy, or shall I reduce my partitions and
> increase the number of rows in each Parquet file?
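P.S. To make the suggestion concrete, a write with a coarse, date-based partition field could look roughly like the sketch below. The table name, columns, and S3 path are all made up for illustration; adapt them to your schema.

  import org.apache.spark.sql.{SaveMode, SparkSession}

  // assumes spark-shell or spark-submit provides the cluster/master config
  val spark = SparkSession.builder().appName("hudi-partitioning-sketch").getOrCreate()
  import spark.implicits._

  // toy input: record key, event date, payload
  val df = Seq(
    ("id-1", "2020-06-01", "a"),
    ("id-2", "2020-06-01", "b")
  ).toDF("record_id", "event_date", "payload")

  df.write.format("hudi")
    .option("hoodie.table.name", "events")
    .option("hoodie.datasource.write.recordkey.field", "record_id")       // keep user/record ids as keys, not partitions
    .option("hoodie.datasource.write.precombine.field", "event_date")
    .option("hoodie.datasource.write.partitionpath.field", "event_date")  // coarse, date-based partitioning
    .option("hoodie.parquet.max.file.size", String.valueOf(120 * 1024 * 1024))   // file sizing, as mentioned above
    .option("hoodie.parquet.small.file.limit", String.valueOf(100 * 1024 * 1024))
    .mode(SaveMode.Append)
    .save("s3://my-bucket/hudi/events")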
