Hi folks,

We have a use case where we want to ingest data concurrently into
different partitions. Currently, Hudi doesn't support concurrent writes
to the same Hudi table.

One approach we are considering is to use one Hudi table per data
partition. So if we have 1000 partitions, we will have 1000 Hudi tables,
which lets us write to each partition concurrently. The metadata for
each partition is then synced to a single metastore table (the
assumption being that the schema is the same across all partitions), and
that single metastore table serves all the Spark and Hive queries.
Essentially, the metastore glues the data from all the different Hudi
tables together into one logical table.
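
To make the idea concrete, here is roughly what we do per partition (a
minimal sketch; the table names, paths, record key, and precombine
fields below are illustrative, not our actual schema):

import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

def ingestPartition(spark: SparkSession, df: DataFrame, dt: String): Unit = {
  // One Hudi table (and hence one .hoodie timeline) per partition,
  // so writers for different partitions never contend.
  val basePath = s"s3://bucket/events/dt=$dt"

  df.write
    .format("hudi")
    .option("hoodie.table.name", s"events_$dt")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    // Each per-partition Hudi table is itself non-partitioned; the
    // "partitioning" lives in the metastore table above it.
    .option("hoodie.datasource.write.partitionpath.field", "")
    .option("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator")
    .mode(SaveMode.Append)
    .save(basePath)

  // Register the path under the single "glue" metastore table so that
  // query engines see one logical table across all partitions.
  spark.sql(
    s"ALTER TABLE analytics.events ADD IF NOT EXISTS " +
      s"PARTITION (dt='$dt') LOCATION '$basePath'")
}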

We have already tested this approach and it's working fine; each
partition gets its own timeline and Hudi table.

We wanted to know whether there are any gotchas or other issues with
this approach to enabling concurrent writes, or whether there are other
approaches we could take.

Thanks,
Shayan
