Thanks for your responses.

@Mario: Yes, the metastore would be something like a Glue/Hive metastore
that holds the metadata about the different partitions in a single table.
One challenge is that each per-partition Hudi table can be queried directly
using the Hudi library bundle, but queries that span partitions have to go
through the metastore itself.
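
To make the two query paths concrete, here is a minimal Spark sketch (the
table name, paths, and fields are hypothetical, just for illustration):

    // Per-partition: load one Hudi table directly by its base path
    // (requires the Hudi library bundle on the classpath; depending on
    // the Hudi version, the load path may need partition globs).
    val onePartition = spark.read
      .format("org.apache.hudi")
      .load("s3://bucket/hudi/events/partition_0001")

    // Cross-partition: go through the single Glue/Hive metastore table
    // that all the per-partition Hudi tables are synced into.
    val allPartitions = spark.sql("SELECT * FROM events WHERE ts > '2020-07-01'")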

@Vinoth: The use-case is that we have different partitions, and both the
data volume and the load are skewed across them, so some partitions have to
ingest much more data than others. With a single writer, a large delta on
one partition delays the ingestion of the smaller partitions as well, and a
failure or corrupt data in one partition's delta affects the others. That
is why we want the writes to be independent per partition.
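
As a rough sketch of what one such independent writer job looks like under
this scheme (the option values, field names, and the sync step below are
assumptions for illustration, not our exact code):

    import org.apache.spark.sql.SaveMode

    val partitionId = "partition_0001" // this job's data partition (hypothetical)
    val delta = spark.read.parquet(s"s3://bucket/incoming/$partitionId")

    // Each job upserts into its own Hudi table, so it has its own timeline;
    // a slow or failed delta here cannot block the other partitions' jobs.
    delta.write
      .format("org.apache.hudi")
      .option("hoodie.table.name", s"events_$partitionId")
      .option("hoodie.datasource.write.recordkey.field", "id")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.partitionpath.field", "dt")
      .mode(SaveMode.Append)
      .save(s"s3://bucket/hudi/events/$partitionId")

    // One way to "sync" this table into the shared metastore table:
    // register its base path as a partition of that table (assumes the
    // shared table is partitioned by pt and Hive support is enabled).
    spark.sql(s"""ALTER TABLE events ADD IF NOT EXISTS
      PARTITION (pt = '$partitionId')
      LOCATION 's3://bucket/hudi/events/$partitionId'""")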

Also, is there any timeline for when 0.6.0 will be released?

Thanks,
Shayan


On Thu, Jul 9, 2020 at 9:22 AM Vinoth Chandar <vin...@apache.org> wrote:

> We are looking into adding support for parallel writers in 0.6.0. So that
> should help.
>
> I am curious to understand though why you prefer to have 1000 different
> writer jobs, as opposed to having just one writer. Typical use cases for
> parallel writing I have seen are related to backfills and such.
>
> +1 to Mario’s comment. Can’t think of anything else if your users are happy
> querying 1000 tables.
>
> On Wed, Jul 8, 2020 at 7:28 AM Mario de Sá Vera <desav...@gmail.com>
> wrote:
>
> > hey Shayan,
> >
> > That actually seems like a very good approach ... just curious about the
> > Glue metastore you mentioned. Would it be an external metastore for Spark
> > to query over? External in the sense of not being managed by Hudi?
> >
> > That would be my only concern ... how to maintain the sync between all
> > the metadata partitions. But, again, a very promising approach!
> >
> > regards,
> >
> > Mario.
> >
> > On Wed, Jul 8, 2020 at 3:20 PM Shayan Hati <shayanh...@gmail.com>
> > wrote:
> >
> > > Hi folks,
> > >
> > > We have a use-case where we want to ingest data concurrently for
> > > different partitions. Currently, Hudi doesn't support concurrent
> > > writes on the same Hudi table.
> > >
> > > One of the approaches we were considering was to use one Hudi table
> > > per partition of data. So if we have 1000 partitions, we will have
> > > 1000 Hudi tables, which will enable us to write concurrently to each
> > > partition. The metadata for each partition will be synced to a single
> > > metastore table (the assumption being that the schema is the same for
> > > all partitions). This single metastore table can then be used for all
> > > the Spark and Hive queries when querying data. Basically, the
> > > metastore glues all the different Hudi tables together into a single
> > > table.
> > >
> > > We have already tested this approach and it works fine; each
> > > partition has its own timeline and Hudi table.
> > >
> > > We wanted to know if there are any gotchas or other issues with this
> > > approach to enabling concurrent writes, or if there are any other
> > > approaches we could take.
> > >
> > > Thanks,
> > > Shayan
> > >
> >
>


-- 
Shayan Hati
