Re: Hudi - Concurrent Writes

2020-10-19 Thread tanu dua
Thank you so much.


Re: Hudi - Concurrent Writes

2020-10-19 Thread Balaji Varadarajan
 
We are planning to add parallel writing to Hudi (at different partition levels)
in the next release.

Balaji.V

Re: Hudi - Concurrent Writes

2020-10-17 Thread tanu dua
Hi,
Do we have support for concurrent writes in 0.6? I have a similar
requirement to ingest in parallel from multiple jobs. I am OK even if
parallel writes are only supported across different partitions.


Re: Hudi - Concurrent Writes

2020-07-09 Thread Vinoth Chandar
Great points, Prashant!


> So one partition has to ingest much more data than another. Basically,
> one large partition delta affects the ingestion time of a smaller size
> partition as well

So this indicates you want to commit writes into smaller partitions first,
without waiting for the larger partitions; you do need concurrent writing
in that case.


> Also failure/corrupt data of one partition delta affects others if we
> have single write. So we wanted these writes to be independent per
> partition.

This actually seems to indicate that you want restores/rollbacks at the
partition level. If this is a hard requirement, then even concurrent
writing won't help; you actually need separate physical tables with their
own timelines.


Re: Hudi - Concurrent Writes

2020-07-09 Thread Prashant Wason
With a large number of tables you also run into the following potential
issues:
1. Consistency: There is no single timeline, so different tables (per
partition) expose data from different times of ingestion. If the data
within partitions is inter-dependent, then queries may see inconsistent
results.

2. Complicated error handling / debugging: If some of the pipelines fail,
then data in some partitions may not have been updated for some time. This
may lead to data consistency issues on the query side. Debugging any issue
when 1000 separate datasets are involved is much more complicated than with
a single dataset (e.g. hudi-cli connects to one dataset at a time).

3. (Possibly minor) Excess load on the infra: With several parallel
operations, the worst-case load on the NameNode may go up N times (N =
number of parallel pipelines). An under-provisioned NameNode may lead to
out-of-resource errors.

4. Adding new partitions would be complicated: Assuming you will want a
new partition in the future, the steps become more involved.

If only a few partitions are having load issues, you can also look into
the partitioning scheme:
1. Maybe invent a new column in the schema which is more uniformly
distributed (see the sketch after this list)
2. Maybe split the loaded partitions into two partitions (range-based or
something like that)
3. If possible (depending on the ingestion source), prioritize ingestion
for particular partitions (partition priority queue)
4. Limit the number of records ingested at a time to cap the maximum job
time
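
To make suggestion 1 concrete, here is a minimal Spark sketch of deriving a
uniformly distributed partition column by hashing the record key into
fixed-size buckets. Everything specific here is an assumption for
illustration, not from the thread: the column names (uuid, ts), the paths,
the bucket count, and the table name.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{abs, col, hash}

object SaltedPartitioningSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("salted-partitioning").getOrCreate()

    val numBuckets = 64 // pick so each bucket sees roughly equal write volume
    val input = spark.read.parquet("/data/incoming/events") // hypothetical source

    // Derive a uniformly distributed partition column by hashing the record
    // key into fixed-size buckets, instead of partitioning on a skewed column.
    val salted = input.withColumn("part_bucket", abs(hash(col("uuid"))) % numBuckets)

    salted.write.format("hudi")
      .option("hoodie.table.name", "events_salted")
      .option("hoodie.datasource.write.recordkey.field", "uuid")
      .option("hoodie.datasource.write.precombine.field", "ts")
      .option("hoodie.datasource.write.partitionpath.field", "part_bucket")
      .mode("append")
      .save("/data/hudi/events_salted")
  }
}
```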

Thanks
Prashant


Re: Hudi - Concurrent Writes

2020-07-09 Thread Shayan Hati
Thanks for your response.

@Mario: The metastore can be something like a Glue/Hive metastore which
basically holds the metadata about the different partitions in a single
table. One challenge is that each per-partition Hudi table can be queried
using the Hudi library bundle, but across partitions the data has to be
queried through the metastore itself.

@Vinoth: The use-case is that we have different partitions, and the data as
well as the load is skewed across them, so one partition has to ingest much
more data than another. Basically, a large delta on one partition affects
the ingestion time of a smaller partition as well. Also, a failure or
corrupt data in one partition's delta affects the others if we have a
single writer. So we wanted these writes to be independent per partition.

Also, is there any timeline for when 0.6.0 will be released?

Thanks,
Shayan


-- 
Shayan Hati
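
A minimal sketch of the "single metastore table" idea described above,
assuming a spark-shell session with Hive support (`spark` in scope); the
database, table, column, region, and path names are all invented:

```scala
// One external table whose partitions point at the base paths of the
// individual per-partition Hudi tables. Names and paths are hypothetical.
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS analytics.events_all (
    uuid STRING,
    ts BIGINT,
    payload STRING
  )
  PARTITIONED BY (region STRING)
  STORED AS PARQUET
  LOCATION '/data/hudi/events_all'
""")

// Register each per-region Hudi table as one partition of the combined table.
Seq("us_east", "eu_west", "ap_south").foreach { region =>
  spark.sql(
    s"""ALTER TABLE analytics.events_all
       |ADD IF NOT EXISTS PARTITION (region = '$region')
       |LOCATION '/data/hudi/events_$region'""".stripMargin)
}
```

One likely gotcha, in the spirit of the question in this thread: a plain
Parquet external table does not apply Hudi's file-slice filtering, so
queries against it may need Hudi's input format or path filter registered
to avoid reading superseded file versions.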


Re: Hudi - Concurrent Writes

2020-07-08 Thread Vinoth Chandar
We are looking into adding support for parallel writers in 0.6.0. So that
should help.

I am curious to understand, though, why you prefer to have 1000 different
writer jobs as opposed to having just one writer. Typical use cases for
parallel writing that I have seen are related to backfills and such.

+1 to Mario’s comment. Can’t think of anything else if your users are happy
querying 1000 tables.


Re: Hudi - Concurrent Writes

2020-07-08 Thread Mario de Sá Vera
hey Shayan,

that seems actually a very good approach ... just curious about the glue
metastore you mentioned. Would it be an external metastore for Spark to
query over, external in the sense of not being managed by Hudi?

that would be my only concern ... how to maintain the sync between all the
metadata partitions. But, again, a very promising approach!

regards,

Mario.

On Wed, 8 Jul 2020 at 15:20, Shayan Hati wrote:

> Hi folks,
>
> We have a use-case where we want to ingest data concurrently for different
> partitions. Currently Hudi doesn't support concurrent writes on the same
> Hudi table.
>
> One of the approaches we were thinking of was to use one Hudi table per
> partition of the data. So let us say we have 1000 partitions; we will have
> 1000 Hudi tables, which will enable us to write concurrently to each
> partition. The metadata for each partition will be synced to a single
> metastore table (the assumption here is that the schema is the same for
> all partitions), so this single metastore table can be used for all the
> Spark and Hive queries when querying data. Basically, this metastore table
> glues all the different Hudi table data together into a single table.
>
> We already tested this approach and it's working fine; each partition
> has its own timeline and Hudi table.
>
> We wanted to know if there are some gotchas or any other issues with this
> approach to enable concurrent writes? Or are there any other approaches
> we can take?
>
> Thanks,
> Shayan
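
A minimal sketch of the approach quoted above: one Hudi table per logical
partition, each with its own base path (and therefore its own timeline), so
independent jobs can write concurrently. The names, paths, and field values
below are assumptions for illustration:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Upsert a batch into the Hudi table that owns this logical partition.
def upsertIntoPartitionTable(df: DataFrame, partitionKey: String): Unit = {
  df.write.format("hudi")
    .option("hoodie.table.name", s"events_$partitionKey")
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode(SaveMode.Append)
    // A distinct base path per partition key keeps the timelines independent.
    .save(s"/data/hudi/events_$partitionKey")
}
```

Each ingestion job owns exactly one partitionKey, so no two writers ever
touch the same .hoodie timeline; the conflict avoidance comes from the
layout, not from Hudi itself.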


Re: Hudi concurrent writes

2020-04-17 Thread Vinoth Chandar
Hi Rahul,

>> which in theory does not allow concurrent writes, but it seemed a bit
>> arbitrary
If you kill your Hudi write job midway, it will most likely end up with a
pending commit (e.g. when rolling out new code). Rolling back pending
commits automatically before starting a new commit is just a usability
improvement that avoids the need to manually roll back. Strictly for
correctness, as long as the rollback happens before archival, we are
actually fine.

>> In the event that due to some errors two concurrent writers do end up
>> trying to write to the same table - would Hudi allow both to run or
>> allow only one to succeed?
For COW, if you turn off the automatic rollback, then two writers can
technically write and think they completed, but a query could see a mix of
writes from both, and the atomicity won't be there.

Multi-writer is something we have not given much thought to, since we solve
it externally at Uber using Kafka (it was a lot more reproducible/manageable
architecture, IMO). If this is a large use-case, please help us understand
and we can make it happen. We should have the bells and whistles already.

Thanks
Vinoth


Re: Hudi concurrent writes

2020-04-17 Thread Rahul Bhartia
Hey Vinoth -

I think what we are trying to understand is whether Hudi has any built-in
mechanism to prevent accidents from happening. In the event that, due to
some error, two concurrent writers do end up trying to write to the same
table, would Hudi allow both to run, or allow only one to succeed?

Brandon was looking at the code, and it seems that Hudi, by default on the
start of a new commit, will roll back any pending commit, which in theory
does not allow concurrent writes, but it seemed a bit arbitrary. Hence we
wanted to understand whether this was actually intended to prevent
concurrent writes on the same table, or whether that isn't the intention of
the code below, and users have to do something externally, like serializing
the writes or building some sort of locking protocol outside of Hudi.

HoodieWriteClient is always initialized to “rollbackPending”, rolling back
previous commits:
Delta Sync: 
https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L479
Spark-Writer: 
https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L195

HoodieWriteClient constructor
https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L120
HoodieWriteClient rollbackPending Method
https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L1024
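
A conceptual model, not the Hudi API, of why the rollback-on-start behavior
in the links above is unsafe with two writers; the types and names below are
invented purely for illustration:

```scala
// If every client rolls back all pending commits when it starts, a second
// concurrent writer silently discards the first writer's in-flight commit.
sealed trait CommitState
case object Inflight extends CommitState
case object Completed extends CommitState

final case class Timeline(commits: Map[String, CommitState]) {
  // Analogue of the rollbackPending step linked above: drop in-flight commits.
  def rollbackPending: Timeline =
    Timeline(commits.filter { case (_, state) => state == Completed })

  def begin(id: String): Timeline  = Timeline(commits + (id -> Inflight))
  def finish(id: String): Timeline = Timeline(commits + (id -> Completed))
}

// Writer A begins commit "c1" but has not yet finished it:
val afterA = Timeline(Map.empty).begin("c1")
// Writer B starts concurrently; its first act is rollbackPending, which
// erases A's pending "c1" before B begins its own commit "c2":
val afterB = afterA.rollbackPending.begin("c2")
// When A later tries to finish "c1", the timeline no longer contains it,
// which is why two concurrent writers are unsafe without external locking.
```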


Re: Hudi concurrent writes

2020-04-14 Thread Vinoth Chandar
Hi Brandon,

This is more practical advice than how to solve it using Hudi. By and
large, this need can be mitigated by serializing your writes in an upstream
message queue like Kafka. For example, let's say you want to delete some
records in a table that is currently being ingested by DeltaStreamer: all
you need to do is log more deletes, as described in this blog, into the
upstream Kafka topic, and the writes are serialized automatically for you.
At least in my experience, I found this a much more efficient way of doing
things than allowing two writers to proceed and failing all except the
latest writer. Downstream ETL tables built using Spark jobs also typically
tend to be single-writer. In short, I am saying: keep things single-writer.

That said, of late I have been thinking about this in the context of
multi-table transactions and seeing whether we should actually add the
support. Love to have some design partners if there is interest :)

Thanks
Vinoth
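
A minimal sketch of the pattern described above, with an invented topic and
payload: deletes are published as ordinary events to the upstream Kafka
topic that the single ingestion job already consumes, so every mutation is
serialized through one writer. The `_hoodie_is_deleted` marker is Hudi's
soft-delete convention; treat the exact field names and payload shape here
as assumptions.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PublishDeleteEvent {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    // A delete is just another upstream event; the single ingestion job sees
    // "_hoodie_is_deleted": true and removes the record on its next commit.
    val deleteEvent =
      """{"uuid": "rec-123", "ts": 1586900000, "_hoodie_is_deleted": true}"""
    producer.send(new ProducerRecord("events-topic", "rec-123", deleteEvent))
    producer.close()
  }
}
```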


Re: Hudi concurrent writes

2020-04-14 Thread Scheller, Brandon
Hi Nick,

I’m not exactly looking to meet a particular use case; I’m more looking to
understand the Hudi community’s plan and current recommendations surrounding
concurrency, and I’m very interested to hear what others have done in this
regard. Also, I think allowing multiple Avro log files could help solve part
of this problem as you describe, but we still run into the issue that Hudi
only supports upserts to log files, not inserts.

-Brandon

From: Semantic Beeng 
Date: Tuesday, April 14, 2020 at 1:28 PM
To: "dev@hudi.apache.org" , "Scheller, Brandon" 

Subject: RE: [EXTERNAL] Hudi concurrent writes



Hi Brandon,

Can you please elaborate on your use case?

Mine is about concurrent feature-extraction processes that would need to
write to the same target table. It could be addressed if Hudi allowed
multiple MOR timelines
(https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=135860485)
per table (right now there is only the concept of a MOR table).

With multiple/concurrent MOR timelines we could merge the timelines as units of 
work and get some form of logical concurrency.

Hope this makes sense.
Would this work for you?

Please advise
Nick
On April 14, 2020 at 12:23 PM, "Scheller, Brandon"  wrote:


Hi all,

If I understand correctly, Hudi is not currently recommended for the concurrent 
writer use cases. I was wondering what the community’s official stance on 
concurrency is, and what the recommended workarounds/solutions are for Hudi to 
help prevent data corruption/duplication (For example we’ve heard of 
environments using an external table lock).

Thanks,
Brandon