Re: Hudi concurrent writes

Vinoth Chandar Fri, 17 Apr 2020 11:25:22 -0700

Hi Rahul,

>> which in theory does not allow concurrent writes, but it seemed a bit
>> arbitrary
If you kill your Hudi write job midway (e.g. while rolling out new code), it
will most likely end up with a pending commit. Rolling back pending commits
automatically before starting a new commit is just a usability improvement
that avoids the need to roll back manually. Strictly for correctness, as
long as the rollback happens before archival, we are actually fine.
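
To make the pending-commit point concrete, here is a minimal toy model in
Python. It is only a sketch: ToyTimeline, its methods, and the
rollback_pending flag are hypothetical names for illustration, not Hudi's
actual classes or API. It assumes a commit is either "inflight" (pending) or
"completed", and that readers only see files from completed commits.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTimeline:
    """Toy stand-in for a table timeline (hypothetical, not a Hudi class)."""
    states: dict = field(default_factory=dict)  # instant time -> state
    files: dict = field(default_factory=dict)   # instant time -> files written

    def start_commit(self, instant, rollback_pending=True):
        # Optionally clean up commits that a killed job left pending.
        if rollback_pending:
            pending = [t for t, s in self.states.items() if s == "inflight"]
            for t in pending:
                self.rollback(t)
        self.states[instant] = "inflight"
        self.files[instant] = []

    def write(self, instant, file_name):
        self.files[instant].append(file_name)

    def complete(self, instant):
        self.states[instant] = "completed"

    def rollback(self, instant):
        # Drop the partial files and the pending marker together.
        self.files.pop(instant, None)
        self.states.pop(instant, None)

    def visible_files(self):
        # Readers only see files that belong to completed commits.
        return sorted(
            f
            for t, s in self.states.items()
            if s == "completed"
            for f in self.files[t]
        )


timeline = ToyTimeline()
timeline.start_commit("001")
timeline.write("001", "f1_001.parquet")
# The job is killed here, so "001" never completes and stays pending.

timeline.start_commit("002")  # rolls back "001" before starting
timeline.write("002", "f1_002.parquet")
timeline.complete("002")
```

In this toy model the pending commit "001" is invisible to readers even
before it is rolled back, which is why the automatic rollback is a
convenience rather than a correctness requirement.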


>> In the event that due to some errors two concurrent writers do end up
>> trying to write to the same table - would Hudi allow both to run or allow
>> only one to succeed?
For COW, if you turn off the automatic rollback, then two writers can
technically write and both think they completed, but a query could see a mix
of writes from both, and atomicity would be lost.
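
One way this can play out, sketched as a toy model in Python. The file-group
layout, the instant times, and the latest_snapshot helper are assumptions
made up for illustration; this is not how Hudi's query path is implemented.
The sketch assumes a reader picks, per file group, the file version with the
highest instant time.

```python
def latest_snapshot(file_versions):
    """Pick, per file group, the rows from the highest instant time."""
    newest = {}
    for (group, instant), rows in file_versions.items():
        if group not in newest or instant > newest[group][0]:
            newest[group] = (instant, rows)
    return {group: rows for group, (_, rows) in newest.items()}


# Base state written by instant "000".
file_versions = {
    ("g1", "000"): {"k1": "v0"},
    ("g2", "000"): {"k2": "v0"},
}

# Writer A (instant "001") read the base files and updates k1 and k2.
file_versions[("g1", "001")] = {"k1": "A"}
file_versions[("g2", "001")] = {"k2": "A"}

# Writer B (instant "002") ran concurrently, also read the base files,
# and rewrites only g1 to add k3. It never saw A's update to k1.
file_versions[("g1", "002")] = {"k1": "v0", "k3": "B"}

view = latest_snapshot(file_versions)
```

Here the reader sees A's update to k2 but not A's update to k1, so A's commit
is only partially visible even though both writers believe they succeeded.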

Multi-writer is something we have not given much thought to, since we solve
it externally at Uber using Kafka ... (it was a much more
reproducible/manageable architecture IMO).
If this is a large use case, please help us understand it and we can make it
happen. We should have the bells and whistles already.
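
As a rough illustration of what serializing writes externally means, here is
a small Python sketch. The deque stands in for a single ordered log (for
example one partition of a topic), and produce and single_writer_apply are
hypothetical helper names; none of this is Kafka's or Hudi's API. The idea is
that every producer, including one that wants to delete records, appends to
the same log, and exactly one writer applies the events in order.

```python
from collections import deque

# Stand-in for one ordered log shared by all producers (assumption).
log = deque()


def produce(op, key, value=None):
    """Append an upsert or delete event to the shared log."""
    log.append((op, key, value))


def single_writer_apply(events, table):
    """The only writer: drain the log in order and apply each event."""
    while events:
        op, key, value = events.popleft()
        if op == "upsert":
            table[key] = value
        elif op == "delete":
            table.pop(key, None)
    return table


# The ingestion pipeline and a separate delete job both go through the log.
produce("upsert", "user-1", {"name": "a"})
produce("upsert", "user-2", {"name": "b"})
produce("delete", "user-1")
produce("upsert", "user-2", {"name": "b2"})

table = single_writer_apply(log, {})
```

Because there is only one writer, the order of events in the log decides the
outcome and no two writers can race on the table.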

Thanks
Vinoth

On Fri, Apr 17, 2020 at 11:12 AM Rahul Bhartia <rahul.bhar...@gmail.com>
wrote:

> Hey Vinoth -
>
> I think what we are trying to understand is whether Hudi has any built-in
> mechanism to prevent accidents from happening. In the event that due to
> some errors two concurrent writers do end up trying to write to the same
> table - would Hudi allow both to run or allow only one to succeed?
>
> Brandon was looking at the code, and it seems like Hudi, by default on
> start of a new commit, will roll back any pending commit - which in theory
> does not allow concurrent writes, but it seemed a bit arbitrary. Hence we
> wanted to understand if this was actually intended to prevent concurrent
> writes on the same table - or if this isn't the intention of the code below,
> and users would have to do something externally like serializing the writes
> or building some sort of locking protocol outside of Hudi?
>
> HoodieWriteClient is always initialized with “rollbackPending”, rolling back
> previous commits.
> Delta Sync:
> https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L479
> Spark-Writer:
> https://github.com/apache/incubator-hudi/blob/master/hudi-spark/src/main/java/org/apache/hudi/DataSourceUtils.java#L195
>
> HoodieWriteClient constructor:
> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L120
> HoodieWriteClient rollbackPending method:
> https://github.com/apache/incubator-hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java#L1024
>
> On 2020/04/14 21:28:05, Vinoth Chandar <vin...@apache.org> wrote:
> > Hi Brandon,
> >
> > This is more practical advice than an explanation of how to solve it using
> > Hudi. By and large, this need can be mitigated by serializing your writes
> > in an upstream message queue like Kafka. For example, let's say you want to
> > delete some records in a table that is currently being ingested by
> > deltastreamer. All you need to do is log more deletes, as described in this
> > blog here, into the upstream Kafka topic. This will serialize the writes
> > automatically for you. At least in my experience, I found this a much more
> > efficient way of doing this than allowing two writers to proceed and
> > failing all except the latest writer. Downstream ETL tables built using
> > Spark jobs also typically tend to be single-writer. In short, I am saying
> > keep things single-writer.
> >
> > That said, of late I have been thinking about this in the context of
> > multi-table transactions and seeing if we can actually add the support.
> > Love to have some design partners if there is interest :)
> >
> > Thanks
> > Vinoth
> >
> > On Tue, Apr 14, 2020 at 9:23 AM Scheller, Brandon
> > <bsche...@amazon.com.invalid> wrote:
> >
> > > Hi all,
> > >
> > > If I understand correctly, Hudi is not currently recommended for
> > > concurrent-writer use cases. I was wondering what the community’s
> > > official stance on concurrency is, and what the recommended
> > > workarounds/solutions are for Hudi to help prevent data
> > > corruption/duplication (for example, we’ve heard of environments using
> > > an external table lock).
> > >
> > > Thanks,
> > > Brandon
> > >
> >
>
