Re: [PR] [HUDI-6194] prevent flink writer getting the wrong instant to write [hudi]

via GitHub Fri, 05 Jan 2024 01:21:32 -0800


voonhous commented on PR #8673:
URL: https://github.com/apache/hudi/pull/8673#issuecomment-1878361789

Adding some illustration for future reference:

# Normal Flow

![image](https://github.com/apache/hudi/assets/6312314/da36bc4f-4410-413e-b2ad-fdc995cf560e)

As can be seen in the normal flow (between the JM and TM), a cycle is
depicted by the green box.

# TM global failover

![image](https://github.com/apache/hudi/assets/6312314/22819d88-2bea-4929-a60c-b955a4c4506e)

When there are TM restarts, global_failover will happen and
`restored_write_meta` events (which will be detected by the JM as a bootstrap
event) will be sent to the JM. Once JM has collected all the
`restored_write_meta` events, a normal cycle will be invoked again.

This is done by rolling back the commit that is already on the timeline (and
therefore, all its corresponding files that might have been written to the
filesystem (via markers). After a rollback is done, a new commit is created and
the normal flow as depicted in the section above is hence restored.

This case is handled and the job can recover properly for such failures

# JM blocked while TM global failover is occuring

Note: this case is not handled as of now.

![image](https://github.com/apache/hudi/assets/6312314/b0d7c74c-1ff5-4569-9432-151d83d06935)

As can be seen here, if the JM is blocked by a rollback/archive. In this
example, an archive, and if checkpoint timeouts, causing a global failover, TMs
will restart.

As discussed previously, TM will send `restored_write_meta` event to the JM.
Since JM's executor is being blocked at this point in time (performing an
archive) invoked from `#notifyCheckpointComplete`, it will not be able to
handle events from operator.

Once archive is done, it will be unblocked and a new instant is hence
created by the JM (following the normal execution flow). For illustration, we
will name this instant `instant_x`.

After JM's executor exits from `#notifyCheckpointComplete`. At this point,
TM will see the new `instant_x` and start peforming the writes.

JM will start handling the operator events and will perform a
`#startInstant`, rolling back the previous `instant_x` that was just created
and create a new instant, `instant_y`. TM is unaware of this and will continue
with the previous fetched `instant_x`.

Desync happens and the writer will be perform abnormally until the instant
between JM and TM lines up.

Once desync happens, alot of unhandled cases can happen, depending on the
state of the a TM when rollback of `instant_x` is happening at the JM. For
brevity, we will not discuss them as they should not be happening in the first
place.

# TLDR

Blocked JM might will cause desync between JM and TM.

Unable to reproduce the data-loss, but am able to reproduce the precursor
that may be causing the data-loss.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] [HUDI-6194] prevent flink writer getting the wrong instant to write [hudi]

Reply via email to