(sorry for the late reply)

Hi - the commit time can be a logical time as well; a lot of tests work
this way. Some table features (e.g. time-based cleaning) may not work
with logical commit times, but those are mostly convenience features anyway.

I assume the consumer would process all events up to the required source
timestamp boundary before each commit to achieve this?
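
A rough sketch of that boundary idea, in plain Python with no Hudi calls:
group the changelog by a source-timestamp window and label each batch with
a logical commit instant taken from the window end. The field name `ts_ms`,
the helper names, and the 17-digit yyyyMMddHHmmssSSS instant layout are all
assumptions for illustration; how the instant is actually handed to the
writer is deliberately left out.

```python
from collections import defaultdict
from datetime import datetime, timezone


def boundary_instant(event_ts_ms: int, window_ms: int) -> str:
    """Round a source timestamp (epoch millis) up to the end of its window
    and format it as a fixed-width instant string (assumed layout)."""
    # Ceiling division: the batch labelled T holds every change with
    # T - window_ms < ts <= T.
    boundary_ms = -(-event_ts_ms // window_ms) * window_ms
    dt = datetime.fromtimestamp(boundary_ms // 1000, tz=timezone.utc)
    return dt.strftime("%Y%m%d%H%M%S") + f"{boundary_ms % 1000:03d}"


def plan_commits(events, window_ms):
    """Group CDC events into batches keyed by logical commit instant.

    `events` is an iterable of dicts carrying a hypothetical `ts_ms` field
    (the source database change time in epoch millis).
    """
    batches = defaultdict(list)
    for ev in events:
        batches[boundary_instant(ev["ts_ms"], window_ms)].append(ev)
    # Instant strings are fixed width, so lexical order is chronological.
    return sorted(batches.items())
```

Because the instant depends only on the event timestamps, replaying the
same changelog later yields the same batch labels regardless of when the
job runs.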

I am happy to chat/help scope the changes more.



On Wed, Aug 2, 2023 at 1:17 PM Joseph Thaidigsman
<jthaidigs...@slack-corp.com.invalid> wrote:

> Hello,
>
> We have a use-case where we have persisted the full CDC changelog for some
> tables in s3 and want to be able to bootstrap hudi tables with the
> changelog data and then be able to time-travel the hudi table to get
> snapshot views of the table on dates prior to bootstrapping. In our
> changelog, we have the timestamp associated with the
> inserts/updates/deletes, so the data to achieve this is present. If we had
> a live consumer processing those events in real-time and writing them to a
> hudi table, then we would be able to achieve this, but because we are
> instead creating the hudi table from a single batch job, we are unable to
> achieve it despite processing the same exact data, since time-travel is all
> based on the hudi commit time.
>
> Aside from our specific use-case for bootstrapping tables, this would be
> useful for real-time CDC consumers as well.  Currently, there is no way to
> guarantee the accuracy of the time-travel operation as it relates to
> reflecting the state of the upstream database table at a given point in
> time. For example, say you have some downstream batch pipelines that want
> to perform some aggregations based on production database tables at a fixed
> point each day. In the case of lag or outage on the consumer-side, when the
> consumer restarts, we have a large gap in hudi commit time and are unable
> to time-travel to the exact moment that the downstream pipelines expect to
> reflect the database table state.
>
> If the hudi writer instead supported picking some field from the CDC record
> as the value for the hudi commit time, then the consumer could process the
> events at any time and the time-travel functionality would be the same
> regardless of consumption time. This would make the writer idempotent in a
> way that it currently lacks, guaranteeing consistent results for downstream
> pipelines.
>
> Original Slack Thread:
> https://apache-hudi.slack.com/archives/C4D716NPQ/p1690583690053259
>
