Hi (sorry for the late reply),

The commit time can be a logical time as well; a lot of tests work this way. Some table features (e.g., time-based cleaning) may not work with logical commit times, but those are mostly convenience features anyway.
I assume the consumer would process all events at the required source timestamp boundary to achieve this? I am happy to chat/help scope the changes more.

On Wed, Aug 2, 2023 at 1:17 PM Joseph Thaidigsman
<jthaidigs...@slack-corp.com.invalid> wrote:

> Hello,
>
> We have a use-case where we have persisted the full CDC changelog for some
> tables in s3 and want to be able to bootstrap hudi tables with the
> changelog data and then be able to time-travel the hudi table to get
> snapshot views of the table on dates prior to bootstrapping. In our
> changelog, we have the timestamp associated with the
> inserts/updates/deletes, so the data to achieve this is present. If we had
> a live consumer processing those events in real-time and writing them to a
> hudi table, then we would be able to achieve this, but because we are
> instead creating the hudi table from a single batch job, we are unable to
> achieve it despite processing the same exact data, since time-travel is all
> based on the hudi commit time.
>
> Aside from our specific use-case for bootstrapping tables, this would be
> useful for real-time CDC consumers as well. Currently, there is no way to
> guarantee the accuracy of the time-travel operation as it relates to
> reflecting the state of the upstream database table at a given point in
> time. For example, say you have some downstream batch pipelines that want
> to perform some aggregations based on production database tables at a fixed
> point each day. In the case of lag or outage on the consumer-side, when the
> consumer restarts, we have a large gap in hudi commit time and are unable
> to time-travel to the exact moment that the downstream pipelines expect to
> reflect the database table state.
>
> If the hudi writer instead supported picking some field from the CDC record
> as the value for the hudi commit time, then the consumer could process the
> events at any time and the time-travel functionality would be the same
> regardless of consumption time. This would make the writer idempotent in a
> way that it currently lacks, guaranteeing consistent results for downstream
> pipelines.
>
> Original Slack Thread:
> https://apache-hudi.slack.com/archives/C4D716NPQ/p1690583690053259
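
To make the "source timestamp boundary" idea a bit more concrete, here is a rough, Hudi-independent sketch in plain Python. It assumes a daily boundary and uses the boundary itself as the logical commit time; the event layout and the helper names (boundary_after, build_commits, snapshot_as_of) are made up for illustration and are not Hudi APIs.

```python
from datetime import datetime, timedelta


def boundary_after(ts: datetime) -> datetime:
    """Return the first midnight strictly after ts (assumed daily boundary)."""
    day_start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return day_start + timedelta(days=1)


def build_commits(events):
    """Group CDC events by source-timestamp boundary.

    Each group stands in for one commit whose logical commit time is the
    boundary, no matter when the batch job actually runs.
    """
    commits = {}
    for ev in sorted(events, key=lambda e: e["ts"]):
        commits.setdefault(boundary_after(ev["ts"]), []).append(ev)
    return commits


def snapshot_as_of(commits, as_of: datetime):
    """Replay every commit whose logical commit time is <= as_of."""
    table = {}
    for commit_time in sorted(commits):
        if commit_time > as_of:
            break
        for ev in commits[commit_time]:
            if ev["op"] == "delete":
                table.pop(ev["key"], None)
            else:  # insert or update
                table[ev["key"]] = ev["value"]
    return table


# Hypothetical changelog records with their source timestamps.
events = [
    {"ts": datetime(2023, 7, 30, 9, 0), "op": "insert", "key": 1, "value": "a"},
    {"ts": datetime(2023, 7, 30, 17, 0), "op": "update", "key": 1, "value": "b"},
    {"ts": datetime(2023, 7, 31, 8, 0), "op": "insert", "key": 2, "value": "c"},
    {"ts": datetime(2023, 7, 31, 20, 0), "op": "delete", "key": 1, "value": None},
]

commits = build_commits(events)
end_of_jul_30 = snapshot_as_of(commits, datetime(2023, 7, 31))
end_of_jul_31 = snapshot_as_of(commits, datetime(2023, 8, 1))
```

Because the commit times here come only from the source timestamps, rerunning the batch later gives the same snapshots, which is the property the original request is after. How this would map onto actual Hudi writer changes is the part that still needs scoping.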