Hello Devs,
We are looking into adding workflows that read data
incrementally based on commit time: the ability to read deltas between
start/end commit timestamps on a table, and the ability to resume reading
from the last read end timestamp. For that, we need timestamps to be
linear in the current active snapshot history (newer versions always have
higher timestamps). Although the Iceberg commit flow ensures versions are
newer, there isn't a check to ensure timestamps are linear.

Example flow: if two clients (clientA and clientB) whose time-clocks are
slightly off (say by a couple of seconds) are committing frequently,
clientB might get to commit after clientA even though its new snapshot's
timestamp is out of order. I might be wrong, but I haven't found a check
in HadoopTableOperations.commit() to ensure this case does not happen.
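To make the scenario concrete, here is a minimal sketch of the kind of guard I have in mind. This is not actual Iceberg code; the class and method names are hypothetical. The idea is just that a commit path could reject a snapshot whose timestamp is not strictly newer than the current snapshot's:

```java
// Hypothetical sketch only -- illustrates a timestamp-monotonicity check
// a commit path could perform. Not an existing Iceberg API.
public class MonotonicCommitCheck {
    private long currentSnapshotTimestampMillis = Long.MIN_VALUE;

    /** Accept the commit only if its timestamp is newer than the current snapshot's. */
    public synchronized boolean tryCommit(long newSnapshotTimestampMillis) {
        if (newSnapshotTimestampMillis <= currentSnapshotTimestampMillis) {
            return false; // reject: would make the snapshot history non-linear
        }
        currentSnapshotTimestampMillis = newSnapshotTimestampMillis;
        return true;
    }

    public static void main(String[] args) {
        MonotonicCommitCheck ops = new MonotonicCommitCheck();
        // clientA commits with its local clock at t=1000
        System.out.println(ops.tryCommit(1000)); // true
        // clientB's clock is ~2s behind; its commit lands later but is rejected
        System.out.println(ops.tryCommit(998));  // false
    }
}
```

With a check like this, clientB's out-of-order commit fails fast instead of silently corrupting the incremental-read ordering, though as noted below it trades away some commit throughput.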

On the other hand, restricting commits due to out-of-order timestamps can
hurt commit throughput, so I can see why this isn't something Iceberg
might want to enforce based on System.currentTimeMillis(). However, if
clients had a way to define their own globally synchronized timestamps
(using an external service or a monotonically increasing UUID), then
Iceberg could expose an API to set that on the snapshot, or use it
instead of System.currentTimeMillis(). Iceberg does something similar
with sequence numbers in the v2 format to track deletes and appends.
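Roughly what I'm imagining is a pluggable clock interface; this is a hypothetical sketch (the `SnapshotClock` interface and class names are my invention, not an Iceberg API). An implementation could call out to an external timestamp service, or at minimum guarantee monotonicity even when the local wall clock steps backwards:

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a pluggable clock clients could supply so snapshot
// timestamps come from a controlled source rather than raw
// System.currentTimeMillis(). Not an existing Iceberg interface.
public class SnapshotClockSketch {

    /** Source of snapshot timestamps; an implementation could call an external service. */
    interface SnapshotClock {
        long nextTimestampMillis();
    }

    /** Example: strictly increasing even if the wall clock steps backwards. */
    static class MonotonicClock implements SnapshotClock {
        private final AtomicLong last = new AtomicLong(Long.MIN_VALUE);

        @Override
        public long nextTimestampMillis() {
            long wall = System.currentTimeMillis();
            // Never hand out a value <= the previously issued one.
            return last.updateAndGet(prev -> Math.max(prev + 1, wall));
        }
    }

    public static void main(String[] args) {
        SnapshotClock clock = new MonotonicClock();
        long t1 = clock.nextTimestampMillis();
        long t2 = clock.nextTimestampMillis();
        System.out.println(t2 > t1); // strictly increasing by construction
    }
}
```

A single-process clock like this obviously doesn't solve cross-client skew by itself; the point is only that if the timestamp source were an injection point, clients could plug in a globally synchronized one.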

Is this a concern others have? If so, how are folks handling it today, or
are they not exposing such a feature at all due to the inherent
distributed-timing problem? I'd like to hear how others are thinking
about this. Thoughts?

Cheers,

-Gautam.
