Also, do we have a concrete plan for how to handle tables that would be
upgraded to V4? What timestamp will we assign to existing rows?

On Wed, Jan 21, 2026 at 3:59 PM Anton Okolnychyi <[email protected]>
wrote:

> If we set aside temporal queries that need strict snapshot boundaries and
> can't be solved completely using row timestamps in the presence of
> mutations, you mentioned other use cases where row timestamps may be
> helpful, like TTL and auditing. We can debate whether using
> CURRENT_TIMESTAMP() is enough for them, but I don't really see a point
> given that we already have row lineage in V3 and the storage overhead for
> one more field isn't likely to be noticeable. One of the problems with
> CURRENT_TIMESTAMP() is that it requires action by the user. Having a
> reliable row timestamp populated automatically is likely to be better, so +1.
>
> On Fri, Jan 16, 2026 at 2:30 PM Steven Wu <[email protected]> wrote:
>
>> Joining with snapshot history also has significant complexity. It
>> requires retaining the entire snapshot history, probably with trimmed
>> snapshot metadata. There are concerns about the size of the snapshot
>> history for tables with frequent commits (like streaming ingestion). Do we
>> maintain the unbounded trimmed snapshot history in the same table
>> metadata, which could affect metadata.json size? Or do we store it
>> separately somewhere (like in the catalog), which would require the
>> complexity of a multi-entity transaction in the catalog?
>>
>>
>> On Fri, Jan 16, 2026 at 12:07 PM Russell Spitzer <
>> [email protected]> wrote:
>>
>>> I've gone back and forth on the inherited columns. I think the thing
>>> which keeps coming back to me is that I don't like that the only way to
>>> determine the timestamp associated with a row update/creation is to do a
>>> join back against table metadata. While that's doable, it feels
>>> user-unfriendly.
>>>
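>>> For context, the join looks roughly like the sketch below. This is
>>> illustrative only: it assumes Spark with Iceberg's snapshots metadata
>>> table and an existing SparkSession named `spark`, and whether a sequence
>>> number column is exposed on that table depends on the Iceberg version,
>>> so treat the join key as an assumption.
>>>
>>>   // Recover an approximate per-row timestamp by joining the reliable
>>>   // _last_updated_sequence_number metadata column to snapshot metadata.
>>>   // This only works while the corresponding snapshots are still retained.
>>>   var withTimestamps = spark.sql(
>>>       "SELECT t.*, s.committed_at AS last_updated_at "
>>>           + "FROM db.tbl t "
>>>           + "JOIN db.tbl.snapshots s "
>>>           + "  ON t._last_updated_sequence_number = s.sequence_number");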
>>>
>>>
>>> On Fri, Jan 16, 2026 at 11:54 AM Steven Wu <[email protected]> wrote:
>>>
>>>> Anton, you are right that row-level deletes will be a problem for some
>>>> of the mentioned use cases (like incremental processing). I have
>>>> clarified the applicability of some use cases to "tables with inserts
>>>> and updates only".
>>>>
>>>> Right now, we are only tracking the modification/commit time (not the
>>>> insertion time) in the case of updates.
>>>>
>>>> On Thu, Jan 15, 2026 at 6:33 PM Anton Okolnychyi <[email protected]>
>>>> wrote:
>>>>
>>>>> I think there is clear consensus that making snapshot timestamps
>>>>> strictly increasing is a positive thing. I am also +1.
>>>>>
>>>>> - How will row timestamps allow us to reliably implement incremental
>>>>> consumption independent of snapshot retention, given that rows can be
>>>>> added AND removed in a particular time frame? How can we capture all
>>>>> changes by just looking at the latest snapshot?
>>>>> - Some use cases in the doc need the insertion time and some need the
>>>>> last modification time. Do we plan to support both?
>>>>> - What do we expect the behavior to be in UPDATE and MERGE operations?
>>>>>
>>>>> To be clear: I am not opposed to this change, just want to make sure I
>>>>> understand all use cases that we aim to address and what would be required
>>>>> in engines.
>>>>>
>>>>> On Thu, Jan 15, 2026 at 5:01 PM Maninder Parmar <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> +1 for making commit timestamp assignment monotonic, since this
>>>>>> requirement has come up across multiple discussions like notifications,
>>>>>> multi-table transactions, time travel accuracy, and row timestamps. It
>>>>>> would be good to have a single consistent way to represent and assign
>>>>>> timestamps that could be leveraged across multiple features.
>>>>>>
>>>>>> On Thu, Jan 15, 2026 at 4:05 PM Ryan Blue <[email protected]> wrote:
>>>>>>
>>>>>>> Yeah, to add my perspective on that discussion, I think my primary
>>>>>>> concern is that people expect timestamps to be monotonic and if they 
>>>>>>> aren't
>>>>>>> then a `_last_update_timestamp` field just makes the problem worse. But 
>>>>>>> it
>>>>>>> is _nice_ to have row-level timestamps. So I would be okay if we revisit
>>>>>>> how we assign commit timestamps and improve it so that you get monotonic
>>>>>>> behavior.
>>>>>>>
>>>>>>> On Thu, Jan 15, 2026 at 2:23 PM Steven Wu <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We had an offline discussion with Ryan. I revised the proposal as
>>>>>>>> follows.
>>>>>>>>
>>>>>>>> 1. V4 would require writers to generate *monotonic* snapshot
>>>>>>>> timestamps. The proposal doc has a section that describes a recommended
>>>>>>>> implementation using Lamport timestamps (sketched below).
>>>>>>>> 2. Expose a *last_update_timestamp* metadata column that inherits
>>>>>>>> from the snapshot timestamp.
>>>>>>>>
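>>>>>>>> As a rough illustration of what "monotonic" means here (this is my
>>>>>>>> sketch, not the exact algorithm from the doc), a writer or catalog
>>>>>>>> could assign commit timestamps Lamport-style:
>>>>>>>>
>>>>>>>>   import java.util.concurrent.atomic.AtomicLong;
>>>>>>>>
>>>>>>>>   class MonotonicCommitClock {
>>>>>>>>     private final AtomicLong lastMillis = new AtomicLong(0L);
>>>>>>>>
>>>>>>>>     // Returns max(wall clock, previous + 1), so assigned timestamps
>>>>>>>>     // never repeat or move backwards even if the wall clock does.
>>>>>>>>     // A single authority (e.g. the catalog) would need to own this
>>>>>>>>     // clock to get table-wide monotonicity.
>>>>>>>>     long nextCommitTimestampMillis() {
>>>>>>>>       long wall = System.currentTimeMillis();
>>>>>>>>       return lastMillis.updateAndGet(prev -> Math.max(wall, prev + 1));
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>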
>>>>>>>> This is a relatively low-friction change that can fix the time
>>>>>>>> travel problem and enable use cases like latency tracking, temporal
>>>>>>>> queries, TTL, and auditing.
>>>>>>>>
>>>>>>>> There is no accuracy requirement on the timestamp values. In
>>>>>>>> practice, modern servers with NTP have pretty reliable wall clocks.
>>>>>>>> E.g., the Java library implements this validation
>>>>>>>> <https://github.com/apache/iceberg/blob/035e0fb39d2a949f6343552ade0a7d6c2967e0db/core/src/main/java/org/apache/iceberg/TableMetadata.java#L369-L377>
>>>>>>>> that protects against backward clock drift of up to one minute for
>>>>>>>> snapshot timestamps. I don't think we have heard many complaints of
>>>>>>>> commit failure due to that clock drift validation.
>>>>>>>>
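>>>>>>>> For readers who don't want to follow the link, the check is roughly
>>>>>>>> of this shape (an illustrative paraphrase, not the actual
>>>>>>>> TableMetadata code):
>>>>>>>>
>>>>>>>>   static final long ONE_MINUTE_MS = 60_000L;
>>>>>>>>
>>>>>>>>   // Reject a new snapshot whose timestamp is more than one minute
>>>>>>>>   // older than the table's last-updated timestamp.
>>>>>>>>   static void checkCommitTimestamp(long lastUpdatedMillis, long snapshotMillis) {
>>>>>>>>     if (snapshotMillis < lastUpdatedMillis - ONE_MINUTE_MS) {
>>>>>>>>       throw new IllegalArgumentException(
>>>>>>>>           "Snapshot timestamp is older than the last table update by more than one minute");
>>>>>>>>     }
>>>>>>>>   }
>>>>>>>>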
>>>>>>>> Would appreciate feedback on the revised proposal.
>>>>>>>>
>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Steven
>>>>>>>>
>>>>>>>> On Tue, Jan 13, 2026 at 8:40 PM Anton Okolnychyi <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Steven, I was referring to the fact that CURRENT_TIMESTAMP() is
>>>>>>>>> usually evaluated quite early in engines so we could theoretically 
>>>>>>>>> have
>>>>>>>>> another expression closer to the commit time. You are right, though, 
>>>>>>>>> it
>>>>>>>>> won't be the actual commit time given that we have to write it into 
>>>>>>>>> the
>>>>>>>>> files. Also, I don't think generating a timestamp for a row as it is 
>>>>>>>>> being
>>>>>>>>> written is going to be beneficial. To sum up, expression-based 
>>>>>>>>> defaults
>>>>>>>>> would allow us to capture the time the transaction or write starts, 
>>>>>>>>> but not
>>>>>>>>> the actual commit time.
>>>>>>>>>
>>>>>>>>> Russell, if the goal is to know what happened to the table in a
>>>>>>>>> given time frame, isn't the changelog scan the way to go? It would 
>>>>>>>>> assign
>>>>>>>>> commit ordinals based on lineage and include row-level diffs. How 
>>>>>>>>> would you
>>>>>>>>> be able to determine changes with row timestamps by just looking at 
>>>>>>>>> the
>>>>>>>>> latest snapshot?
>>>>>>>>>
>>>>>>>>> It does seem promising to make snapshot timestamps strictly
>>>>>>>>> increasing to avoid ambiguity during time travel.
>>>>>>>>>
>>>>>>>>> On Tue, Jan 13, 2026 at 4:33 PM Ryan Blue <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> > Whether or not "t" is an atomic clock time is not as important
>>>>>>>>>> as the query between time bounds making sense.
>>>>>>>>>>
>>>>>>>>>> I'm not sure I get it then. If we want monotonically increasing
>>>>>>>>>> times, but they don't have to be real times, then how do you know what
>>>>>>>>>> notion of "time" you care about for these filters? Or to put it 
>>>>>>>>>> another
>>>>>>>>>> way, how do you know that your "before" and "after" times are 
>>>>>>>>>> reasonable?
>>>>>>>>>> If the boundaries of these time queries can move around a bit, by 
>>>>>>>>>> how much?
>>>>>>>>>>
>>>>>>>>>> It seems to me that row IDs can play an important role here
>>>>>>>>>> because you have the order guarantee that we seem to want for this 
>>>>>>>>>> use
>>>>>>>>>> case: if snapshot A was committed before snapshot B, then the rows 
>>>>>>>>>> from A
>>>>>>>>>> have row IDs that are always less than the row IDs of B. The 
>>>>>>>>>> problem is
>>>>>>>>>> that we don't know where those row IDs start and end once A and B 
>>>>>>>>>> are no
>>>>>>>>>> longer tracked. Using a "timestamp" seems to work, but I still worry 
>>>>>>>>>> that
>>>>>>>>>> without reliable timestamps that correspond with some guarantee to 
>>>>>>>>>> real
>>>>>>>>>> timestamps, we are creating a feature that seems reliable but isn't.
>>>>>>>>>>
>>>>>>>>>> I'm somewhat open to the idea of introducing a snapshot timestamp
>>>>>>>>>> that the catalog guarantees is monotonically increasing. But if we 
>>>>>>>>>> did
>>>>>>>>>> that, wouldn't we still need to know the association between these
>>>>>>>>>> timestamps and snapshots after the snapshot metadata expires? My 
>>>>>>>>>> mental
>>>>>>>>>> model is that this would be used to look for data that arrived, say, 
>>>>>>>>>> 3
>>>>>>>>>> weeks ago on Dec 24th. Since the snapshots metadata is no longer 
>>>>>>>>>> around we
>>>>>>>>>> could use the row timestamp to find those rows. But how do we know 
>>>>>>>>>> that the
>>>>>>>>>> snapshot timestamps correspond to the actual timestamp range of Dec 
>>>>>>>>>> 24th?
>>>>>>>>>> Is it just "close enough" as long as we don't have out of order 
>>>>>>>>>> timestamps?
>>>>>>>>>> This is what I mean by needing to keep track of the association 
>>>>>>>>>> between
>>>>>>>>>> timestamps and snapshots after the metadata expires. Seems like you 
>>>>>>>>>> either
>>>>>>>>>> need to keep track of what the catalog's clock was for events you 
>>>>>>>>>> care
>>>>>>>>>> about, or you don't really care about exact timestamps.
>>>>>>>>>>
>>>>>>>>>> On Tue, Jan 13, 2026 at 2:22 PM Russell Spitzer <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> The key goal here is the ability to answer the question "what
>>>>>>>>>>> happened to the table in some time window (before < t < after)?"
>>>>>>>>>>> Whether or not "t" is an atomic clock time is not as important
>>>>>>>>>>> as the query between time bounds making sense.
>>>>>>>>>>> Downstream applications (from what I know) are mostly sensitive
>>>>>>>>>>> to getting discrete and well-defined answers to
>>>>>>>>>>> this question, like:
>>>>>>>>>>>
>>>>>>>>>>> 1 < t < 2 should be exclusive of
>>>>>>>>>>> 2 < t < 3 should be exclusive of
>>>>>>>>>>> 3 < t < 4
>>>>>>>>>>>
>>>>>>>>>>> And the union of these should be the same as the query asking
>>>>>>>>>>> for 1 < t < 4
>>>>>>>>>>>
>>>>>>>>>>> Currently this is not possible because we have no guarantee of
>>>>>>>>>>> ordering in our timestamps:
>>>>>>>>>>>
>>>>>>>>>>> Snapshots
>>>>>>>>>>> A -> B -> C
>>>>>>>>>>> Sequence numbers
>>>>>>>>>>> 50 -> 51 -> 52
>>>>>>>>>>> Timestamps
>>>>>>>>>>> 3 -> 1 -> 2
>>>>>>>>>>>
>>>>>>>>>>> This makes time travel always a little wrong to start with.
>>>>>>>>>>>
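>>>>>>>>>>> To make the window semantics concrete, this is the kind of query
>>>>>>>>>>> I mean. Illustrative only: it assumes an existing SparkSession
>>>>>>>>>>> named `spark` and a hypothetical inherited row timestamp column
>>>>>>>>>>> (_last_updated_ts here); the real column name and type are
>>>>>>>>>>> whatever the proposal settles on.
>>>>>>>>>>>
>>>>>>>>>>>   // With a monotonic, inherited row timestamp, these two adjacent
>>>>>>>>>>>   // half-open windows are disjoint and their union equals the
>>>>>>>>>>>   // single Jan 1 - Jan 3 window.
>>>>>>>>>>>   var day1 = spark.sql(
>>>>>>>>>>>       "SELECT * FROM db.tbl "
>>>>>>>>>>>           + "WHERE _last_updated_ts > TIMESTAMP '2026-01-01' "
>>>>>>>>>>>           + "  AND _last_updated_ts <= TIMESTAMP '2026-01-02'");
>>>>>>>>>>>   var day2 = spark.sql(
>>>>>>>>>>>       "SELECT * FROM db.tbl "
>>>>>>>>>>>           + "WHERE _last_updated_ts > TIMESTAMP '2026-01-02' "
>>>>>>>>>>>           + "  AND _last_updated_ts <= TIMESTAMP '2026-01-03'");
>>>>>>>>>>>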
>>>>>>>>>>> The Java implementation only allows one minute of negative time
>>>>>>>>>>> on commit, so we actually kind of do have this as a
>>>>>>>>>>> "light monotonicity" requirement, but as noted above there is no
>>>>>>>>>>> spec requirement for this. While we do have sequence
>>>>>>>>>>> number and row id, we still don't have a stable way of
>>>>>>>>>>> associating these with a consistent time in an engine-independent
>>>>>>>>>>> way.
>>>>>>>>>>>
>>>>>>>>>>> Ideally we just want to have one consistent way of answering the
>>>>>>>>>>> question "what did the table look like at time t", which I think
>>>>>>>>>>> we get by adding a new field that is a timestamp, set by the
>>>>>>>>>>> catalog close to commit time, that always goes up.
>>>>>>>>>>>
>>>>>>>>>>> I'm not sure we can really do this with an engine expression,
>>>>>>>>>>> since the engine won't know when the data is actually committed
>>>>>>>>>>> when writing files?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jan 13, 2026 at 3:35 PM Anton Okolnychyi <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This seems like a lot of new complexity in the format. I would
>>>>>>>>>>>> like us to explore whether we can build the considered use cases 
>>>>>>>>>>>> on top of
>>>>>>>>>>>> expression-based defaults instead.
>>>>>>>>>>>>
>>>>>>>>>>>> We already plan to support CURRENT_TIMESTAMP() and similar
>>>>>>>>>>>> functions that are part of the SQL standard definition for default 
>>>>>>>>>>>> values.
>>>>>>>>>>>> This would give us a way to know the relative row order. True,
>>>>>>>>>>>> this will usually represent the start of the operation. We may
>>>>>>>>>>>> define
>>>>>>>>>>>> COMMIT_TIMESTAMP() or a similar expression for the actual commit 
>>>>>>>>>>>> time, if
>>>>>>>>>>>> there are use cases that need that. Plus, we may explore an 
>>>>>>>>>>>> approach
>>>>>>>>>>>> similar to MySQL that allows users to reset the default value on 
>>>>>>>>>>>> update.
>>>>>>>>>>>>
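>>>>>>>>>>>> To sketch what this could look like (hypothetical DDL, since
>>>>>>>>>>>> expression-based defaults are not in the Iceberg spec yet; the
>>>>>>>>>>>> MySQL form in the comment is the existing analogue):
>>>>>>>>>>>>
>>>>>>>>>>>>   // Hypothetical syntax only, to illustrate the idea; assumes an
>>>>>>>>>>>>   // existing SparkSession named `spark`.
>>>>>>>>>>>>   spark.sql(
>>>>>>>>>>>>       "CREATE TABLE db.events ("
>>>>>>>>>>>>           + "  id BIGINT,"
>>>>>>>>>>>>           + "  payload STRING,"
>>>>>>>>>>>>           + "  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP()"
>>>>>>>>>>>>           + ") USING iceberg");
>>>>>>>>>>>>   // MySQL's analogue additionally resets the column on update:
>>>>>>>>>>>>   //   updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
>>>>>>>>>>>>   //     ON UPDATE CURRENT_TIMESTAMP
>>>>>>>>>>>>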
>>>>>>>>>>>> - Anton
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jan 13, 2026 at 11:04 AM Russell Spitzer <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think this is the right step forward. Our current
>>>>>>>>>>>>> "timestamp" definition is too ambiguous to be useful, so
>>>>>>>>>>>>> establishing a well-defined and monotonic timestamp could be
>>>>>>>>>>>>> really great. I also like the ability for rows to know this
>>>>>>>>>>>>> value without having to rely on snapshot information, which can
>>>>>>>>>>>>> be expired.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Jan 12, 2026 at 11:03 AM Steven Wu <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have revised the row timestamp proposal with the following
>>>>>>>>>>>>>> changes:
>>>>>>>>>>>>>> * a new commit_timestamp field in snapshot metadata with
>>>>>>>>>>>>>> nanosecond precision
>>>>>>>>>>>>>> * this optional field is only set by the REST catalog server
>>>>>>>>>>>>>> * it needs to be monotonic (e.g. implemented using a Lamport
>>>>>>>>>>>>>> timestamp)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?tab=t.0#heading=h.efdngoizchuh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 2:36 PM Steven Wu <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks for the clarification, Ryan.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> For long-running streaming jobs that commit periodically, it
>>>>>>>>>>>>>>> is difficult to establish a constant value of current_timestamp
>>>>>>>>>>>>>>> across all writer tasks for each commit cycle. I guess streaming
>>>>>>>>>>>>>>> writers may just need to write the wall-clock time when appending
>>>>>>>>>>>>>>> a row to a data file as the default value of current_timestamp.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:44 PM Ryan Blue <[email protected]>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I don't think that every row would have a different value.
>>>>>>>>>>>>>>>> That would be up to the engine, but I would expect engines to 
>>>>>>>>>>>>>>>> insert
>>>>>>>>>>>>>>>> `CURRENT_TIMESTAMP` into the plan and then replace it with a 
>>>>>>>>>>>>>>>> constant,
>>>>>>>>>>>>>>>> resulting in a consistent value for all rows.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You're right that this would not necessarily be the commit
>>>>>>>>>>>>>>>> time. But neither is the commit timestamp from Iceberg's 
>>>>>>>>>>>>>>>> snapshot. I'm not
>>>>>>>>>>>>>>>> sure how we are going to define "good enough" for this 
>>>>>>>>>>>>>>>> purpose. I think at
>>>>>>>>>>>>>>>> least `CURRENT_TIMESTAMP` has reliable and known behavior when 
>>>>>>>>>>>>>>>> you look at
>>>>>>>>>>>>>>>> how it is handled in engines. And if you want the Iceberg 
>>>>>>>>>>>>>>>> timestamp, then
>>>>>>>>>>>>>>>> use a periodic query of the snapshots table to keep track of 
>>>>>>>>>>>>>>>> them in a
>>>>>>>>>>>>>>>> table you can join to. I don't think this rises to the need 
>>>>>>>>>>>>>>>> for a table
>>>>>>>>>>>>>>>> feature unless we can guarantee that it is correct.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 1:19 PM Steven Wu <
>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > Postgres `current_timestamp` captures the
>>>>>>>>>>>>>>>>> transaction start time [1, 2]. Should we extend the same 
>>>>>>>>>>>>>>>>> semantic to
>>>>>>>>>>>>>>>>> Iceberg: all rows added in the same snapshot should have the 
>>>>>>>>>>>>>>>>> same timestamp
>>>>>>>>>>>>>>>>> value?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Let me clarify my last comment.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> created_at TIMESTAMP WITH TIME ZONE DEFAULT
>>>>>>>>>>>>>>>>> CURRENT_TIMESTAMP)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Since Postgres current_timestamp captures the transaction
>>>>>>>>>>>>>>>>> start time, all rows added in the same insert transaction 
>>>>>>>>>>>>>>>>> would have the
>>>>>>>>>>>>>>>>> same value as the transaction timestamp with the column 
>>>>>>>>>>>>>>>>> definition above.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> If we extend a similar semantic to Iceberg, all rows added
>>>>>>>>>>>>>>>>> in the same Iceberg transaction/snapshot should have the same 
>>>>>>>>>>>>>>>>> timestamp?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Ryan, regarding your comment about using the
>>>>>>>>>>>>>>>>> current_timestamp expression as a column default value: you
>>>>>>>>>>>>>>>>> were thinking that the engine would set the column value to
>>>>>>>>>>>>>>>>> the wall-clock time when appending a row to a data file,
>>>>>>>>>>>>>>>>> right? In that case, almost every row would have a different
>>>>>>>>>>>>>>>>> timestamp value.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Fri, Dec 12, 2025 at 10:26 AM Steven Wu <
>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The `current_timestamp` expression may not always carry the
>>>>>>>>>>>>>>>>>> right semantics for the use cases. E.g., latency tracking is
>>>>>>>>>>>>>>>>>> interested in when records are added / committed to the
>>>>>>>>>>>>>>>>>> table, not when the record was appended to an uncommitted
>>>>>>>>>>>>>>>>>> data file in the processing engine. Record creation and the
>>>>>>>>>>>>>>>>>> Iceberg commit can be minutes or even hours apart.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> A row timestamp inherited from the snapshot timestamp has no
>>>>>>>>>>>>>>>>>> overhead at the initial commit and very minimal storage
>>>>>>>>>>>>>>>>>> overhead during file rewrites. A per-row current_timestamp
>>>>>>>>>>>>>>>>>> would have distinct values for every row and therefore more
>>>>>>>>>>>>>>>>>> storage overhead.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> OLTP databases deal with small row-level transactions.
>>>>>>>>>>>>>>>>>> Postgres `current_timestamp` captures the transaction start 
>>>>>>>>>>>>>>>>>> time [1, 2].
>>>>>>>>>>>>>>>>>> Should we extend the same semantic to Iceberg: all rows 
>>>>>>>>>>>>>>>>>> added in the same
>>>>>>>>>>>>>>>>>> snapshot should have the same timestamp value?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>> https://www.postgresql.org/docs/current/functions-datetime.html
>>>>>>>>>>>>>>>>>> [2]
>>>>>>>>>>>>>>>>>> https://neon.com/postgresql/postgresql-date-functions/postgresql-current_timestamp
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 4:07 PM Micah Kornfield <
>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Micah, are 1 and 2 the same? 3 is covered by this
>>>>>>>>>>>>>>>>>>>> proposal.
>>>>>>>>>>>>>>>>>>>> To support the created_by timestamp, we would need to
>>>>>>>>>>>>>>>>>>>> implement the following row lineage behavior:
>>>>>>>>>>>>>>>>>>>> * Initially, it inherits from the snapshot timestamp.
>>>>>>>>>>>>>>>>>>>> * During rewrite (like compaction), it should be
>>>>>>>>>>>>>>>>>>>> persisted into data files.
>>>>>>>>>>>>>>>>>>>> * During update, it needs to be carried over from the
>>>>>>>>>>>>>>>>>>>> previous row. This is similar to the row_id carry-over for
>>>>>>>>>>>>>>>>>>>> row updates.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Sorry for the shorthand.  These are not the same:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> 1.  Insertion time - the time the row was inserted.
>>>>>>>>>>>>>>>>>>> 2.  Created by - the system that created the record.
>>>>>>>>>>>>>>>>>>> 3.  Updated by - the system that last updated the record.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Depending on the exact use case, these might or might not
>>>>>>>>>>>>>>>>>>> have utility.  I'm just wondering if there will be more
>>>>>>>>>>>>>>>>>>> examples like this in the future.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> created_by column would likely incur significantly higher
>>>>>>>>>>>>>>>>>>>> storage overhead compared to the updated_by column. As rows
>>>>>>>>>>>>>>>>>>>> are updated over time, the cardinality of this column in
>>>>>>>>>>>>>>>>>>>> data files can be high. Hence, the created_by column may
>>>>>>>>>>>>>>>>>>>> not compress well. This is a similar problem for the row_id
>>>>>>>>>>>>>>>>>>>> column. One side effect of enabling row lineage by default
>>>>>>>>>>>>>>>>>>>> for V3 tables is the storage overhead of the row_id column
>>>>>>>>>>>>>>>>>>>> after compaction, especially for narrow tables with few
>>>>>>>>>>>>>>>>>>>> columns.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I agree.  I think this analysis also shows that some
>>>>>>>>>>>>>>>>>>> consumers of Iceberg might not necessarily want to have all 
>>>>>>>>>>>>>>>>>>> these columns,
>>>>>>>>>>>>>>>>>>> so we might want to make them configurable, rather than 
>>>>>>>>>>>>>>>>>>> mandating them for
>>>>>>>>>>>>>>>>>>> all tables. Ryan's thought on default values seems like it 
>>>>>>>>>>>>>>>>>>> would solve the
>>>>>>>>>>>>>>>>>>> issues I was raising.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Micah
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Thu, Dec 11, 2025 at 3:47 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set 
>>>>>>>>>>>>>>>>>>>> the column value.
>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Since the use cases don't require an exact timestamp,
>>>>>>>>>>>>>>>>>>>> this seems like the best solution to get what people want 
>>>>>>>>>>>>>>>>>>>> (an insertion
>>>>>>>>>>>>>>>>>>>> timestamp) that has clear and well-defined behavior. Since
>>>>>>>>>>>>>>>>>>>> `current_timestamp` is defined by the SQL spec, it makes 
>>>>>>>>>>>>>>>>>>>> sense to me that
>>>>>>>>>>>>>>>>>>>> we could use it and have reasonable behavior.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I've talked with Anton about this before and maybe
>>>>>>>>>>>>>>>>>>>> he'll jump in on this thread. I think that we may need to 
>>>>>>>>>>>>>>>>>>>> extend default
>>>>>>>>>>>>>>>>>>>> values to include default value expressions, like 
>>>>>>>>>>>>>>>>>>>> `current_timestamp` that
>>>>>>>>>>>>>>>>>>>> is allowed by the SQL spec. That would solve the problem 
>>>>>>>>>>>>>>>>>>>> as well as some
>>>>>>>>>>>>>>>>>>>> others (like `current_date` or `current_user`) and would 
>>>>>>>>>>>>>>>>>>>> not create a
>>>>>>>>>>>>>>>>>>>> potentially misleading (and heavyweight) timestamp feature 
>>>>>>>>>>>>>>>>>>>> in the format.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> > Also some environments may have stronger clock
>>>>>>>>>>>>>>>>>>>> service, like Spanner TrueTime service.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Even in cases like this, commit retries can reorder
>>>>>>>>>>>>>>>>>>>> commits and make timestamps out of order. I don't think 
>>>>>>>>>>>>>>>>>>>> that we should be
>>>>>>>>>>>>>>>>>>>> making guarantees or even exposing metadata that people 
>>>>>>>>>>>>>>>>>>>> might mistake as
>>>>>>>>>>>>>>>>>>>> having those guarantees.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 2:22 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Ryan, thanks a lot for the feedback!
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Regarding the concern about reliable timestamps, we are
>>>>>>>>>>>>>>>>>>>>> not proposing to use timestamps for ordering. With NTP in
>>>>>>>>>>>>>>>>>>>>> modern computers, they are generally reliable enough for
>>>>>>>>>>>>>>>>>>>>> the intended use cases. Also, some environments may have a
>>>>>>>>>>>>>>>>>>>>> stronger clock service, like the Spanner TrueTime service
>>>>>>>>>>>>>>>>>>>>> <https://docs.cloud.google.com/spanner/docs/true-time-external-consistency>.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> >  joining to timestamps from the snapshots metadata
>>>>>>>>>>>>>>>>>>>>> table.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> As you also mentioned, it depends on the snapshot
>>>>>>>>>>>>>>>>>>>>> history, which is often retained for a few days due to 
>>>>>>>>>>>>>>>>>>>>> performance reasons.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> > embedding a timestamp in DML (like
>>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit 
>>>>>>>>>>>>>>>>>>>>> one from table
>>>>>>>>>>>>>>>>>>>>> metadata.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> An explicit timestamp column adds more burden to
>>>>>>>>>>>>>>>>>>>>> application developers. While some databases require an 
>>>>>>>>>>>>>>>>>>>>> explicit column in
>>>>>>>>>>>>>>>>>>>>> the schema, those databases provide triggers to auto set 
>>>>>>>>>>>>>>>>>>>>> the column value.
>>>>>>>>>>>>>>>>>>>>> For Iceberg, the snapshot timestamp is the closest to the 
>>>>>>>>>>>>>>>>>>>>> trigger timestamp.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Also, a timestamp set during computation (like streaming
>>>>>>>>>>>>>>>>>>>>> ingestion or a relatively long batch computation) doesn't
>>>>>>>>>>>>>>>>>>>>> capture the time the rows/files are added to the Iceberg
>>>>>>>>>>>>>>>>>>>>> table in a batch fashion.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> > And for those use cases, you could also keep a
>>>>>>>>>>>>>>>>>>>>> longer history of snapshot timestamps, like storing a 
>>>>>>>>>>>>>>>>>>>>> catalog's event log
>>>>>>>>>>>>>>>>>>>>> for long-term access to timestamp info
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> This is not really consumable by joining a regular table
>>>>>>>>>>>>>>>>>>>>> query with the catalog event log. I would also imagine the
>>>>>>>>>>>>>>>>>>>>> catalog event log is capped at a shorter retention (maybe
>>>>>>>>>>>>>>>>>>>>> a few months) compared to data retention (which could be a
>>>>>>>>>>>>>>>>>>>>> few years).
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Tue, Dec 9, 2025 at 1:32 PM Ryan Blue <
>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> I don't think it is a good idea to expose timestamps
>>>>>>>>>>>>>>>>>>>>>> at the row level. Timestamps in metadata that would be 
>>>>>>>>>>>>>>>>>>>>>> carried down to the
>>>>>>>>>>>>>>>>>>>>>> row level already confuse people that expect them to be 
>>>>>>>>>>>>>>>>>>>>>> useful or reliable,
>>>>>>>>>>>>>>>>>>>>>> rather than for debugging. I think extending this to the 
>>>>>>>>>>>>>>>>>>>>>> row level would
>>>>>>>>>>>>>>>>>>>>>> only make the problem worse.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> You can already get this information by projecting
>>>>>>>>>>>>>>>>>>>>>> the last updated sequence number, which is reliable, and 
>>>>>>>>>>>>>>>>>>>>>> joining to
>>>>>>>>>>>>>>>>>>>>>> timestamps from the snapshots metadata table. Of course, 
>>>>>>>>>>>>>>>>>>>>>> the drawback there
>>>>>>>>>>>>>>>>>>>>>> is losing the timestamp information when snapshots 
>>>>>>>>>>>>>>>>>>>>>> expire, but since it
>>>>>>>>>>>>>>>>>>>>>> isn't reliable anyway I'd be fine with that.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Some of the use cases, like auditing and compliance,
>>>>>>>>>>>>>>>>>>>>>> are probably better served by embedding a timestamp in 
>>>>>>>>>>>>>>>>>>>>>> DML (like
>>>>>>>>>>>>>>>>>>>>>> `current_timestamp`) rather than relying on an implicit 
>>>>>>>>>>>>>>>>>>>>>> one from table
>>>>>>>>>>>>>>>>>>>>>> metadata. And for those use cases, you could also keep a 
>>>>>>>>>>>>>>>>>>>>>> longer history of
>>>>>>>>>>>>>>>>>>>>>> snapshot timestamps, like storing a catalog's event log 
>>>>>>>>>>>>>>>>>>>>>> for long-term
>>>>>>>>>>>>>>>>>>>>>> access to timestamp info. I think that would be better 
>>>>>>>>>>>>>>>>>>>>>> than storing it at
>>>>>>>>>>>>>>>>>>>>>> the row level.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Mon, Dec 8, 2025 at 3:46 PM Steven Wu <
>>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> For V4 spec, I have a small proposal [1] to expose
>>>>>>>>>>>>>>>>>>>>>>> the row timestamp concept that can help with many use 
>>>>>>>>>>>>>>>>>>>>>>> cases like temporal
>>>>>>>>>>>>>>>>>>>>>>> queries, latency tracking, TTL, auditing and compliance.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> This *_last_updated_timestamp_ms* metadata column
>>>>>>>>>>>>>>>>>>>>>>> behaves very similarly to the
>>>>>>>>>>>>>>>>>>>>>>> *_last_updated_sequence_number* column for row lineage.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>    - Initially, it inherits from the snapshot
>>>>>>>>>>>>>>>>>>>>>>>    timestamp.
>>>>>>>>>>>>>>>>>>>>>>>    - During rewrite (like compaction), its values
>>>>>>>>>>>>>>>>>>>>>>>    are persisted in the data files.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Would love to hear what you think.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Steven
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> [1]
>>>>>>>>>>>>>>>>>>>>>>> https://docs.google.com/document/d/1cXr_RwEO6o66S8vR7k3NM8-bJ9tH2rkh4vSdMXNC8J8/edit?usp=sharing
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>>
