gengliangwang commented on code in PR #55637:
URL: https://github.com/apache/spark/pull/55637#discussion_r3211843519


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java:
##########
@@ -33,8 +33,12 @@
  * <ul>
  *   <li>{@code _change_type} (STRING) — the kind of change: {@code insert}, {@code delete},
  *       {@code update_preimage}, or {@code update_postimage}</li>
- *   <li>{@code _commit_version} (connector-defined type, e.g. LONG) — the version containing
- *       this change</li>
+ *   <li>{@code _commit_version} — the commit version containing this change. Must be of
+ *       an atomic orderable type (e.g. {@code LongType}, {@code StringType},

Review Comment:
   Sorting by `_commit_version` is intentional: the batch netChanges path that 
already shipped has the same requirement. NetChanges relies on first/last 
extraction across commits to evaluate the `(existedBefore, existsAfter)` SPIP 
matrix, and `_commit_timestamp` is allowed to tie across commits (multiple 
commits can share a microsecond). Using the timestamp as the order key would 
silently corrupt the collapse. For example, an `insert` at v5 and a `delete` at 
v6 with the same timestamp could classify as `(insert, delete) → cancel` or 
`(delete, insert) → emit pre+post` depending on tie-breaking. `_commit_version` 
is the only monotonic identifier the CDC contract guarantees.
   
   `_commit_version` is `@Evolving` in 4.2, so I'm happy to relax this if we 
find a cleaner way to disambiguate same-timestamp commits in a later release.
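
   To make the tie-breaking hazard concrete, here is a minimal standalone 
sketch. It is not the netChanges implementation: `NetChangeSketch`, `Change`, 
`collapse`, and the result strings are hypothetical names chosen for 
illustration, and the collapse is reduced to the first/last rule described 
above for a single row key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative only: hypothetical names, not Spark's netChanges code.
final class NetChangeSketch {
    // One change row for a single key; fields mirror the CDC metadata columns.
    static final class Change {
        final String changeType;
        final long commitVersion;
        final long commitTimestampMicros;

        Change(String changeType, long commitVersion, long commitTimestampMicros) {
            this.changeType = changeType;
            this.commitVersion = commitVersion;
            this.commitTimestampMicros = commitTimestampMicros;
        }
    }

    static final Comparator<Change> BY_VERSION =
        Comparator.comparingLong(c -> c.commitVersion);
    static final Comparator<Change> BY_TIMESTAMP =
        Comparator.comparingLong(c -> c.commitTimestampMicros);

    // Classify a key's changes by looking only at the first and last change
    // in the given order, i.e. the (existedBefore, existsAfter) pair.
    static String collapse(List<Change> changes, Comparator<Change> order) {
        List<Change> sorted = new ArrayList<>(changes);
        sorted.sort(order); // stable sort: tied rows keep their arrival order
        boolean existedBefore = !sorted.get(0).changeType.equals("insert");
        boolean existsAfter =
            !sorted.get(sorted.size() - 1).changeType.equals("delete");
        if (!existedBefore && !existsAfter) return "cancel";
        if (existedBefore && existsAfter) return "emit pre+post";
        return existsAfter ? "emit insert" : "emit delete";
    }

    public static void main(String[] args) {
        Change insertV5 = new Change("insert", 5L, 1_000L);
        Change deleteV6 = new Change("delete", 6L, 1_000L); // same microsecond
        List<Change> arrivalA = Arrays.asList(insertV5, deleteV6);
        List<Change> arrivalB = Arrays.asList(deleteV6, insertV5);

        // Ordering by version gives the same answer for both arrival orders.
        System.out.println(collapse(arrivalA, BY_VERSION));   // cancel
        System.out.println(collapse(arrivalB, BY_VERSION));   // cancel
        // Ordering by timestamp depends on how the tie happens to break.
        System.out.println(collapse(arrivalA, BY_TIMESTAMP)); // cancel
        System.out.println(collapse(arrivalB, BY_TIMESTAMP)); // emit pre+post
    }
}
```

   With the timestamp as the key, the result depends on the order in which the 
two tied rows reach the sort, which a distributed shuffle does not fix. The 
version key gives the same classification either way.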



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

