gengliangwang commented on code in PR #55637:
URL: https://github.com/apache/spark/pull/55637#discussion_r3211843519
##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java:
##########
@@ -33,8 +33,12 @@
* <ul>
* <li>{@code _change_type} (STRING) — the kind of change: {@code insert}, {@code delete},
* {@code update_preimage}, or {@code update_postimage}</li>
- * <li>{@code _commit_version} (connector-defined type, e.g. LONG) — the version containing
- * this change</li>
+ * <li>{@code _commit_version} — the commit version containing this change. Must be of
+ * an atomic orderable type (e.g. {@code LongType}, {@code StringType},
Review Comment:
Sorting by `_commit_version` is intentional: the batch netChanges path that
has already shipped has the same requirement. NetChanges relies on first/last
extraction across commits to evaluate the `(existedBefore, existsAfter)` SPIP
matrix, and `_commit_timestamp` is allowed to tie across commits (multiple
commits can share a microsecond). Using it as the order key would silently
corrupt the collapse. For example, an `insert` at v5 and a `delete` at v6 with
the same timestamp could be classified as either `(insert, delete) → cancel`
or `(delete, insert) → emit pre+post`, depending on tie-breaking.
`_commit_version` is the only monotonic identifier the CDC contract guarantees.

`_commit_version` is `@Evolving` in 4.2, so I'm happy to relax this if we find
a cleaner way to disambiguate same-timestamp commits in a later release.
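
To make the tie-breaking hazard concrete, here is a minimal standalone sketch.
It is not the Spark implementation: the `Change` class, `netEffect`, and the
two comparators are hypothetical names invented for illustration, and the
classification is a simplified reading of the `(existedBefore, existsAfter)`
matrix described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class NetChangeOrderingSketch {
    // Hypothetical, simplified change row (not Spark's row representation).
    static final class Change {
        final String changeType;
        final long commitVersion;
        final long commitTimestampMicros;

        Change(String changeType, long commitVersion, long commitTimestampMicros) {
            this.changeType = changeType;
            this.commitVersion = commitVersion;
            this.commitTimestampMicros = commitTimestampMicros;
        }
    }

    static final Comparator<Change> BY_VERSION =
        Comparator.comparingLong((Change c) -> c.commitVersion);
    static final Comparator<Change> BY_TIMESTAMP =
        Comparator.comparingLong((Change c) -> c.commitTimestampMicros);

    // Classify the net effect from the first and last change under `order`.
    static String netEffect(List<Change> changes, Comparator<Change> order) {
        List<Change> sorted = new ArrayList<>(changes);
        sorted.sort(order); // stable: ties keep their input order
        Change first = sorted.get(0);
        Change last = sorted.get(sorted.size() - 1);
        boolean existedBefore = !first.changeType.equals("insert");
        boolean existsAfter = !last.changeType.equals("delete");
        if (!existedBefore && !existsAfter) return "cancel";
        if (!existedBefore) return "insert";
        if (!existsAfter) return "delete";
        return "update_preimage+update_postimage";
    }

    public static void main(String[] args) {
        // insert at v5 and delete at v6 share a timestamp; the rows happen
        // to arrive with the later commit first.
        List<Change> rows = Arrays.asList(
            new Change("delete", 6L, 1000L),
            new Change("insert", 5L, 1000L));
        System.out.println(netEffect(rows, BY_VERSION));   // cancel
        System.out.println(netEffect(rows, BY_TIMESTAMP)); // update_preimage+update_postimage
    }
}
```

With the timestamp as the key, the result follows whatever order the tied rows
arrive in, so swapping the two input rows flips the second result to `cancel`.
The version key gives the same answer for either arrival order.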