gengliangwang commented on code in PR #55776:
URL: https://github.com/apache/spark/pull/55776#discussion_r3221031507


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java:
##########
@@ -71,10 +71,21 @@
  * </ul>
  * <p>
  * Streaming reads support carry-over removal, update detection, and net change
- * computation. Net change collapses are kept in the state store keyed by row 
identity;
- * row identities only touched in the latest observed commit are held back 
until either a
- * later commit (with strictly greater `_commit_timestamp`) advances the 
global watermark
- * past them, or the source terminates.
+ * computation. Two streaming-specific behaviors to be aware of:
+ * <ul>
+ *   <li><b>Output is delayed by one commit.</b> When a micro-batch ingests a

Review Comment:
   Good catch -- updated in 0be721e. Replaced "delayed by one commit" / "by the 
next micro-batch" with "buffered until the watermark advances past the commit" 
/ "by a later micro-batch -- whichever one advances the watermark past the 
commit's `_commit_timestamp`". This anchors the lag on watermark progression 
rather than commit count, so the multiple-distinct-commits-per-batch case is no 
longer implicitly excluded.



##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java:
##########
@@ -71,10 +71,21 @@
  * </ul>
  * <p>
  * Streaming reads support carry-over removal, update detection, and net change
- * computation. Net change collapses are kept in the state store keyed by row 
identity;
- * row identities only touched in the latest observed commit are held back 
until either a
- * later commit (with strictly greater `_commit_timestamp`) advances the 
global watermark
- * past them, or the source terminates.
+ * computation. Two streaming-specific behaviors to be aware of:
+ * <ul>
+ *   <li><b>Output is delayed by one commit.</b> When a micro-batch ingests a
+ *       commit, that commit's output rows are buffered and not emitted in the
+ *       same batch. They are emitted by the next micro-batch -- the one that
+ *       ingests the following commit. The last commit's output is emitted
+ *       when the source terminates.</li>
+ *   <li><b>netChanges only merges changes that are buffered together.</b> For

Review Comment:
   Adopted your framing in 0be721e -- the bullet now says "When each row 
identity appears in at most one commit within any buffered window, the 
streaming output is the same as `computeUpdates`." The condition is now on 
row-identity occurrences within the buffer window rather than on changes within 
a single commit, which matches what `CdcNetChangesStatefulProcessor` actually 
does. Thanks.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to