viirya commented on code in PR #55776:
URL: https://github.com/apache/spark/pull/55776#discussion_r3220639571


##########
sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/Changelog.java:
##########
@@ -71,10 +71,21 @@
  * </ul>
  * <p>
  * Streaming reads support carry-over removal, update detection, and net change
- * computation. Net change collapses are kept in the state store keyed by row 
identity;
- * row identities only touched in the latest observed commit are held back 
until either a
- * later commit (with strictly greater `_commit_timestamp`) advances the 
global watermark
- * past them, or the source terminates.
+ * computation. Two streaming-specific behaviors to be aware of:
+ * <ul>
+ *   <li><b>Output is delayed by one commit.</b> When a micro-batch ingests a
+ *       commit, that commit's output rows are buffered and not emitted in the
+ *       same batch. They are emitted by the next micro-batch -- the one that
+ *       ingests the following commit. The last commit's output is emitted
+ *       when the source terminates.</li>
+ *   <li><b>netChanges only merges changes that are buffered together.</b> For

Review Comment:
   The condition in “For a typical CDC source that produces at most one change 
per row per commit ... the streaming output is the same as computeUpdates” is 
not sufficient. Producing at most one change per row per commit only rules out 
multiple changes for the same row within a single commit; it does not guarantee 
that the same row will not be touched again by a later commit before the older 
buffered output has been emitted. The implementation keeps the first and last 
event in CdcNetChangesStatefulProcessor until the event-time timer fires and 
clears the state, so changes from multiple commits can still be merged whenever 
they fall into the same buffered window. Consider rewording this to say that 
streaming netChanges behaves like computeUpdates only when each row identity 
appears in at most one commit within any buffered window.
   
   
   
   
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to