github-actions[bot] commented on code in PR #64301:
URL: https://github.com/apache/doris/pull/64301#discussion_r3380136352


##########
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingMultiTblTask.java:
##########
@@ -84,6 +85,8 @@ public class StreamingMultiTblTask extends 
AbstractStreamingTask {
     private long filteredRows = 0L;
     private long loadedRows = 0L;
     private long runningBackendId;
+    transient long lastScannedRows = -1;
+    transient long lastProgressMs = 0;
 

Review Comment:
   `runningStreamTask` is persisted in `StreamingInsertJob`, but 
`lastProgressMs` is transient and is not restored in `gsonPostProcess`. After 
FE restart or master failover, a RUNNING multi-table task can be replayed with 
`startTimeMs` restored and `lastProgressMs == 0`; then `processTimeoutTasks()` 
calls `isTimeout()`, and if `/api/getProgress` returns null or does not 
advance, `elapsed = now - 0` immediately exceeds the timeout and the task is 
failed/rescheduled. Please initialize this field after replay, or make 
`isTimeout` fall back to `startTimeMs` when `lastProgressMs` is unset, and add 
coverage for the replayed-running-task case.



##########
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingMultiTblTask.java:
##########
@@ -362,22 +366,65 @@ private String getFrontendAddress() {
     }
 
     public boolean isTimeout() {
+        return isTimeout(fetchProgress());
+    }

Review Comment:
   `StreamingInsertJob.processTimeoutTasks()` holds the job write lock while 
calling `runningMultiTask.isTimeout()`. With this change, every timeout check 
now performs a backend RPC and waits up to 
`streaming_cdc_light_rpc_timeout_sec` (default 90s) before returning. While 
that wait is in progress, `commitOffset()` and other paths needing the same 
write lock are blocked, so a slow or stuck cdc_client progress RPC can prevent 
a successful task callback from acquiring the lock and can make the task 
fail/reschedule despite actual progress. Please avoid doing this RPC under the 
job lock, for example by fetching progress before acquiring the write lock, 
using cached/asynchronous progress, or otherwise making the check 
non-blocking/very short while preserving the task identity check.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to