Yicong-Huang opened a new issue, #4679:
URL: https://github.com/apache/texera/issues/4679

   ### Task Summary
   
   Two e2e specs — `DataProcessingSpec` and `ReconfigurationSpec` — sleep 
1000ms after the workflow reaches `COMPLETED` and before reading the result 
document. The sleep is annotated with a TODO that calls it a workaround for 
"the issue of reporting `completed` status too early":
   
   ```scala
   .map(terminalOpId => {
     //TODO: remove the delay after fixing the issue of reporting "completed" 
status too early.
     Thread.sleep(1000)
     val uri = getResultUriByLogicalPortId(...)
     ...
   })
   ```
   
   The sleep no longer matches the engine's behavior. Tracing the path that 
emits `COMPLETED`:
   
   1. `DataProcessor.outputOneTuple` handles a `FinalizePort(portId, 
input=false)` by calling 
`outputManager.closeOutputStorageWriterIfNeeded(portId)`.
   2. `closeOutputStorageWriterIfNeeded` puts a terminate signal on the writer 
thread's queue **and `join()`s the thread**.
   3. The writer thread, on terminate, calls `IcebergTableWriter.close()` → 
`flushBuffer()` → `table.newAppend().appendFile(dataFile).commit()` 
(synchronous).
   4. Only after the join returns does `DataProcessor` fire `portCompleted` to 
the controller.
   5. After all ports complete and the worker fires `workerExecutionCompleted`, 
the controller checks workflow-terminal and only then sends 
`ExecutionStateUpdate(COMPLETED)` to the client.
   6. On the read side, `IcebergDocument.get()` calls `seekToUsableFile()`, 
which `table.refresh()`es before scanning.
   
   Every step is synchronous or blocking on the writer; there is no observable 
race for the test reader to hit. The sleep is leftover from an earlier engine 
version where commit was not synchronous.
   
   ### Local validation
   
   Removed the sleep from both files and ran the two specs together 5×:
   
   | run | tests | wall-clock |
   |---|---|---|
   | 1 | 21/21 ✅ | 39.2 s |
   | 2 | 21/21 ✅ | 39.5 s |
   | 3 | 21/21 ✅ | 43.2 s |
   | 4 | 21/21 ✅ | 37.3 s |
   | 5 | 21/21 ✅ | 36.9 s |
   
   Compared with the same two specs on the same machine before the change:
   
   | spec | before | after | delta |
   |---|---|---|---|
   | DataProcessingSpec  | 36.5 s / 16 tests | 16.9 s / 16 tests | -19.6 s |
   | ReconfigurationSpec | 27.2 s / 5 tests  | 18.7 s / 5 tests  | -8.5 s  |
   
   The 16-test DataProcessingSpec saves more than 16 × 1 s because the sleep 
was inside a per-terminal-port `.map(...)` block — workflows with multiple 
terminal ports paid the cost more than once per test.
   
   ### Out of scope
   
   - The deeper "premature COMPLETED" claim from the TODO is not reproduced by 
today's engine code path. If a real flake recurs, this issue can be re-opened 
with a stack trace; in the meantime, the sleep is dead workaround code.
   - A separate cleanup target is the `withRetry { super.withFixture(test) }` 
pattern these same two specs (and `PauseSpec`) carry without any test being 
tagged `Retryable` — currently a no-op.
   
   ### Task Type
   
   - [x] Refactor / Cleanup
   - [x] Testing / QA
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to