eugenegujing commented on PR #5643: URL: https://github.com/apache/texera/pull/5643#issuecomment-4701687366
@Yicong-Huang Thank you for the detailed review. It was really helpful. I've tried to address all three points in the latest commit. Here's what I did, please let me know if I've misunderstood anything. **On combining the steps in a transaction (comments 1 & 3):** I've wrapped Path 1 so that, for each session, the row delete and the multipart abort run in one `SqlServer.withTransaction` block, with the delete staged first. My thinking is that if the abort then fails, the transaction rolls back and the row survives, so we shouldn't end up in the partial state you described (record gone but the LakeFS entry left behind), so it'd just be retried next round. Please correct me if there's a case I'm missing here. One thing I wasn't sure how to fully solve: since LakeFS is an external service, I don't think it can really take part in a Postgres transaction (a DB rollback can't undo an abort that already went through), so I couldn't get true all-or-nothing across both systems. What I did instead was lean on ordering + idempotency. The DB write is gated on the abort, and the abort seems to be idempotent (re-aborting an already-aborted upload returns 404, which we treat as success), so in my understanding a successful abort followed by a later failure should self-heal on the next round. I also kept the transaction per-session rather than per-round, mainly so we don't hold a DB connection open across many LakeFS calls, but I'm happy to change the granularity if you'd prefer. For Path 2 (resetting uncommitted objects) there's no DB write, so I didn't see anything to wrap there; it also runs as an independent sweep that doesn't rely on the session records, which I think gives us a second chance t o clean any orphaned object even if its row is already gone. **On testing the different step failures (comment 2):** I added tests for each case you mentioned, checking the result is either rolled back or cleaned next round: no DB record (orphan / null-`repository_name` session), no file found (already-aborted 404 and repo-level 404), and a timeout / generic error (non-404 abort failure → delete rolled back, row survives, retried). I also added one for "cleaned on the next round" (a transient failure self-healing) and one for failure isolation (a failing item not blocking a healthy one in the same round). The only one I couldn't reproduce deterministically is the per-object 404 in Path 2. It's handled the same way in code, but I couldn't trigger it in a container test without faking the client, so I left a comment pointing to the equivalent repo-level 404 test instead. Let me know if you'd like me to cover that differently. Thanks again for that and I'm happy to adjust any of this. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
