eugenegujing commented on issue #5144:
URL: https://github.com/apache/texera/issues/5144#issuecomment-4522531300

   Thank you @Yicong-Huang and @xuang7 for the help! Xuan's Problem 2 gave me a 
useful hint. Here is my update on this issue:
   
   Uploading the 179 GB `.h5` file succeeded only after I raised Part Size to 
5120 MiB (~36 parts total). Part sizes of 50 / 100 / 200 / 300 / 400 / 500 / 
1000 MiB all failed mid-upload.
   
   Here is my explanation of the problem. It might be incorrect, and discussion 
is welcome:
   
   I think it's a statistical failure tied to part count, not a hard "100 GB" 
threshold. Each XHR for a part has some small chance of hitting a transient 
timeout. Crucially, the frontend has no per-part retry: in 
`dataset.service.ts`, a single failed XHR makes the surrounding `mergeMap` 
abort the whole upload. If parts fail independently with probability p, an 
upload of n parts succeeds with probability (1 - p)^n. Assuming, purely for 
illustration, a 0.5% per-part failure rate for a 179 GB file, that means:
   
   - 50 MiB parts → 3666 parts → ~10⁻⁸ chance of full success → ~always fails
   - 1000 MiB parts → 184 parts → ~40% chance of success → sometimes works
   - 5120 MiB parts → 36 parts → ~83% chance of success → works for me
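   
   The numbers above can be reproduced with a small sketch. It assumes that 
parts fail independently with the same probability, that "179 GB" means 179 
GiB (which matches the part counts listed), and that 0.5% is only an 
illustrative rate, not a measured one. The function names are hypothetical 
and do not come from the Texera codebase.

```typescript
const MIB_PER_GIB = 1024;

// Number of parts needed to cover the file (the last part may be smaller).
function partCount(fileSizeGiB: number, partSizeMiB: number): number {
  return Math.ceil((fileSizeGiB * MIB_PER_GIB) / partSizeMiB);
}

// Probability that every part succeeds when there is no retry,
// assuming independent failures with a fixed per-part rate.
function successProbability(parts: number, perPartFailureRate: number): number {
  return Math.pow(1 - perPartFailureRate, parts);
}

const fileSizeGiB = 179; // assumption: "179 GB" treated as GiB
const failureRate = 0.005; // assumption: illustrative 0.5% per-part failure rate

for (const partSizeMiB of [50, 1000, 5120]) {
  const parts = partCount(fileSizeGiB, partSizeMiB);
  const p = successProbability(parts, failureRate);
  console.log(`${partSizeMiB} MiB -> ${parts} parts -> P(success) = ${p.toExponential(2)}`);
}
```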
   
   Re @xuang7's question 1: it should fail less often with smaller files purely 
because there are fewer parts. I'd guess 1 GB and 10 GB succeed reliably at 50 
MiB, 50 GB starts being flaky, 100 GB+ almost always fails.
   
   Re @xuang7's question 3: from the original screenshot, only specific parts 
time out (`partNumber=6, 90, 115, 160`), not all of them.
   
   Re @xuang7's reproduction with a sparse file: that probably hides the issue, 
because sparse-file reads are nearly free, so the client feeds chunks at a 
perfectly smooth rate. With a real `.h5` file there's actual disk I/O jitter, 
which makes some XHRs slower and more likely to hit one of the idle timeouts 
above.
   
   Re @Yicong-Huang's points: I don't think this is a LakeFS bug. LakeFS only 
sets up the upload session and finalizes the merge, so per-part timeouts are 
unlikely to originate there. I do think OS/env matters: client-side I/O jitter 
raises the per-part failure probability.
   
   Suggested fixes:
   
   1. **Add per-part retry with backoff** in `dataset.service.ts` around the 
inner observable — this alone would let 50 MiB defaults work reliably even for 
200 GB.
   2. **Dynamic part size based on file size.** In 
`dataset-detail.component.ts`, treat the admin-configured part size as a floor 
and auto-scale it up so the part count stays bounded (e.g. ≤ 200 parts).
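   
   A minimal sketch of both ideas, in plain TypeScript rather than the actual 
RxJS pipeline. `retryWithBackoff` and `effectivePartSizeMiB` are hypothetical 
names, and the defaults (3 retries, 1 s base delay, 200-part cap) are 
assumptions for illustration. In the real service an RxJS operator such as 
`retry` on the inner observable would probably be the more natural fit.

```typescript
// Hypothetical helper: run an async operation, retrying failures with
// exponential backoff (baseDelayMs, 2x, 4x, ...) up to maxRetries times.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxRetries: number = 3,
  baseDelayMs: number = 1000,
): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await operation();
    } catch (err) {
      // Give up once the retry budget is exhausted and surface the last error.
      if (attempt >= maxRetries) throw err;
      const delayMs = baseDelayMs * Math.pow(2, attempt);
      await new Promise<void>(resolve => setTimeout(resolve, delayMs));
      attempt++;
    }
  }
}

// Hypothetical helper: treat the configured part size as a floor and scale
// it up so that the file never needs more than maxParts parts.
function effectivePartSizeMiB(
  fileSizeMiB: number,
  configuredPartSizeMiB: number,
  maxParts: number = 200,
): number {
  const sizeNeededForCap = Math.ceil(fileSizeMiB / maxParts);
  return Math.max(configuredPartSizeMiB, sizeNeededForCap);
}
```

   For the 179 GiB file (183296 MiB) with a 50 MiB floor, 
`effectivePartSizeMiB(183296, 50)` gives 917 MiB, i.e. 200 parts, while a 
1 GiB file keeps the 50 MiB default. Under the same independence assumption as 
above, three retries would bring a 0.5% per-part failure rate down to roughly 
0.005^4 per part, though real transient failures may be correlated.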
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
