GitHub user xuang7 created a discussion: Improve resumable upload: track 
completion at the batch/session level

### Context
During recent large-dataset upload testing, we ran into an issue that may be 
worth discussing as a group, since the solution may affect both the upload and 
dataset commit behavior.

A user uploaded ~1300 files (~1.3 TB total) into a dataset. After the upload 
"finished", only ~1200 files were uploaded; ~100 failed mid-process.

This exposed two usability issues:
1. No visibility into what failed: There's no clear report of which files are 
missing, so the user can't tell what still needs uploading.
2. Re-dragging the entire folder wastes huge amounts of time and bandwidth: 
Resumability today is file-level only. When the user re-drops the same 
directory:
   - the ~100 failed files resume correctly, but
   - the ~1200 already-uploaded files are treated as new uploads and re-sent in 
full.

Currently, resumability has no batch/session-level state. We do not track which 
files in the batch have already completed, so re-dropping the same directory 
cannot distinguish "already uploaded" files from "new" files.
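
For illustration only, batch-level state could be as small as one status per file path. The sketch below is hypothetical: `UploadBatch`, `FileStatus`, and their methods are names invented here, not existing Texera code, and persistence is left out entirely.

```python
from dataclasses import dataclass, field
from enum import Enum


class FileStatus(Enum):
    PENDING = "pending"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class UploadBatch:
    """Hypothetical batch-level record: one status per relative file path."""

    statuses: dict[str, FileStatus] = field(default_factory=dict)

    def add(self, path: str) -> None:
        # Register a file without overwriting a status recorded earlier,
        # so re-dropping the same folder keeps completed files completed.
        self.statuses.setdefault(path, FileStatus.PENDING)

    def mark(self, path: str, status: FileStatus) -> None:
        self.statuses[path] = status

    def remaining(self) -> list[str]:
        # Everything that is not completed still has to be sent or resumed.
        return sorted(
            p for p, s in self.statuses.items() if s is not FileStatus.COMPLETED
        )

    def failed(self) -> list[str]:
        # The list a failure report would show to the user.
        return sorted(p for p, s in self.statuses.items() if s is FileStatus.FAILED)
```

With a record like this, `failed()` would answer "which files are missing" and `remaining()` would tell a re-drop which files to send, assuming the record outlives the individual upload sessions.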

**There are two categories of files that get uploaded again today:**
- Committed files: These files already exist in a dataset version, but the 
upload path does not check for them before uploading. They are re-uploaded into 
LakeFS staging and deduplicated later during createDatasetVersion, which diffs 
the branch. Because LakeFS is content-addressed, the re-sent bytes match the 
committed checksum and do not appear in the final diff, so they are not 
committed again.
- Uncommitted files: These files finished uploading but have not yet been 
committed. The upload session row is deleted after finishMultipartUpload, so 
nothing records that the file completed, and a re-drop treats it as new.

### How other systems handle this
In some other platforms, when uploading a folder that contains files already 
uploaded before, the system detects the existing files and prompts the user 
with options such as:
- Replace existing files (upload becomes a new version)
- Keep existing files (for example by creating data(1).txt)
- Skip already uploaded files

<img width="500" alt="refer" 
src="https://github.com/user-attachments/assets/e5d50559-109a-439b-9285-643f5a33b340"
 />

However, some platforms still seem to re-upload the existing files instead of 
truly skipping them, which is effectively what we do today.

### Proposed Solution
Move existence detection to the beginning of the upload flow instead of relying 
on commit-time deduplication.
1. Before uploading, check each file against both states, using a lightweight 
existence check such as path and file size:
   - Committed in the target dataset (LakeFS object listing)
   - Uncommitted on the branch (LakeFS staging/uncommitted listing)
2. When existing files are detected, ask the user to confirm whether to skip 
them before continuing the upload. (Future work could explore support for 
replace/overwrite.)
3. Avoid re-uploading files that are already successfully uploaded when the 
user chooses to skip them.
4. Resume only the missing or failed files/parts after an interrupted upload.
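
A minimal sketch of the pre-upload check in steps 1 to 3, assuming the committed and uncommitted listings have already been fetched and reduced to `{path: size}` mappings. `FileEntry` and `plan_upload` are hypothetical names used only for illustration; no real LakeFS or Texera API is called here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEntry:
    path: str  # path relative to the dataset root
    size: int  # size in bytes


def plan_upload(local, committed, uncommitted):
    """Split local files into (skippable, to_upload).

    `committed` and `uncommitted` are assumed to be {path: size} mappings
    built beforehand from the two listings; fetching them is out of scope.
    """
    skippable, to_upload = [], []
    for entry in local:
        # Prefer the uncommitted size when the path exists in both listings,
        # since it reflects the most recent bytes on the branch.
        known_size = uncommitted.get(entry.path, committed.get(entry.path))
        if known_size == entry.size:
            skippable.append(entry)
        else:
            # Unknown path or size mismatch: treat as needing upload.
            to_upload.append(entry)
    return skippable, to_upload
```

A path-and-size match cannot detect a file whose content changed while its size stayed the same, which is one reason to show the `skippable` list to the user for confirmation rather than skipping silently.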

This would be especially useful for large datasets, where re-uploading already 
completed files can waste a lot of time and bandwidth.

Please feel free to share any suggestions or concerns. Thanks!

GitHub link: https://github.com/apache/texera/discussions/5744
