GitHub user xuang7 created a discussion: Improve resumable upload: track completion at the batch/session level
### Context

During recent large-dataset upload testing, we ran into an issue that may be worth discussing as a group, since the solution may affect both the upload and dataset commit behavior.

A user uploaded ~1300 files (~1.3 TB total) into a dataset. After the upload "finished", only ~1200 files had been uploaded; ~100 failed mid-process. This exposed two usability issues:

1. No visibility into what failed: There's no clear report of which files are missing, so the user can't tell what still needs uploading.
2. Re-dragging the entire folder wastes huge amounts of time and bandwidth: Resumability today is file-level only. When the user re-drops the same directory:
   - the ~100 failed files resume correctly, but
   - the ~1200 already-uploaded files are treated as new uploads and re-sent in full.

Currently, resumability has no batch/session-level state. We do not track which files in the batch have already completed, so re-dropping the same directory cannot distinguish "already uploaded" files from "new" files.

**There are two categories of files that get uploaded again today:**

- Committed files: These files already exist in a dataset version, but the upload path does not check for them before uploading. They are re-uploaded into LakeFS staging and deduplicated later during createDatasetVersion, which diffs the branch. Because LakeFS is content-addressed, the re-sent bytes match the committed checksum and do not appear in the final diff, so they are not committed again.
- Uncommitted files: These files finished uploading but have not yet been committed. The upload session row is deleted after finishMultipartUpload, so a completed file becomes invisible.
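To make the two categories concrete, here is a minimal sketch (not Texera code) of the batch-level view that is missing today: given a re-dropped folder, decide per file whether it is already committed, already uploaded but uncommitted, or still pending. The names `LocalFile`, `FileState`, and `classify_batch` are hypothetical, the two dictionaries stand in for whatever listings the storage layer can provide, and matching on path plus size is an assumption of the sketch rather than an existing behavior.

```python
from dataclasses import dataclass
from enum import Enum


class FileState(Enum):
    COMMITTED = "committed"      # already part of a dataset version
    UNCOMMITTED = "uncommitted"  # finished uploading, not yet committed
    PENDING = "pending"          # failed, partial, changed, or never uploaded


@dataclass(frozen=True)
class LocalFile:
    path: str  # path relative to the dropped folder
    size: int  # size in bytes


def classify_batch(batch, committed, uncommitted):
    """Group the paths in `batch` by state.

    `committed` and `uncommitted` map path -> size (hypothetical inputs).
    Assumption: an uncommitted entry reflects the latest state of the
    branch, so it takes precedence over a committed entry for the same path.
    """
    result = {state: [] for state in FileState}
    for f in batch:
        if f.path in uncommitted:
            known_size, match_state = uncommitted[f.path], FileState.UNCOMMITTED
        elif f.path in committed:
            known_size, match_state = committed[f.path], FileState.COMMITTED
        else:
            known_size, match_state = None, FileState.PENDING
        # A size mismatch means the local file differs and must be uploaded.
        state = match_state if known_size == f.size else FileState.PENDING
        result[state].append(f.path)
    return result


# Usage: one file of each kind.
batch = [LocalFile("a.csv", 10), LocalFile("b.csv", 20), LocalFile("c.csv", 30)]
states = classify_batch(batch, committed={"a.csv": 10}, uncommitted={"b.csv": 20})
```

In this example only `c.csv` would need to be sent. A path-and-size match is cheap but cannot detect a file whose content changed without changing its size; a checksum comparison would be stricter at the cost of reading every local file.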
### How other systems handle this

In some other platforms, when uploading a folder that contains files already uploaded before, the system detects the existing files and prompts the user with options such as:

- Replace existing files (upload becomes a new version)
- Keep existing files (for example by creating data(1).txt)
- Skip already uploaded files

<img width="500" alt="refer" src="https://github.com/user-attachments/assets/e5d50559-109a-439b-9285-643f5a33b340" />

However, some platforms still seem to re-upload the existing files instead of truly skipping them, which is effectively what we do today.

### Proposed Solution

Move existence detection to the beginning of the upload flow instead of relying on commit-time deduplication.

1. Before uploading, detect files that already exist, checking against both states:
   - Committed in the target dataset (LakeFS object listing), and uncommitted on the branch (LakeFS staging/uncommitted listing).
   - Use a lightweight existence check, such as path and file size.
2. When existing files are detected, ask users to confirm whether they want to skip them before continuing the upload. (Future work could explore support for replace/overwrite.)
3. Avoid re-uploading files that are already successfully uploaded when the user chooses to skip them.
4. Resume only the missing or failed files/parts after an interrupted upload.

This would be especially useful for large datasets, where re-uploading already completed files can waste a lot of time and bandwidth.

Please feel free to share any suggestions or concerns. Thanks!

GitHub link: https://github.com/apache/texera/discussions/5744
