carloea2 commented on issue #4058:
URL: https://github.com/apache/texera/issues/4058#issuecomment-3573607898
After talking with @chenlica I realized that even with backend checks added
for `uploadOneFileToDataset`, the **multipart + presigned URL path** still has
a gap: a user could already have uploaded e.g. 1 TB to lakeFS/S3, and we only
notice at `multipartUpload?type=finish`, which is too late from a bandwidth /
resource perspective.
### Why we can’t enforce inline like single-file upload
For single-file uploads, the HTTP body streams **through** `DatasetResource`
and we can do:
```scala
totalBytesRead += read
if (totalBytesRead > maxLimit)
  throw new WebApplicationException(Response.Status.REQUEST_ENTITY_TOO_LARGE) // HTTP 413
```
For multipart uploads with **presigned PUTs**:
* The browser talks **directly** to S3/lakeFS (S3 multipart API: initiate →
upload part → complete).
* Our backend only:
* Creates presigned URLs (`type=init`), and
* Gets metadata on `type=finish`.
* The data stream never passes our JVM, so we have **no byte counter to
increment** and no way to send a 413 mid-stream.
* S3/lakeFS don’t expose an app-specific size cap inside presigned **PUT**
URLs; POST policies with `content-length-range` don’t apply to our current
multipart flow.
So an “inline” limit identical to `uploadOneFileToDataset` is fundamentally
impossible with presigned multipart PUTs.
### Proposed direction: server-side watcher
Idea (not implemented yet; looking for feasibility feedback):
* Add a small DB table tracking active multipart uploads: `(upload_id, did,
uid, repository_name, file_path, physical_address, status, bytes_uploaded,
max_bytes, timestamps)`.
* On `multipart-upload?type=init`, insert a row with `status = ACTIVE` and
`max_bytes = single_file_upload_max_size_mib`.
* Background watcher:
* Periodically (e.g. every 10–60s) calls `ListParts` (or lakeFS
equivalent) for each ACTIVE upload and sums **real part sizes** from the object
store.
* Updates `bytes_uploaded`.
* If `bytes_uploaded > max_bytes`, calls `abortPresignedMultipartUploads`
and marks `status = LIMIT_EXCEEDED`.
* On `type=finish`, if status is `LIMIT_EXCEEDED` or `ABORTED` we fail
immediately instead of completing the upload.
This does **not** trust any client-supplied sizes; it only trusts lakeFS/S3
metadata.
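A minimal sketch of one watcher tick, in Python purely for illustration (Texera's backend is Scala, and every name here is hypothetical). The only inputs it trusts are part sizes reported by a ListParts-style call against the object store:

```python
from dataclasses import dataclass

@dataclass
class TrackedUpload:
    """One row of the hypothetical multipart-tracking table."""
    upload_id: str
    key: str
    max_bytes: int
    bytes_uploaded: int = 0
    status: str = "ACTIVE"

def watcher_tick(store, uploads):
    """One pass over ACTIVE uploads.

    `store` is any object exposing list_parts(key, upload_id) -> [part sizes]
    and abort_multipart_upload(key, upload_id); with S3/lakeFS these would map
    onto the ListParts and AbortMultipartUpload control-plane calls.
    """
    for u in uploads:
        if u.status != "ACTIVE":
            continue
        # Trust only the object store's metadata, never the client.
        u.bytes_uploaded = sum(store.list_parts(u.key, u.upload_id))
        if u.bytes_uploaded > u.max_bytes:
            store.abort_multipart_upload(u.key, u.upload_id)
            u.status = "LIMIT_EXCEEDED"

# Usage with a fake store standing in for lakeFS/S3:
class FakeStore:
    def __init__(self, parts):
        self.parts = parts
        self.aborted = []
    def list_parts(self, key, upload_id):
        return self.parts[upload_id]
    def abort_multipart_upload(self, key, upload_id):
        self.aborted.append(upload_id)

store = FakeStore({"u1": [5 * 2**20] * 3,     # 15 MiB uploaded so far
                   "u2": [5 * 2**20] * 300})  # 1500 MiB uploaded so far
uploads = [TrackedUpload("u1", "a.bin", max_bytes=2**30),
           TrackedUpload("u2", "b.bin", max_bytes=2**30)]
watcher_tick(store, uploads)
# u1 stays ACTIVE; u2 exceeds the 1 GiB cap and is aborted.
```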
### Downsides / tradeoffs
* **Overshoot is still possible**: if a client uploads many small parts
quickly between watcher ticks, we only detect the violation at the next tick.
The overshoot is roughly bounded by:
  * (part size × parts uploaded between checks) + bytes that land during the
watcher's own latency.
* Shorter intervals reduce overshoot but increase control-plane calls
(`ListParts` per active upload).
* **Extra complexity**: background job, DB tracking, and a dependency on
S3/lakeFS metadata being available and fast.
* **Failure modes**: if the watcher is down, we still catch oversized
uploads at `finish` (object size check + rollback), but we can’t avoid
bandwidth cost.
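To make the overshoot bound concrete, a back-of-the-envelope calculation (all figures are illustrative assumptions, not measured Texera numbers):

```python
# Worst-case bytes accepted past the cap before the watcher reacts:
#   overshoot <= client_bandwidth * (check_interval + watcher_latency)
# equivalently: part_size * parts_uploadable_in_that_window.

part_size_mib = 64    # hypothetical part size
client_gbps = 1.0     # hypothetical client bandwidth
interval_s = 30       # watcher tick interval
latency_s = 2         # ListParts + abort round trip

bytes_per_s = client_gbps * 1e9 / 8
overshoot_bytes = bytes_per_s * (interval_s + latency_s)
overshoot_parts = overshoot_bytes / (part_size_mib * 2**20)

print(f"overshoot <= {overshoot_bytes / 2**30:.1f} GiB "
      f"(~{overshoot_parts:.0f} parts of {part_size_mib} MiB)")
```

So at 1 Gbps and a 30 s tick, a few GiB can slip past the cap before abort, which is annoying but bounded, versus the unbounded 1 TB scenario above.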
### Future benefit
If we go this route, the same table + watcher make it much easier later to
add **per-user / per-dataset quotas** (e.g., total in-flight bytes, per-user
caps). I haven’t started coding this yet because I’d like feedback on whether
this approach sounds acceptable from a design / ops standpoint.
### Tiny diagram
```mermaid
sequenceDiagram
participant Client
participant Texera as DatasetResource
participant Watcher
participant LakeFS
participant DB
Client->>Texera: multipart-upload?type=init
Texera->>LakeFS: initiateMultipart
LakeFS-->>Texera: uploadId + presigned URLs
Texera->>DB: INSERT upload (ACTIVE, max_bytes)
Texera-->>Client: uploadId + URLs
Client->>LakeFS: PUT parts (direct)
loop every N seconds
Watcher->>DB: SELECT ACTIVE uploads
Watcher->>LakeFS: ListParts(uploadId)
LakeFS-->>Watcher: part sizes
Watcher->>DB: UPDATE bytes_uploaded
alt bytes_uploaded > max_bytes
Watcher->>LakeFS: abortMultipartUpload()
Watcher->>DB: status = LIMIT_EXCEEDED
end
end
Client->>Texera: multipart-upload?type=finish
Texera->>DB: read status
alt LIMIT_EXCEEDED
Texera-->>Client: 413 Payload Too Large
else
Texera->>LakeFS: completeMultipartUpload()
LakeFS-->>Texera: final size
Texera-->>Client: 200 OK
end
```
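The finish-time gate at the bottom of the diagram is the simple part; a sketch in Python with hypothetical names (the real handler would live in `DatasetResource`):

```python
class PayloadTooLarge(Exception):
    """Stand-in for an HTTP 413 response."""

def finish_multipart(tracked_status, complete_fn):
    """Gate `type=finish` on the watcher's verdict.

    `tracked_status` is the status column from the tracking table;
    `complete_fn` stands in for the lakeFS/S3 CompleteMultipartUpload call.
    """
    if tracked_status in ("LIMIT_EXCEEDED", "ABORTED"):
        # Never complete an upload the watcher already condemned.
        raise PayloadTooLarge("upload exceeded max_bytes and was aborted")
    return complete_fn()

# Usage:
ok = finish_multipart("ACTIVE", lambda: "200 OK")
try:
    finish_multipart("LIMIT_EXCEEDED", lambda: "200 OK")
    rejected = False
except PayloadTooLarge:
    rejected = True
```

The point is only that `finish` consults the tracking table before touching the object store, so even a watcher that aborted mid-upload fails the client cleanly.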