Re: [I] Dataset max file size limit can be bypassed (frontend-only check) [texera]

via GitHub Tue, 25 Nov 2025 15:53:29 -0800


carloea2 commented on issue #4058:
URL: https://github.com/apache/texera/issues/4058#issuecomment-3578098518


   **1. Concrete bypass evidence**
   On hub.texera.io with `singleFileUploadMaxSizeMiB = 10 GiB`, I patched the 
JS size check in devtools and uploaded a **12.92 GB** `testtexera.zip`. The 
upload used multipart + presigned PUTs, `multipartUpload?type=finish` returned 
200, and the dataset shows `Version Size: 12.92 GB`. So a user who bypasses the 
frontend can exceed the configured limit today.
   
   **2. POST, part limits, and URL count**
   Presigned POST `content-length-range` only protects a *single* request, 
while our large-file path uses multipart PUTs. lakeFS/S3 enforce only 
**per-part** bounds (e.g., 5 MiB–5 GiB, up to 10,000 parts). Because parts can 
vary in size, limiting the **number of presigned URLs** alone does **not** 
enforce a total max size. Relying on a “final size” header from the client 
would again trust the frontend.
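
   To make the URL-count point concrete, here is a rough arithmetic sketch. It
assumes the 5 MiB to 5 GiB per-part bounds mentioned above; the 64 MiB planned
part size and the function name are hypothetical and not taken from Texera.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

MAX_PART = 5 * GIB  # per-part upper bound quoted above


def max_total_bytes(num_urls: int, max_part: int = MAX_PART) -> int:
    """Largest object a client could assemble from num_urls presigned part URLs."""
    return num_urls * max_part


limit = 10 * GIB
assumed_part = 64 * MIB  # hypothetical part size the server plans for

# Number of URLs the server would hand out for a file exactly at the limit
# (ceiling division).
urls = -(-limit // assumed_part)

# Nothing stops the client from sending a maximum-size part to every URL.
worst_case = max_total_bytes(urls)

print(urls, worst_case // GIB)
```

   Under these assumptions the server issues 160 URLs for a 10 GiB limit, yet
the same 160 URLs can carry up to 800 GiB.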
   
   **3. FixedSize / MaxSize parts**
   As far as I can see, we can’t tell lakeFS/S3 “every part must be exactly X 
bytes”; they only enforce min/max per part. So a “fixed-size parts = enforced 
limit” scheme isn’t reliable without extra server logic, and using 
FixedSize/MaxSize parts alone doesn’t seem feasible.
   
   **4. Finish-time check (simple backend fix)**
   First step I propose: on `multipartUpload?type=finish`, read the object size 
from lakeFS/S3 and reject anything over `singleFileUploadMaxSizeMiB` (and 
optionally delete/abort). This trusts only object-store metadata. Downside: we 
discover violations *after* all bytes are uploaded, so bandwidth and temporary 
storage are still consumed.
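
   A minimal sketch of such a finish-time check, under stated assumptions:
`get_object_size` and `delete_object` are hypothetical callables standing in
for whatever lakeFS/S3 client the backend uses, and `finish_upload` is an
illustrative name, not the existing Texera handler.

```python
from typing import Callable

MIB = 1024 ** 2


class UploadTooLarge(Exception):
    """Raised when the stored object exceeds the configured limit."""


def finish_upload(
    key: str,
    max_size_mib: int,
    get_object_size: Callable[[str], int],  # hypothetical: size from object-store metadata
    delete_object: Callable[[str], None],   # hypothetical: removes the stored object
) -> int:
    """Validate the final object size using only object-store metadata."""
    size = get_object_size(key)
    if size > max_size_mib * MIB:
        # Clean up so the oversize object does not linger in storage.
        delete_object(key)
        raise UploadTooLarge(f"{key}: {size} bytes exceeds {max_size_mib} MiB")
    return size
```

   The size never comes from the client, so a patched frontend cannot
influence the decision.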
   
   **5. Watcher approach (pros/cons)**
   A watcher that periodically calls `ListParts` and aborts the upload when 
`bytes_uploaded > limit` would:
   
   * **Pros:** detect oversize uploads earlier; natural place for future 
per-user/dataset quotas.
   * **Cons:** extra DB table + background job, more control-plane calls, new 
failure modes.
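
   One pass of such a watcher could look roughly like the sketch below. The
callables `list_part_sizes` and `abort_upload` are hypothetical wrappers around
the object store's list-parts and abort operations, and tracking of active
upload IDs (the extra DB table) is assumed to exist elsewhere.

```python
from typing import Callable, Iterable, List


def sweep_uploads(
    upload_ids: Iterable[str],
    limit_bytes: int,
    list_part_sizes: Callable[[str], Iterable[int]],  # hypothetical: sizes of parts uploaded so far
    abort_upload: Callable[[str], None],              # hypothetical: aborts the multipart upload
) -> List[str]:
    """Abort every in-progress upload whose uploaded bytes exceed the limit."""
    aborted = []
    for upload_id in upload_ids:
        bytes_uploaded = sum(list_part_sizes(upload_id))
        if bytes_uploaded > limit_bytes:
            abort_upload(upload_id)
            aborted.append(upload_id)
    return aborted
```

   A background job would call this on a timer; the sweep interval bounds how
far past the limit an upload can get before it is aborted.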
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
