AnzhiZhang commented on issue #4110:
URL: https://github.com/apache/texera/issues/4110#issuecomment-4035909773

   I came across this issue from the 1.1.0-incubating RC4 release mailing list. 
Before sharing some thoughts, I want to confirm my understanding of the change:
   
   - The multipart upload flow is being changed from client-direct-to-MinIO 
(via presigned URLs) to fully server-proxied through File Service
   - A new DB schema is introduced to track upload session state and abort 
uploads when the cumulative size exceeds the limit
   - The stated motivations are:
     - Enforce file size limits that cannot be bypassed by modifying frontend 
code
     - Avoid calling `listParts()`
     - Abort oversized uploads mid-stream
   
   A few questions and concerns, in case they weren't already addressed in 
earlier discussions:
   
   **1. The threat model**
   
   Users who can upload files are authenticated. In an organization-internal 
deployment, most users can reasonably be assumed trusted, making this attack 
vector relatively low-motivation. For publicly open deployments, please see the 
discussion below.
   
   **2. Alternative mitigations for the size bypass**
   
   The size limit bypass described in #4058 is a valid concern. However, it 
seems the same goal could be achieved without proxying bytes through the server:
   
   - Requiring clients to declare `totalSizeBytes` at init time and validating 
it server-side
   - Configuring a MinIO lifecycle policy to automatically abort incomplete 
multipart uploads after a timeout
   - Checking the actual object size after the multipart upload completes and 
deleting the object if it exceeds the limit
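
   The lifecycle item above maps onto a standard S3 lifecycle rule 
(`AbortIncompleteMultipartUpload`). A minimal sketch in boto3-style 
configuration — the bucket name, prefix, and one-day timeout are placeholder 
assumptions, and whether a given MinIO version honors the rule would need 
verification:

```python
# Sketch: lifecycle rule that makes the server abort incomplete multipart
# uploads automatically. All concrete values here are illustrative.
lifecycle_config = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # apply to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
        }
    ]
}

# With a real client this would be applied roughly as:
#   import boto3
#   s3 = boto3.client("s3", endpoint_url="http://minio:9000")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="texera-datasets", LifecycleConfiguration=lifecycle_config)
```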
   
   With this approach, even if a user bypasses the init-time check, the file 
would still be deleted after upload completes. The remaining risk is temporary 
resource exhaustion, which can be handled through per-user quotas and rate 
limiting on upload initiation.
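
   As a rough illustration of the two server-side checks (names, the limit, 
and the signatures are all hypothetical, not Texera's actual File Service API):

```python
# Sketch of the init-time and post-completion size checks described above.
# MAX_UPLOAD_BYTES and both function names are hypothetical.

MAX_UPLOAD_BYTES = 10 * 10**9  # example 10 GB limit


def validate_init(declared_total_size: int) -> None:
    """Reject the upload at init time if the declared size exceeds the limit."""
    if declared_total_size > MAX_UPLOAD_BYTES:
        raise ValueError("declared size exceeds the upload limit")


def enforce_after_complete(actual_size: int, delete_object) -> bool:
    """After CompleteMultipartUpload, check the true size (e.g. via HeadObject)
    and delete the object if the client lied at init time. Returns True if the
    object was kept."""
    if actual_size > MAX_UPLOAD_BYTES:
        delete_object()
        return False
    return True
```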
   
   It's also worth noting that bypassing the presigned URL flow requires 
writing custom client code; this is deliberate malicious behavior from an 
authenticated user, which may warrant a different risk assessment. 
Additionally, while `content-length-range` cannot restrict individual multipart 
parts, S3 bucket policies support the `s3:content-length` condition key, which 
may offer another enforcement point. It may also be possible to cap the total 
upload size simply by limiting the number of presigned URLs issued. For 
example, at 50 MB per part, issuing at most 200 URLs would effectively cap 
uploads at 10 GB. Whether MinIO supports these policy conditions would need 
verification, but seems worth exploring.
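
   For the URL-count idea, the arithmetic is simple; the sketch below assumes 
per-part size can actually be bounded (the part that needs verification, as 
noted above), and the constants mirror the example in the text:

```python
# Sketch: cap total upload size by limiting how many presigned part URLs the
# server issues. Assumes each URL authorizes at most one part and that part
# size is bounded by some other mechanism; values are illustrative.

PART_SIZE_BYTES = 50 * 10**6  # 50 MB per part
MAX_PRESIGNED_URLS = 200      # server-enforced cap on issued URLs


def effective_size_cap() -> int:
    """Upper bound on total bytes a client can upload via issued URLs."""
    return PART_SIZE_BYTES * MAX_PRESIGNED_URLS
```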
   
   **3. On avoiding `listParts()`**
   
   It's not clear why avoiding `listParts()` is listed as a benefit. 
`listParts()` or getting file stats is a straightforward way to verify per-part 
state and the true file size. The overhead of either call is negligible 
compared to the cost of routing all upload traffic through File Service.
   
   **4. Server performance under the new design**
   
   Proxying byte streams through File Service will consume server CPU, memory, 
and network bandwidth, especially with the frontend supporting up to 10 
concurrent chunks per file and a finite server thread pool. Heavy concurrent 
uploads could potentially degrade performance for other users. The purpose of 
offloading to MinIO with presigned URLs is precisely to separate file transfer 
from application logic; the new design reintroduces that coupling and adds a 
server communication hop that previously did not exist. The issue description 
doesn't seem to discuss this tradeoff.
   
   **5. Is this a common pattern for this use case?**
   
   Server-proxied uploads are typically used for real-time content scanning, 
anonymous uploads, or when the existence of the object store needs to be hidden 
from clients. It's not obvious that Texera's dataset upload falls into any of 
these categories.
   
   **6. On the watcher approach from #4058**
   
   As for the "watcher" idea mentioned in #4058: since the backend already 
holds the `uploadId` for each in-progress upload, a background job could 
periodically call `ListParts` to sum the already-uploaded bytes and invoke 
`AbortMultipartUpload` on any session that exceeds the size limit or has been 
idle too long.
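
   A minimal sketch of such a watcher, with the S3 client injected so the 
logic can be shown without a live MinIO (the names, limits, and session record 
shape are all assumptions, and `ListParts` pagination is omitted for brevity):

```python
# Sketch of the #4058 watcher: periodically sum uploaded bytes via ListParts
# and abort oversized or stale sessions. With boto3, `s3` would be a real S3
# client; session records are assumed to come from the backend's DB.
import time

MAX_UPLOAD_BYTES = 10 * 10**9  # example size limit
MAX_IDLE_SECONDS = 24 * 3600   # example idle timeout


def sweep(sessions, s3, now=None):
    """Abort any in-progress multipart upload that is oversized or idle."""
    now = time.time() if now is None else now
    aborted = []
    for s in sessions:
        parts = s3.list_parts(Bucket=s["bucket"], Key=s["key"],
                              UploadId=s["upload_id"])
        uploaded = sum(p["Size"] for p in parts.get("Parts", []))
        if uploaded > MAX_UPLOAD_BYTES or now - s["started_at"] > MAX_IDLE_SECONDS:
            s3.abort_multipart_upload(Bucket=s["bucket"], Key=s["key"],
                                      UploadId=s["upload_id"])
            aborted.append(s["upload_id"])
    return aborted
```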
   
   It would be helpful to know whether these tradeoffs were already discussed, 
and whether a simpler fix was considered and ruled out for specific reasons.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
