AnzhiZhang commented on issue #4110:
URL: https://github.com/apache/texera/issues/4110#issuecomment-4035909773
I came across this issue from the 1.1.0-incubating RC4 release mailing list.
Before sharing some thoughts, I want to confirm my understanding of the change:
- The multipart upload flow is being changed from client-direct-to-MinIO
(via presigned URLs) to fully server-proxied through File Service
- A new DB schema is introduced to track upload session state and abort
uploads when the cumulative size exceeds the limit
- The stated motivations are:
  - Enforce file size limits that cannot be bypassed by modifying frontend
    code
  - Avoid calling `listParts()`
  - Abort oversized uploads mid-stream
A few questions and concerns, in case they weren't already addressed in
earlier discussions:
**1. The threat model**
Users who can upload files are authenticated. In an organization-internal
deployment, most users can reasonably be assumed trusted, making this attack
vector relatively low-motivation. For publicly open deployments, please see the
discussion below.
**2. Alternative mitigations for the size bypass**
The size limit bypass described in #4058 is a valid concern. However, it
seems the same goal could be achieved without proxying bytes through the server:
- Requiring clients to declare `totalSizeBytes` at init time and validating
it server-side
- Configuring a MinIO lifecycle policy to automatically abort incomplete
multipart uploads after a timeout
- Checking the actual size after the multipart upload completes and deleting
the object if it exceeds the limit
With this approach, even if a user bypasses the init-time check, the file
would still be deleted after the upload completes. The remaining risk is
temporary resource exhaustion, which can be handled with per-user quotas and
rate limiting on upload initiation.
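As a rough sketch of the first two bullets (the limit, function name, and
field layout below are illustrative, not Texera's actual API): the init-time
check is a one-line comparison, and the abort-on-timeout rule is a standard
S3 `LifecycleConfiguration` entry, which the issue text indicates MinIO's
lifecycle management also accepts.

```python
MAX_DATASET_BYTES = 10 * 1024**3  # hypothetical 10 GB limit

def validate_upload_init(declared_size_bytes: int) -> None:
    """Reject an upload session at init time if the declared size is invalid."""
    if declared_size_bytes <= 0:
        raise ValueError("totalSizeBytes must be positive")
    if declared_size_bytes > MAX_DATASET_BYTES:
        raise ValueError(
            f"declared size {declared_size_bytes} exceeds limit {MAX_DATASET_BYTES}"
        )

# Lifecycle rule (standard S3 LifecycleConfiguration shape) that aborts
# multipart uploads left incomplete for more than one day.
ABORT_INCOMPLETE_RULE = {
    "Rules": [
        {
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 1},
        }
    ]
}
```

The rule dict would be applied once per bucket, e.g. via an S3 client's
put-bucket-lifecycle call or MinIO's `mc ilm` tooling.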
It's also worth noting that bypassing the presigned URL flow requires
writing custom client code; this is deliberate malicious behavior from an
authenticated user, which may warrant a different risk assessment.
Additionally, while `content-length-range` cannot restrict individual multipart
parts, S3 bucket policies support the `s3:content-length` condition key, which
may offer another enforcement point. It may also be possible to cap the total
upload size simply by limiting the number of presigned URLs issued. For
example, at 50 MB per part, issuing at most 200 URLs would effectively cap
uploads at 10 GB. Whether MinIO supports these policy conditions would need
verification, but it seems worth exploring.
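The part-count cap can be sketched in a few lines, using the illustrative
numbers from above (the fixed part size, the cap, and the function name are
all assumptions, not existing Texera code):

```python
import math

PART_SIZE_BYTES = 50 * 1024**2  # hypothetical fixed 50 MB part size
MAX_PARTS = 200                 # 200 parts * 50 MB ~= 10 GB hard cap

def parts_to_issue(declared_size_bytes: int) -> int:
    """Number of presigned part URLs needed for the declared size.

    Raises if the upload would need more parts than the server will ever
    sign, which bounds the total upload size without proxying any bytes.
    """
    parts = math.ceil(declared_size_bytes / PART_SIZE_BYTES)
    if parts > MAX_PARTS:
        raise ValueError(f"{parts} parts needed, only {MAX_PARTS} allowed")
    return parts
```

Because the server controls how many URLs it signs, a client that bypasses
the frontend still cannot exceed `MAX_PARTS * PART_SIZE_BYTES` (modulo
oversized individual parts, which the post-completion size check catches).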
**3. On avoiding `listParts()`**
It's not clear why avoiding `listParts()` is listed as a benefit.
`listParts()` or getting file stats is a straightforward way to verify per-part
state and the true file size. The overhead of either call is negligible
compared to the cost of routing all upload traffic through File Service.
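For concreteness, verifying the true size from `ListParts` is a single pass
over small metadata records; `pages` below stands in for the (possibly
paginated) responses a real S3/MinIO client would return, so this is a
hedged sketch of the idea rather than actual client code:

```python
def uploaded_bytes(pages) -> int:
    """Total bytes uploaded so far, summed over ListParts response pages."""
    return sum(
        part["Size"]
        for page in pages
        for part in page.get("Parts", [])
    )
```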
**4. Server performance under the new design**
Proxying byte streams through File Service will consume server CPU, memory,
and network bandwidth, especially with the frontend supporting up to 10
concurrent chunks per file and a finite server thread pool. Heavy concurrent
uploads could potentially degrade performance for other users. The purpose of
offloading to MinIO with presigned URLs is precisely to separate file transfer
from application logic; the new design reintroduces that coupling and adds a
server communication hop that previously did not exist. The issue description
does not appear to discuss this tradeoff.
**5. Is this a common pattern for this use case?**
Server-proxied uploads are typically used for real-time content scanning,
anonymous uploads, or when the existence of the object store needs to be hidden
from clients. It's not obvious that Texera's dataset upload falls into any of
these categories.
**6. On the watcher approach from #4058**
As for the "watcher" idea mentioned in #4058: since the backend already
holds the `uploadId` for each in-progress upload, a background job could
periodically call `ListParts` to sum the already-uploaded bytes and invoke
`AbortMultipartUpload` on any session that exceeds the size limit or has been
idle too long.
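A minimal sketch of the watcher's decision logic, under assumed limits: in a
real job, `parts_bytes` would come from calling `ListParts` for each tracked
`uploadId`, and a `True` result would trigger `AbortMultipartUpload`; both
calls are deliberately left out here.

```python
import time

SIZE_LIMIT_BYTES = 10 * 1024**3  # hypothetical size limit
MAX_IDLE_SECONDS = 24 * 3600     # hypothetical idle timeout

def should_abort(parts_bytes, last_activity_epoch, now=None):
    """True if an in-progress upload session should be aborted.

    A session is aborted when the bytes uploaded so far exceed the limit,
    or when no part has been uploaded for longer than the idle timeout.
    """
    now = time.time() if now is None else now
    oversized = parts_bytes > SIZE_LIMIT_BYTES
    idle = (now - last_activity_epoch) > MAX_IDLE_SECONDS
    return oversized or idle
```

The periodic job would iterate over the `uploadId`s the backend already
tracks, evaluate this predicate per session, and abort the offenders.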
It would be helpful to know whether these tradeoffs were already discussed,
and whether a simpler fix was considered and ruled out for specific reasons.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]