carloea2 commented on PR #4130:
URL: https://github.com/apache/texera/pull/4130#issuecomment-3640648605
---
**Why I prefer backing multipart uploads with a DB**
I am proposing to store multipart upload state in a DB instead of keeping
everything fully stateless, for several reasons:
1. **Clear concurrency guarantees per `(uploadId, partNumber)`**
Both Huawei OBS and Amazon S3 allow multiple uploads to the same
`uploadId` + `partNumber` and simply apply last-write-wins semantics. OBS is
very explicit about concurrent streams: its `UploadPart` API states that
repeated uploads of the same `partNumber` overwrite the previous upload, that
concurrent uploads of the same `partNumber` of the same object are resolved
with a last-write-wins policy, and that clients should lock concurrent uploads
of the same part on their side to ensure data accuracy. S3’s multipart upload
docs similarly say that if you
upload a new part using a part number that was used before, the previously
uploaded part is overwritten. In other words, neither backend prevents
concurrent writers to the same part; they just pick a winner.
With a DB row keyed by `(upload_id, part_number)` we can use row-level
locking to enforce “only one active stream can write this part” in a way that
is explicit and portable, regardless of which storage backend we are using.
This is useful both for well-behaved clients (we can give strong semantics) and
for protecting ourselves from obvious waste (e.g., the same upload repeatedly
opening the same part). In a multi-node/Kubernetes deployment, in-memory locks
or sticky sessions are fragile or operationally complex; a shared DB is the
simplest coordination point we already have. Without a DB we would need to
complicate the API (for example `/uploadPart/{uploadToken}/{partNumber}` plus
client-side coordination and disabled concurrency on that endpoint) and we
still would not have a single canonical place where we enforce this rule. Note
that this is complementary to, not a replacement for, rate limiting and abuse
protection at other layers.
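To make the "only one active stream per part" rule concrete, here is a minimal sketch of an atomic part claim against a row keyed by `(upload_id, part_number)`. It uses sqlite3 for self-containment; the table and column names are illustrative, not the actual schema, and on Postgres/MySQL the same idea would be a `SELECT ... FOR UPDATE` or a guarded `UPDATE` inside a transaction:

```python
import sqlite3

# Hypothetical schema: one row per (upload_id, part_number).
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE upload_part (
        upload_id   TEXT NOT NULL,
        part_number INTEGER NOT NULL,
        status      TEXT NOT NULL DEFAULT 'PENDING',
        PRIMARY KEY (upload_id, part_number)
    )
""")

def try_claim_part(conn, upload_id: str, part_number: int) -> bool:
    """Atomically claim a part for writing; returns False if another
    stream already holds it."""
    cur = conn.execute(
        """
        INSERT INTO upload_part (upload_id, part_number, status)
        VALUES (?, ?, 'IN_PROGRESS')
        ON CONFLICT (upload_id, part_number) DO UPDATE
            SET status = 'IN_PROGRESS'
            WHERE upload_part.status != 'IN_PROGRESS'
        """,
        (upload_id, part_number),
    )
    # rowcount is 1 only if we inserted or flipped the row ourselves
    return cur.rowcount == 1

assert try_claim_part(conn, "u1", 1) is True   # first writer wins
assert try_claim_part(conn, "u1", 1) is False  # concurrent writer rejected
```

The claim works identically regardless of which S3-compatible backend sits behind us, which is exactly the portability point above.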
2. **Efficient and well-scoped listing, aligned with S3 guidance**
If the client stores the raw `uploadId` and we rely purely on the
S3-compatible APIs, completing an upload means calling `ListParts` or
`ListMultipartUploads` and filtering:
* `ListMultipartUploads` lists in-progress multipart uploads for the
entire bucket, up to 1000 at a time.
* `ListParts` lists the parts for a specific multipart upload, given its
`uploadId`.
In our lakeFS setup I observed that `ListMultipartUploads` behaves
effectively repo-wide, and the prefix filtering we would like (for something
like `/dataset-1/...`) does not give us the narrow, efficient listing we want.
As more uploads accumulate under the same repo, listing gets slower and less
predictable. On top of that, the S3 Developer Guide explicitly says that the
multipart part listing is not supposed to be your source of truth for
completing an upload and that you should maintain your own list of part numbers
and ETags. A DB table is exactly that “own list”: an indexed, well-scoped
record of parts per logical upload that we can query efficiently and
deterministically, independent of how many uploads exist in the bucket or repo,
and independent of quirks of the underlying S3/lakeFS implementation.
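As a sketch of what that "own list" looks like in practice (table and column names are hypothetical), each successful `UploadPart` response records one row, and completion becomes a single indexed query instead of a bucket-wide listing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE upload_part (
        upload_id   TEXT NOT NULL,
        part_number INTEGER NOT NULL,
        etag        TEXT,
        PRIMARY KEY (upload_id, part_number)
    )
""")

def record_part(conn, upload_id, part_number, etag):
    # Called once per successful UploadPart response (last write wins).
    conn.execute(
        "INSERT OR REPLACE INTO upload_part VALUES (?, ?, ?)",
        (upload_id, part_number, etag),
    )

def parts_for_completion(conn, upload_id):
    """Build the Parts list for CompleteMultipartUpload from our own
    records, independent of how many uploads exist in the bucket."""
    rows = conn.execute(
        "SELECT part_number, etag FROM upload_part "
        "WHERE upload_id = ? ORDER BY part_number",
        (upload_id,),
    )
    return [{"PartNumber": n, "ETag": e} for n, e in rows]

record_part(conn, "u1", 2, '"etag-2"')
record_part(conn, "u1", 1, '"etag-1"')
assert parts_for_completion(conn, "u1") == [
    {"PartNumber": 1, "ETag": '"etag-1"'},
    {"PartNumber": 2, "ETag": '"etag-2"'},
]
```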
3. **Opaque upload tokens vs exposing `uploadId` (and the encrypted token
alternative)**
I would like to avoid exposing the raw S3 `uploadId` directly to clients.
It is a backend-specific identifier, and once clients depend on it we are
coupled to a specific storage implementation and data model.
There are two designs we could follow:
* **Encrypted token approach, client as temporary storage**
We could encode `uploadId` and any metadata (path, repo, limits,
timestamps, etc.) into an encrypted and signed token, give that to the client,
and let the client act as our temporary storage. On each request, the client
sends the token back, we decrypt and verify it, and recover the `uploadId`.
This works, but it has tradeoffs:
* The token grows as we pack more metadata inside.
* Any schema change or new metadata field requires token versioning and
migration logic.
* Server-side changes such as aborting an upload, tightening limits, or
invalidating uploads become harder, because the server no longer owns the
authoritative state and must handle multiple token versions that may be “in the
wild”.
* **DB-backed opaque token (preferred)**
With a DB, we can generate a short-lived opaque upload token that
simply references a row where we store the real `uploadId` plus metadata. The
client never sees `uploadId` directly, and the token itself can stay small and
stable. All evolution (new columns, extra constraints, per-user quotas, flags,
counters) happens in the DB without changing the client contract. Invalidating
or aborting an upload is also as simple as flipping state in the DB.
Both approaches keep `uploadId` hidden, but the DB-backed option is
simpler to reason about, more flexible as we add features, and easier to revoke
or invalidate than encoding everything into an ever-growing token.
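A minimal sketch of the DB-backed option (names are illustrative; a dict stands in for the upload-metadata table): the token carries no information itself, it only keys a server-side row, so revocation is a state flip rather than token versioning.

```python
import secrets

uploads = {}  # stands in for the upload-metadata table

def start_upload(s3_upload_id: str, key: str, max_part_size: int) -> str:
    token = secrets.token_urlsafe(32)  # small, stable, reveals nothing
    uploads[token] = {
        "upload_id": s3_upload_id,   # backend identifier, never exposed
        "key": key,
        "max_part_size": max_part_size,
        "status": "ACTIVE",
    }
    return token

def resolve(token: str) -> dict:
    row = uploads.get(token)
    if row is None or row["status"] != "ACTIVE":
        raise PermissionError("unknown or revoked upload token")
    return row

def abort_upload(token: str) -> None:
    # Server-side invalidation: no client round-trip, no token migration.
    uploads[token]["status"] = "ABORTED"

tok = start_upload("s3-internal-id-123", "dataset-1/file.bin", 100 * 2**20)
assert resolve(tok)["upload_id"] == "s3-internal-id-123"
abort_upload(tok)
try:
    resolve(tok)
    assert False
except PermissionError:
    pass
```

Adding a column (quota, retry counter, flag) changes nothing about the token or the client contract.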
4. **Enforcing max file size or resource limits without hammering the DB**
If we want a hard limit on the total bytes an upload can consume, we need
to enforce that while data is streaming, not only at the very end when we call
`CompleteMultipartUpload`. The multipart upload docs already recommend
maintaining our own mapping from part numbers to ETags for completion; that
same per-upload metadata is a natural place to store limits.
My preferred approach is:
* The frontend tells us the intended total file size and desired number
of parts.
* The backend computes a per-part limit `maxPartSize = ceil(fileSize /
numParts)` (rounding up so that `numParts` parts can cover the whole file)
and stores it in the DB as metadata for that upload.
* When the client uploads part X, we require `Content-Length` and verify
that it is less than or equal to `maxPartSize` for this upload before streaming
to S3. We can also keep a local counter for the current request and immediately
cut the stream once it exceeds `maxPartSize`, so the enforcement is truly
“hard” from our perspective.
This reduces DB work to O(number of parts) checks instead of O(number of
chunks). For example, 1 GiB split into 10 parts of ~100 MiB each, streamed in
8 KiB chunks, would otherwise require roughly 131,000 DB `addAndGet` calls if
we tried to maintain a global accumulator per chunk. Here we do at most 10
DB-backed checks, because we only need to validate each part once. Since `maxPartSize`
and any retry counters live server-side in the DB (not in a client-controlled
token), clients cannot bump their own limit or reset their own attempt counters.
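The enforcement path above can be sketched as follows (a hand-rolled illustration, not the actual implementation; `stream_part` stands in for the handler that relays the request body to the S3 `UploadPart` call):

```python
import math

CHUNK = 8 * 1024  # 8 KiB streaming chunks

def max_part_size(file_size: int, num_parts: int) -> int:
    # ceil so that num_parts parts of this size can always cover file_size
    return math.ceil(file_size / num_parts)

def stream_part(body, declared_length: int, limit: int):
    """Relay one part while enforcing the per-upload limit locally:
    one DB lookup for `limit`, zero DB writes per chunk."""
    if declared_length > limit:
        raise ValueError("Content-Length exceeds maxPartSize")
    seen = 0  # local counter for this request only
    for chunk in body:
        seen += len(chunk)
        if seen > limit:
            # cut the stream immediately; the part is rejected mid-flight
            raise ValueError("stream exceeded maxPartSize")
        yield chunk  # forwarded to the S3 UploadPart call

limit = max_part_size(1 << 30, 10)          # 1 GiB in 10 parts
assert limit == 107374183                   # ceil(2**30 / 10)
ok = b"".join(stream_part([b"x" * CHUNK], CHUNK, limit))
assert len(ok) == CHUNK
```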
**If we do not have `fileSize` and `numParts` from the frontend** we have
strictly worse options:
* Maintain a global per-upload byte counter and update it on every chunk:
this is a very hot DB row and write-heavy.
* Only update the counter per completed part: cheaper on the DB, but then
we only detect a violation after the part has already consumed bandwidth and
CPU.
* Use a fixed global `maxPartSize` based on backend limits (for example 5
GiB S3 part size) and derive a maximum number of parts from an overall limit:
this is coarse and either rejects otherwise valid uploads or still requires
tracking remaining budget per upload in a more complex way.
Also, if we enforce a global max part size equal to S3’s limit (5 GiB),
then many realistic uploads will be restricted to a single part and therefore
lose the benefit of concurrent part uploads entirely. With the DB-backed
per-upload `maxPartSize` we can still enforce a hard overall limit while
allowing the client to choose a higher degree of concurrency for objects that
are much smaller than 5 GiB.
Finally, since this metadata lives in the DB, we can track additional
abuse-related signals per upload (e.g., how many times an upload has been
started and cancelled right before completion) and cap retries or detect
obviously wasteful patterns without continuously growing a token or
round-tripping more and more encoded state to the client.
5. **Operational benefits and future features**
A DB row per upload also gives us:
* Better observability: we can see which uploads exist, their status,
size, parts, timestamps, owners, and retry history without scraping object
store listings.
* Easier cleanup: we can implement “abort incomplete uploads older than N
hours or days” using our DB, instead of scanning all multipart uploads in the
bucket and reverse-mapping them to our logical repos.
* A natural place to attach future features such as per-user quotas,
project-level limits, resumable uploads, server-driven cancellation, or
finer-grained policies, all without changing the external client-visible API or
token format.
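For instance, the "abort incomplete uploads older than N hours" cleanup becomes a single query over our own table (sqlite3 sketch with a hypothetical schema; each returned token would then drive an `AbortMultipartUpload` call):

```python
import sqlite3, time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE upload (
        token      TEXT PRIMARY KEY,
        status     TEXT NOT NULL,
        created_at INTEGER NOT NULL   -- unix seconds
    )
""")

def stale_uploads(conn, max_age_hours: int, now: int):
    """Incomplete uploads older than the cutoff, found without scanning
    bucket-wide multipart listings."""
    cutoff = now - max_age_hours * 3600
    rows = conn.execute(
        "SELECT token FROM upload "
        "WHERE status NOT IN ('COMPLETED', 'ABORTED') AND created_at < ?",
        (cutoff,),
    )
    return [t for (t,) in rows]

now = int(time.time())
conn.execute("INSERT INTO upload VALUES ('old', 'ACTIVE', ?)", (now - 48 * 3600,))
conn.execute("INSERT INTO upload VALUES ('new', 'ACTIVE', ?)", (now - 3600,))
conn.execute("INSERT INTO upload VALUES ('done', 'COMPLETED', ?)", (now - 48 * 3600,))
assert stale_uploads(conn, 24, now) == ["old"]
```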
Rate limiting, WAF rules, and other perimeter protections are still
necessary for dealing with truly malicious clients, but having upload state
modeled explicitly in a DB gives us a clean, consistent substrate for
correctness, per-upload policies, and evolution over time.
---
@aicam