Dear all,

I believe the Java Iceberg REST client encodes namespace and table
identifiers slightly incorrectly when constructing request URLs. Path
segments are built with `java.net.URLEncoder.encode(...)`, which implements
`application/x-www-form-urlencoded` — not RFC 3986 path encoding. The
visible symptom is that a space becomes `+` instead of `%20`, and a literal
`+` becomes `%2B`. A server that percent-decodes the path per RFC 3986 then
sees `+` in both cases and cannot tell the two apart.
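
To make the symptom concrete, here is a minimal sketch (not Iceberg code)
that form-encodes two names with the JDK encoder and then shows what a
server recovers if it percent-decodes the path. The class name, helper
names, host and route are all made up for illustration; only
`URLEncoder.encode` and `URI.getPath` are real JDK calls.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class FormEncodingInPathDemo {
  private FormEncodingInPathDemo() {}

  // Form encoding via the JDK class that RESTUtil.encodeString wraps.
  static String formEncode(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  // What a server recovers if it percent-decodes the path per RFC 3986.
  // Host and route are placeholders, not a real endpoint.
  static String decodedLastSegment(String encodedSegment) {
    String path =
        URI.create("http://catalog.example/v1/namespaces/" + encodedSegment).getPath();
    return path.substring(path.lastIndexOf('/') + 1);
  }

  public static void main(String[] args) {
    for (String name : new String[] {"a b", "a+b"}) {
      String onWire = formEncode(name);
      System.out.println(name + " -> " + onWire + " -> " + decodedLastSegment(onWire));
    }
  }
}
```

Both names end up as `a+b` on the server side: `a b` travels as `a+b` and
stays that way, while `a+b` travels as `a%2Bb` and decodes to `a+b`.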

Root cause: `RESTUtil.encodeString(String)` wraps `URLEncoder.encode`. It
has two kinds of callers with incompatible requirements:

1. OAuth2 form bodies (RFC 6749) — current behavior is correct.
2. URL path segments in `ResourcePaths` (table / view / metrics / plan /
task) and per-level namespace encoding in `RESTUtil.encodeNamespace` —
current behavior is wrong per RFC 3986.
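
One possible shape for a fix, as a sketch only: split the helper in two so
each kind of caller gets the encoding it needs. The names `encodeFormValue`
and `encodePathSegment` are hypothetical and do not exist in `RESTUtil`
today. The path variant below reuses `URLEncoder` and rewrites `+` to `%20`,
which is safe because a literal `+` has already become `%2B` at that point.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class EncodingSplitSketch {
  private EncodingSplitSketch() {}

  // Caller kind 1 (OAuth2 form bodies): keep form encoding unchanged.
  static String encodeFormValue(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8);
  }

  // Caller kind 2 (URL path segments): after form encoding, the only '+'
  // characters left stand for spaces, so they can be rewritten to %20.
  static String encodePathSegment(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
  }

  // Percent-decodes a single segment the way an RFC 3986 server would.
  // Host and route are placeholders.
  static String decodePathSegment(String encodedSegment) {
    String path = URI.create("http://catalog.example/t/" + encodedSegment).getPath();
    return path.substring(path.lastIndexOf('/') + 1);
  }

  public static void main(String[] args) {
    for (String name : new String[] {"plan with spaces", "a+b"}) {
      String segment = encodePathSegment(name);
      // Each name should survive the encode/decode round trip unchanged.
      System.out.println(segment + " -> " + decodePathSegment(segment));
    }
  }
}
```

With this split, `plan with spaces` goes out as `plan%20with%20spaces` in a
path and still as `plan+with+spaces` in a form body.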

Non-Java engines get this right. DuckDB, for example, sends `%20` for a
space in a namespace or table name, so a spec-compliant server that
correctly percent-decodes path segments sees a different identifier
depending on which client issued the request.

We are already using the now-customizable separator (`\u001f`) to join
multi-level namespaces in path segments, which is itself a deviation from a
pure "one segment per level" RFC approach. That's fine as a deliberate
choice, but I believe we should still respect RFC 3986 for encoding the
level contents themselves.
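
As a rough illustration of what I mean, the sketch below keeps the separator
approach but encodes each level's contents per RFC 3986. It assumes the
levels are encoded individually and joined with the percent-encoded
separator; the class and method names are hypothetical and this is not the
actual `RESTUtil.encodeNamespace` implementation.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class NamespacePathSketch {
  private NamespacePathSketch() {}

  // Unit separator used to join levels; assumed here to be configurable.
  static final String DEFAULT_SEPARATOR = "\u001f";

  // RFC 3986-compatible segment encoding: form-encode, then turn '+' into %20.
  static String encodePathSegment(String value) {
    return URLEncoder.encode(value, StandardCharsets.UTF_8).replace("+", "%20");
  }

  // Encodes every level on its own and joins them with the encoded separator.
  static String encodeNamespace(List<String> levels, String separator) {
    String encodedSeparator = encodePathSegment(separator);
    return levels.stream()
        .map(NamespacePathSketch::encodePathSegment)
        .collect(Collectors.joining(encodedSeparator));
  }

  public static void main(String[] args) {
    List<String> levels = Arrays.asList("my ns", "a+b");
    System.out.println(encodeNamespace(levels, DEFAULT_SEPARATOR));
  }
}
```

For the levels `my ns` and `a+b` this yields `my%20ns%1Fa%2Bb`: the joining
convention is untouched, only the level contents change.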

Impact:
- Any namespace or table identifier containing a space, `+`, or other
characters where form-urlencoded and RFC 3986 path encoding disagree (I
believe space is by far the most important one) is sent on the wire with
the wrong encoding from the Java client.
- A server that correctly decodes path segments sees `my+ns` instead of
`my ns`, leading to 404s, silent access to the wrong object, or catalog
inconsistency if two identifiers collide after decoding (`"a b"` vs
`"a+b"`).
- Cross-engine interop breaks: an object created by a non-Java client with
a space in the name is not addressable from the Java client, and vice versa.
- At Lakekeeper we have for some time now prohibited creation of objects
with `+` in their name and interpret `+` in path segments as space on read,
as a pragmatic workaround. Creation is unambiguous because the identifier
arrives in the request body, not the path, so we can reject it there.
Read/update/drop paths are the ones where the ambiguity bites. In other
catalogs, some clients simply can't load or write to affected tables.
- The OAuth2 test in `TestRESTUtil` pins form-encoding behavior, and
`TestResourcePaths` even asserts `"plan with spaces"` →
`"plan+with+spaces"` in a path — so the current behavior is locked in by
tests. No tests cover namespace/table identifiers containing spaces or `+`.

Does anyone see a problem with fixing this in the Java client? I'd like to
understand whether anyone is relying on the current encoding (servers that
form-decode path segments, proxies, intermediate tooling) before opening an
issue/PR. If it turns out there are too many compatibility concerns to fix
it outright, I think we should at the very least document the current
encoding behavior explicitly in the REST spec, so server implementers and
other clients can interoperate deliberately. Related to that, we should
also disallow routing affected identifiers through generic OpenAPI code
generation for path parameters: a standards-compliant generated client
will encode per RFC 3986, and silently round-tripping names through such a
client against a form-decoding server permanently loses the distinction
between space and `+` (and the original name with it).

Thanks,
Christian

References (permalinks on `main` @ `7e4aa89`):
- `RESTUtil.encodeString`:
https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/main/java/org/apache/iceberg/rest/RESTUtil.java#L154-L157
- `RESTUtil.encodeNamespace` per-level encoding:
https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/main/java/org/apache/iceberg/rest/RESTUtil.java#L288-L300
- `ResourcePaths` path-segment callers:
https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/main/java/org/apache/iceberg/rest/ResourcePaths.java#L111
- `TestResourcePaths` pinning `+` for space in a path:
https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/test/java/org/apache/iceberg/rest/TestResourcePaths.java#L321-L330
- `TestRESTUtil.testOAuth2URLEncoding`:
https://github.com/apache/iceberg/blob/7e4aa89d9900a52620afd1456152b63b47f2223b/core/src/test/java/org/apache/iceberg/rest/TestRESTUtil.java#L143-L149
