It seems like our specification made a poor choice in terms of separator
character, do we have any disagreement on this point?

Not from me! But I think the next question is whether there is a better one
for the general case so that we can just choose a new one and be done. I
can’t think of a better separator that satisfies everyone.

That’s why I think it makes sense for this to be configurable via the
config route. That lets services use a separator that works for the server
implementation. It’s also simple for users. The vast majority of the time,
users would not need to know about it.

It looks like by choosing a control character, we ended up generating
requests which modern server systems define as potentially dangerous.

There are potentially two problems to solve here. You’re right that the
general problem is modern checks, but there are two cases to worry about:
the separator character used for namespaces, and URL characters that are
now suspicious, like %2F. The second case could impose limitations on the
character set that a service can effectively use for names.

A valid question is whether we should be trying to solve both problems or
just the first. I should also mention that we’ve hit the first problem in
the wild and have had to quickly move to patch it.

Any escaping seems like it will also cause client and server changes
similar to just changing the character so it seems like similar developer
complexity there.

Both solutions will only help in future client versions, but I think there
are differences in complexity.

Adding a config option for the separator introduces a fairly small spec
change to add the config property. I think that we could reasonably add
support for this without bumping all of the endpoint versions (as long as
we continue to support 0x1F).

Adding escaping would add considerably more complexity, with a separator
character (that is also not 0x1F), an escape character, and an identifier
parser that follows a set of rules.

It’s good to note that both of those options don’t address the issue that
some URL characters are also rejected. To fix the second problem, I think
we would need to change how we encode certain characters to avoid having
them caught by URL encoding.

Either adding a new way to encode certain characters or adding an escaping
scheme seems to me like it would need an endpoint version bump.

I do not see why “old clients are already broken” - they behave according
to the REST spec.

The situation we’re talking about is in cases where the service cannot
support multi-part namespaces because of the current separator character.
Right now, clients cannot interact with a service that uses a toolchain
that disallows 0x1F so all clients are currently broken. If we allow the
separator to be configured, that makes it possible to fix new clients. It
won’t fix old clients, but they are already broken by the updated toolchain.

The “root cause” is that neither the Iceberg REST spec nor the “reference
implementation” poses any restriction on the the allowed set of characters
in namespace elements, which is therefore also part of the Iceberg REST
spec / “reference implementation”.

If I understand correctly, I think this is arguing that because the spec
doesn’t restrict characters, it requires support for all characters. I
disagree with that. The spec doesn’t specify characters that cannot be
supported, but I think that it leaves room for implementations to choose
their own restrictions. Many catalogs have restrictions on the names that
are valid.

On Tue, Aug 13, 2024 at 8:11 AM Russell Spitzer <russell.spit...@gmail.com>
wrote:

> I feel like we are over complicating this situation.
>
> It seems like our specification made a poor choice in terms of
> separator character, do we have any disagreement on this point? It looks
> like by choosing a control character, we ended up generating requests which
> modern server systems define as potentially dangerous. Although it seems
> like this hasn't come up in the wild as far as I know it seems like it
> makes sense for us to just change the character to something else, modify
> the spec and move forward.
>
> The ability to custom define characters for namespace separation wasn't an
> issue before and I'm not sure why we want to introduce that capability now
> on either the Server or Client side although the Server side seems less
> complicated here. I think escaping is probably fine as well if that can
> work but i'm not sure the Servlet spec allows that?  Any escaping seems
> like it will also cause client and server changes similar to just changing
> the character so it seems like similar developer complexity there.
>
> Overall I really want to just emphasize that I don't think this is a place
> where users either on the Server or Client side should have to think about
> configuration.
>
> On Tue, Aug 13, 2024 at 4:02 AM Robert Stupp <sn...@snazy.de> wrote:
>
>> I do not see why "old clients are already broken" - they behave according
>> to the REST spec.
>>
>> The current REST spec states "If parent is a multipart namespace, the
>> parts must be separated by the unit separator (`0x1F`) byte" [1]. Using a
>> different separator, even if it is "backwards compatible", is a change to
>> the Iceberg REST spec.
>>
>> If _one_ service is subject to new(er) restrictions, it is IMO that
>> service that cannot follow the REST spec. Jetty 12 (Servlet Spec v6) was
>> mentioned as a justification. However, Jetty 12 and other implementations
>> allow operators to disable the "suspicious character validation".
>>
>> The proposed change to the REST spec and implementation would only
>> address the problem for that service, but not for all services. PMC members
>> said on the "Iceberg Catalog Sync" call on Aug 7 that they want this change
>> for that service implementation and defer a solution that works for all
>> implementations, explicitly mentioning "Nessie", for quite some time.
>>
>> The "root cause" is that neither the Iceberg REST spec nor the "reference
>> implementation" poses any restriction on the the allowed set of characters
>> in namespace elements, which is therefore also part of the Iceberg REST
>> spec / "reference implementation".
>>
>> That said, the proposed change does not work for all implementations.
>> Keeping the Iceberg REST spec as is and configuring the (affected) services
>> that they accept the '%1F' separator character seems much easier and less
>> controversial and eliminates the need for these changes entirely.
>>
>> An alternative to use an appropriate escaping mechanism, was turned down
>> on the same "Iceberg Catalog Sync" call.
>>
>>
>> Robert
>>
>>
>> [1]
>> https://github.com/apache/iceberg/blob/bfab2c334e9b4c11de65f1f9bd1de5dab18aae5b/open-api/rest-catalog-open-api.yaml#L225
>>
>>
>> On 06.08.24 21:24, Yufei Gu wrote:
>>
>> Thanks Ryan and Daniel for the context. I'm OK that the server provides a
>> separator via config endpoint. Just want to dig a bit more, it does provide
>> more flexibility for the separator, but why do we need this flexibility?
>> Looks like the only benefit is to reconstruct the namespaces. What's the
>> use case of reconstruction? We may talk about it in tomorrow's meeting.
>>
>> Thanks Eduard for the PR.
>>
>>
>> Yufei
>>
>>
>> On Tue, Aug 6, 2024 at 3:10 AM Eduard Tudenhöfner <
>> etudenhoef...@apache.org> wrote:
>>
>>> It's probably best to make this configurable rather than changing the
>>> separator in the spec. The server can communicate to the client which
>>> namespace separator should be used (via a config override), but still needs
>>> to support *%1F* as a fallback. I've opened #10877
>>> <https://github.com/apache/iceberg/pull/10877> to address this.
>>>
>>> @Robert: Old clients are already broken with *%1F* so we won't be able
>>> to avoid an old client failing with a 4xx. Introducing a new endpoint that
>>> takes a different namespace separator won't fix the issue for old clients
>>> either, since those clients won't know anything about that endpoint.
>>>
>>>
>>> On Mon, Aug 5, 2024 at 7:07 PM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> I would agree with adding either a server side (config override) or
>>>> client side control (query param with `?delim=.`) as it will be
>>>> compatible with the current v1 endpoint.
>>>>
>>>> In the future we could introduce a v2 endpoint(s), but I would want to
>>>> wait for OpenAPI 4 because they address this by allowing multi-segment
>>>> pathing via URI templates in RFC 6570
>>>> <https://datatracker.ietf.org/doc/html/rfc6570>, which is the original
>>>> way we wanted to represent namespaces, but it wasn't supported (e.g.
>>>> .../{+namespaces}/tables/{table}).  I doubt it's really worth the
>>>> effort though, so I feel like a configurable delimiter makes the most 
>>>> sense.
>>>>
>>>> -Dan
>>>>
>>> --
>> Robert Stupp
>> @snazy
>>
>>

-- 
Ryan Blue
Databricks

Reply via email to