A canonical URL host name dilemma

Daniel Stenberg via curl-library Sat, 09 Oct 2021 02:43:03 -0700

Hello friends.

Let me take you through a bug, my current work and the little dilemma I'm facing in regards to how to "canonicalize" host names in URLs! I'll end the mail with a question about a possible solution I've thought of.


# Not parsing percent-encoded host names in URLs

    $ curl https://%63url.se/
    curl: (6) Could not resolve host: %63url.se

instead of the expected:

    $ curl https://%63url.se/
    [content from https://curl.se]

Issue: https://github.com/curl/curl/issues/7830
PR: https://github.com/curl/curl/pull/7834

## Obvious first take

 Make sure that the URL parser **decodes** percent-encoded host names. %41
 becomes `A` etc.

 The parser rejects "control codes" while decoding. %00, %0a and %0d make the
 host name illegal.
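
 The decoding step could look roughly like the sketch below. It is written in
 Python for brevity and is not curl's actual C code. `decode_host` is a
 hypothetical name, and treating every byte below 0x20 plus 0x7f as a "control
 code" is an assumption (the text above only names %00, %0a and %0d).

```python
import string

HEXDIGITS = set(string.hexdigits)


def decode_host(host: str) -> bytes:
    """Percent-decode a host name and reject control codes (hypothetical helper)."""
    raw = host.encode("utf-8")
    out = bytearray()
    i = 0
    while i < len(raw):
        byte = raw[i]
        if byte == 0x25:  # '%' introduces a two-digit hex escape
            pair = raw[i + 1:i + 3].decode("ascii", "replace")
            if len(pair) != 2 or not set(pair) <= HEXDIGITS:
                raise ValueError("malformed percent-encoding in host name")
            byte = int(pair, 16)
            i += 3
        else:
            i += 1
        # Assumption: "control codes" means 0x00-0x1f and 0x7f.
        if byte < 0x20 or byte == 0x7F:
            raise ValueError("control code in host name")
        out.append(byte)
    return bytes(out)


print(decode_host("%63url.se"))  # b'curl.se'
```

 With this sketch, `decode_host("a%0a.se")` raises `ValueError` instead of
 returning a name with an embedded newline.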

## Canonical host name

 The URL API can also *extract* the full URL, so it needs to be able to reverse
 the process, and this is where the challenges begin.

 My first simplistic (or maybe *naive*) approach works like this:

 Setting `https://%63url.se/` is extracted again as `https://curl.se/` but
 setting `https://%c0.se/` is extracted as `https://%c0.se/` (since anything
 non-ASCII is not "URL compliant").
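
 As an illustration, that round trip can be sketched in a few lines of Python.
 This is not the code in the PR: `naive_extract_host` is a made-up helper, and
 using lowercase hex digits in the output is an arbitrary choice.

```python
from urllib.parse import unquote_to_bytes


def naive_extract_host(host: str) -> str:
    """Decode the host, then percent-encode every non-ASCII byte again."""
    decoded = unquote_to_bytes(host)
    return "".join(chr(b) if b < 0x80 else "%%%02x" % b for b in decoded)


print(naive_extract_host("%63url.se"))  # curl.se
print(naive_extract_host("%c0.se"))     # %c0.se
```

 ASCII bytes come back undecorated, while anything with the high bit set is
 returned in its percent-encoded form.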

## IDN input

 Enter IDN. Internationalized Domain Names. They are specified outside of the
 regular URL spec (RFC 3986) and they are specified using non-ASCII byte
 codes.

 Example name: `räksmörgås.se` (clients puny-encode this name to
 `xn--rksmrgs-5wao1o.se` for DNS etc).

 Since this host/URL uses non-ASCII letters, the naive approach mentioned above
 would then, when the URL API is used to extract this again, use a sequence of
 percent-encoded UTF-8: `r%C3%A4ksm%C3%B6rg%C3%A5s.se`.

 It would **not** extract back to `räksmörgås.se`, which probably is what a
 user will expect.

 Next-level complication: mix in percent-encoding to the IDN name:

 `r%c3%a4ksmörgås.se`

 The two percent-encoded bytes are the UTF-8 sequence for `ä`, which makes this
 host name work the same way.
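
 To illustrate why the two spellings are equivalent, this Python sketch decodes
 the percent-encoding first and then converts the result to its ASCII form.
 The helper names are hypothetical, and it relies on Python's built-in `idna`
 codec (IDNA 2003), which may differ in edge cases from the IDN library a given
 libcurl build uses.

```python
from urllib.parse import unquote_to_bytes


def host_to_unicode(host: str) -> str:
    """Undo percent-encoding and interpret the bytes as UTF-8."""
    return unquote_to_bytes(host).decode("utf-8")


def host_to_ace(host: str) -> str:
    """Return the puny-encoded (ASCII) form used for DNS lookups."""
    return host_to_unicode(host).encode("idna").decode("ascii")


for name in ("räksmörgås.se", "r%c3%a4ksmörgås.se"):
    print(host_to_ace(name))  # xn--rksmrgs-5wao1o.se both times
```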

## IDN output

 How do we know how to encode the host name when the user wants to extract it?

 Alternatives I can think of:

### A) Don't

 Store the originally provided name and use that for retrieval as well. This
 is bad as then the same URL with differently encoded host names will appear
 as two different ones. Users probably will not expect nor appreciate this.

### B) Always

 Always percent-encode (this is what the PR currently does). It makes the host
 name canonical and it still works IDN wise, but the retrieved URL is ugly and
 user hostile.

### C) Puny-encode

 Return the **puny-encoded** version of the name if it was an IDN name,
 otherwise percent-encode. Makes the host name canonical, it still works IDN
 wise, but the retrieved URL is ugly and user hostile. Just possibly a little
 less hostile than version B. An upside could be that a puny-code version of
 the host name works even with clients that don't speak IDN.  This method then
 works differently if libcurl was built with or without IDN support.

### D) Heuristics

 If the host name was a valid IDN name, then return that name without
 encoding, otherwise percent-encode. This makes `r%c3%a4ksmörgås.se` as input
 generate `räksmörgås.se` as output.  This method then works differently if
 libcurl was built with or without IDN support.
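
 To compare the four alternatives side by side, here is a rough Python sketch.
 It is only a model of the behaviors described above, not libcurl code: the
 function names and the `method` parameter are invented for illustration, "valid
 IDN name" is approximated by "valid UTF-8 that Python's `idna` codec accepts",
 and IDN support is assumed to always be available.

```python
from urllib.parse import unquote_to_bytes


def percent_encode(raw: bytes) -> str:
    """Keep ASCII bytes as-is and percent-encode everything else."""
    return "".join(chr(b) if b < 0x80 else "%%%02x" % b for b in raw)


def to_unicode_idn(raw: bytes):
    """Return the name as text if it passes as an IDN, else None."""
    try:
        name = raw.decode("utf-8")
        name.encode("idna")  # raises UnicodeError for names the codec rejects
    except UnicodeError:
        return None
    return name


def extract_host(host: str, method: str) -> str:
    raw = unquote_to_bytes(host)   # decoded form, as the parser would store it
    if method == "A":              # A) return exactly what was provided
        return host
    if method == "B":              # B) always percent-encode
        return percent_encode(raw)
    name = to_unicode_idn(raw)
    if name is None:               # not an IDN: C and D fall back to B
        return percent_encode(raw)
    if method == "C":              # C) puny-encode
        return name.encode("idna").decode("ascii")
    return name                    # D) heuristic: hand back the IDN itself


for method in "ABCD":
    print(method, extract_host("r%c3%a4ksmörgås.se", method))
```

 For the mixed input, only A preserves the original spelling, B and C return
 ASCII-only forms, and D returns `räksmörgås.se`. A name such as `%c0.se` is
 not valid UTF-8, so C and D both return it percent-encoded.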



Can we make version (D) work and would that be preferred?

--

 / daniel.haxx.se
 | Commercial curl support up to 24x7 is available!
 | Private help, bug fixes, support, ports, new features
 | https://curl.se/support.html

