Micah Cowan <[EMAIL PROTECTED]> writes:

> It is actually illegal to specify byte values outside the range of
> ASCII characters in a URL, but it has long been historical practice
> to do so anyway.  In most cases, the intended meaning was one of the
> latin character sets (usually latin1), so Wget was right to do as it
> does, at that time.
Your explanation is spot-on.  I would only add that Wget's
interpretation of what counts as a "control" character is not so much
geared toward Latin 1 as toward maximum safety.  Originally I planned
to simply encode *all* file name characters outside the 32-127 range,
but in practice it was very annoying (not to mention US-centric) to
encode perfectly valid Latin 1/2/3/... characters as %xx.  Since the
codes 128-159 *are* control characters in those charsets -- ones that
can mess up your screen and that you wouldn't want displayed by
default -- I decided to encode them by default, but allow for a way to
turn that off, in case someone used a different charset.

In the long run, supporting something like IRIs is surely the right
thing to go for, but I have a feeling that we'll be stuck with the
current messy URLs for quite some time to come.  So Wget simply needs
to adapt to the current circumstances.

If the locale includes "UTF-8" in any shape or form, it is perfectly
safe to assume that it's valid to create UTF-8 file names.  Of course,
we don't know whether a particular URL path sequence is really meant
to be UTF-8, but there should be no harm in allowing valid UTF-8
sequences to pass through.  In other words, the default "quote
control" policy could simply be smarter about what "control" means.

One consequence would be that Wget creates differently-named files in
different locales, but that is probably a reasonable price to pay for
not breaking an important expectation.  Another consequence would be
leaving users open to IDN homograph attacks, but I don't know whether
that's a problem in the context of creating file names (a homograph
attack is normally defined as a misrepresentation of whom you are
communicating with).

For those who want to hack on this, the place to look at is
url.c:append_uri_pathel; that strangely-named function takes a path
element (a directory or file name component of the URL) and appends it
to the file name.  It takes care never to use ".." as a path component
and to respect the --restrict-file-names setting specified by the
user.  It could be made to recognize UTF-8 character sequences in
UTF-8 locales and exempt valid UTF-8 characters from being treated as
"control" characters.  Invalid UTF-8 bytes would still be subject to
all the checks, and non-canonical (overlong) UTF-8 sequences would be
"rejected" (by condemning their byte values to being escaped as %xx).
This is not much work for someone who understands the basics of UTF-8.