Micah Cowan <[EMAIL PROTECTED]> writes:

> It is actually illegal to specify byte values outside the range of
> ASCII characters in a URL, but it has long been historical practice
> to do so anyway. In most cases, the intended meaning was one of the
> latin character sets (usually Latin-1), so Wget was right to do as
> it did at the time.

Your explanation is spot-on.  I would only add that Wget's
interpretation of what is a "control" character is not so much geared
toward Latin 1 as it is geared toward maximum safety.  Originally I
planned to simply encode *all* file name characters outside the 32-127
range, but in practice it was very annoying (not to mention
US-centric) to encode perfectly valid Latin 1/2/3/... characters as
%xx.  Since
the codes 128-159 *are* control characters (in those charsets) that
can mess up your screen and that you wouldn't want seen by default, I
decided to encode them by default, but allow for a way to turn it off,
in case someone used a different charset.
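For illustration, the default policy amounts to roughly this (an
untested sketch, not the actual Wget code; the function name is made
up):

#include <stdbool.h>

/* Return true if byte B should be %-escaped under the default
   "quote control" policy: ASCII controls, DEL, and the C1 range
   128-159, which are control characters in the ISO 8859 charsets.  */
static bool
quote_control_p (unsigned char b)
{
  if (b < 32 || b == 127)
    return true;                /* ASCII controls and DEL */
  if (b >= 128 && b <= 159)
    return true;                /* C1 controls in Latin 1/2/3/... */
  return false;
}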

In the long run, supporting something like IRIs (Internationalized
Resource Identifiers) is surely the right
thing to go for, but I have a feeling that we'll be stuck with the
current messy URLs for quite some time to come.  So Wget simply needs
to adapt to the current circumstances.  If the locale includes "UTF-8"
in any shape or form, it is perfectly safe to assume that it's valid
to create UTF-8 file names.  Of course, we don't know if a particular
URL path sequence is really meant to be UTF-8, but there should be no
harm in allowing valid UTF-8 sequences to pass through.  In other
words, the default "quote control" policy could simply be smarter
about what "control" means.
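Detecting such a locale could be as simple as this (a sketch assuming
nl_langinfo is available; the function name is mine):

#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* Return true if the user's locale encodes text as UTF-8.  */
static bool
locale_is_utf8 (void)
{
  const char *codeset;
  setlocale (LC_CTYPE, "");            /* honor the environment */
  codeset = nl_langinfo (CODESET);     /* e.g. "UTF-8" */
  return codeset
         && (strcasecmp (codeset, "UTF-8") == 0
             || strcasecmp (codeset, "UTF8") == 0);
}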

One consequence would be that Wget creates differently-named files in
different locales, but it's probably a reasonable price to pay for not
breaking an important expectation.  Another consequence would be
making users open to IDN homograph attacks, but I don't know if
that's a problem in the context of creating file names (a homograph
attack normally misrepresents whom you are communicating with).

For those who want to hack on this, the place to look at is
url.c:append_uri_pathel; that strangely-named function takes a path
element (a directory name or file name component of the URL) and
appends it to the file name.  It takes care never to use ".." as a
path component and to respect the --restrict-file-names setting as
specified by the user.  It could be made to recognize UTF-8 character
sequences in UTF-8 locales and exempt valid UTF-8 chars from being
treated as "control" characters.  Invalid UTF-8 bytes would still go
through all the existing checks, and non-canonical (overlong) UTF-8
sequences would be "rejected" (by condemning their byte values to
being escaped as %..).  This is
not much work for someone who understands the basics of UTF-8.
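To make that concrete, the recognition step could look something like
this (an untested sketch; utf8_sequence_length is a hypothetical
helper, not an existing Wget function).  It returns the length of a
valid, shortest-form UTF-8 sequence starting at P, or 0 so that the
bytes fall through to the normal escaping:

static int
utf8_sequence_length (const unsigned char *p, const unsigned char *end)
{
  unsigned char b = p[0];
  int len, i;
  unsigned int cp, min;

  if (b < 0x80)
    return 1;                   /* plain ASCII */
  else if ((b & 0xe0) == 0xc0)
    len = 2, min = 0x80;
  else if ((b & 0xf0) == 0xe0)
    len = 3, min = 0x800;
  else if ((b & 0xf8) == 0xf0)
    len = 4, min = 0x10000;
  else
    return 0;                   /* stray continuation or bad lead byte */

  if (end - p < len)
    return 0;                   /* sequence truncated */

  cp = b & (0x7f >> len);       /* payload bits of the lead byte */
  for (i = 1; i < len; i++)
    {
      if ((p[i] & 0xc0) != 0x80)
        return 0;               /* not a continuation byte */
      cp = (cp << 6) | (p[i] & 0x3f);
    }

  if (cp < min)
    return 0;                   /* overlong, i.e. non-canonical */
  if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
    return 0;                   /* out of range or a surrogate */

  return len;
}

In a UTF-8 locale, append_uri_pathel would then skip over any
sequence for which this returns a positive length instead of treating
its bytes as controls.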