George Prekas [EMAIL PROTECTED] writes:
> Give some details about file name sanity.
Currently Wget encodes unsafe characters in file names according to
the rules defined for URLs: by replacing unsafe characters with a %hh
representation. The original rationale for this was to prevent
creation of directories named ~directory, and the URL functions were
used because they were convenient and because they were there.
The problems, however, far outweigh the advantages:
1. It needlessly confuses the users. When you download a file called
01-Foo Bar.mp3, you expect exactly that, not 01-Foo%20Bar.mp3.
2. It's technically wrong. The remote file name is Foo Bar.mp3,
not Foo%20Bar.mp3.
3. It's technically wrong. The file names are not URLs and what is
unsafe for URLs is not so for files. Characters such as spaces in
file names are completely common, despite their being unsafe for
URL use.
4. It confuses the link conversion code. A link to the local file
Foo%20Bar.mp3 will generate <a href="Foo%20Bar.mp3">. That happens
to work on some browsers, but not on others. The correct form would
be <a href="Foo%2520Bar.mp3">, but that in turn works only on the
other browsers.
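
To illustrate point 4, here is a small Python sketch (not Wget code) of
the double-encoding problem. It uses the standard urllib.parse module;
the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

remote_name = "Foo Bar.mp3"

# URL-style encoding of the file name, as described above.
local_name = quote(remote_name)

# Naive link: the local file name is used as the href unchanged.
# A browser that decodes it will look for "Foo Bar.mp3", which
# does not exist on disk.
naive_href = local_name

# Correct link: the '%' in the local file name is itself quoted,
# so decoding the href yields the real local file name.
correct_href = quote(local_name)

print(local_name)
print(unquote(naive_href))
print(correct_href)
```

Decoding the naive href gives back the remote name rather than the
name of the file on disk, which is why the link only works in browsers
that do not decode it.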
The solution is to have url_filename and friends call different,
file-name related functions. They should:
a) Be able to encode truly illegal or strange characters that appear
in the URL. For example, slashes may well appear in a query string:
baz?x=http://foo.bar.baz
b) Not use the % character for encoding, so as to not interfere with
URL quoting.
Only the bare minimum of characters should be encoded. The ones that
come to mind are '/' (illegal), '~' (rm -r ~foo is dangerous), '*' and
'?' (used in wildcards), control characters 0-31, and chars 128-159
(non-printable).
Under Windows, the list would also include the characters that are
illegal in Windows file names.
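
A minimal Python sketch of such an encoder follows. It is an
illustration under stated assumptions, not Wget's implementation: the
escape character '@', the two-digit hex form, and the names
encode_file_name and needs_encoding are all hypothetical. The post only
requires that the escape character not be '%'. The sketch also encodes
the escape character itself, which the post does not ask for, so that
encoded names stay unambiguous. Characters are treated as code points,
so 128-159 here means the C1 control range.

```python
# Hypothetical escape character; the only requirement is that it is not '%'.
ESCAPE = "@"

# Characters proposed for encoding, plus the escape character itself
# (an addition made by this sketch).
UNSAFE = set("/~*?") | {ESCAPE}

# Further characters that are not allowed in Windows file names.
WINDOWS_UNSAFE = set('\\:"<>|')


def needs_encoding(ch: str, windows: bool = False) -> bool:
    """Return True if ch is in the minimal set that must be encoded."""
    code = ord(ch)
    if ch in UNSAFE or code <= 31 or 128 <= code <= 159:
        return True
    return windows and ch in WINDOWS_UNSAFE


def encode_file_name(name: str, windows: bool = False) -> str:
    """Replace each unsafe character with ESCAPE plus two hex digits."""
    out = []
    for ch in name:
        if needs_encoding(ch, windows):
            out.append(f"{ESCAPE}{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


print(encode_file_name("01-Foo Bar.mp3"))
print(encode_file_name("baz?x=http://foo.bar.baz"))
print(encode_file_name("~directory"))
```

Spaces pass through untouched, so 01-Foo Bar.mp3 keeps its name, while
the query-string example has only its '?' and '/' characters replaced.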
This is not much work, but I haven't yet had time to do it.