Hello

The '%' character is valid within Win32 filenames. The '*' and '?' are not
valid filename characters.

The '*' and '?' are wildcard characters, which is probably why they were
excluded in previous versions.

There will always be problems mapping strings between namespaces, such as
URLs and file systems. WGET could be extended to call an optional shared
library provided by the user. This would permit the user to build a
URL/Filename mapping table however they chose.

In the meantime, however, '?' is problematic for Win32 users. It stops WGET
from working properly whenever it is found within a URL. Can we fix it,
please?


Kind regards
David Robinson

-----Original Message-----
From: Herold Heiko [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 16 January 2002 03:28
To: Wget Development
Subject: RE: Mapping URLs to filenames


Some comments.

a) escape character remapping -> % not best choice ?
If I understood correctly how you are proposing to remap the urls to
directories and files, we'll need to remap the escape character too, IF
that character is a legal char for urls; otherwise it would not be
immediately obvious whether a %xx was part of the url or is a translation
made by wget.
This means IF a url did contain something like somewhat%other we'll have
a file named somewhat%25other (supposing the charset used to
generate the hex values contains % at 0x25 like ascii does), but this
also means a url fragment some%20what would map to a file some%2520what
- not a pretty thing. So possibly the % is not a good choice.
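
To make the double-escaping concern concrete, here is a minimal Python
sketch. The function name escape_name and the example set of illegal
characters are assumptions for illustration only; nothing here is taken
from Wget's sources.

```python
def escape_name(name: str, illegal: str, esc: str = "%") -> str:
    """Replace each illegal character, and the escape character itself,
    with esc followed by two uppercase hex digits.

    Assumes every escaped character is ASCII, so two hex digits suffice.
    """
    out = []
    for ch in name:
        if ch == esc or ch in illegal:
            out.append(f"{esc}{ord(ch):02X}")
        else:
            out.append(ch)
    return "".join(out)


# With '%' as the escape character, an already-encoded fragment grows:
print(escape_name("some%20what", illegal="?*", esc="%"))  # some%2520what
# With '@' as the escape character, the '%' passes through untouched:
print(escape_name("some%20what", illegal="?*", esc="@"))  # some%20what
print(escape_name("page?id=1", illegal="?*", esc="@"))    # page@3Fid=1
```

Choosing an escape character that rarely occurs in urls keeps most
names unchanged, which is the argument against '%' made above.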

b) we're treating mostly html - remap to html entities ?
Would it be good to map some characters to things like 'agrave' instead
of hex values ? Probably not. Forget it.

b) @ on windows
I'm not sure if on some dos/win* platforms the % was ever an illegal
character.
As you stated, dos/win batch files could generate real ugliness with
files containing, say, %06%20 or something (not that it should ever be
part of a url, but...).
Please note, if some character other than % is the escape char (say @),
some%thing should not be encoded, but on windows some%20thing most
definitely should (dangerous three-char combination, a batchfile could
later somehow interpret this as "positional parameter number 20"). But
see my later point for the "at least on windows" part.

c) filename length
Why remap only dangerous characters ? There are file systems with
file/directory name length limitations (minix 14 ? old unixes 14 ?
dos 8.3 ? iso9660 8.3 ? iso9660+joliet 63 ? Some other at 254 ? All of
these could have some problems with long filenames, urls generated by
cgi/jsp/whatever and so on).
However, remapping long files/directories to shorter ones creates a BIG
problem (IIRC first raised by Dan): collision. Say the current file
system is dossish and supports minimal 8.3 filenames. How to remap if
we need to save in the same directory both 01234567.htm and
0123456789.htm and lots of similar filenames ? Whatever mapping is done,
"later" another file in the same directory could need exactly that name
- which means the only way to have a completely working mapping between
url fragments and filenames is an external table (some file wget
maintains in every directory).
Note the "every" - if that table were unique for the whole download,
say, in the starting directory, it would not be available anymore if
later only some branch of the downloaded directories is used for a
successive run, so the table location must be obvious from the directory
location itself. Having a single, unique master table for the whole
download would mean a lot of splicing and joining when changing parts of
the local copy before a successive run. Having a different location (not
in the directory itself) would mean more difficulty when moving those
directories around the local filesystem (need to move the directory and
- somewhere else - the table).
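
The per-directory table idea can be sketched roughly as follows, in
Python. Everything here is hypothetical: the table file name
wgetmap.tbl, the JSON format, the "~N" suffix scheme and the function
names are assumptions for illustration, not a proposal for the actual
on-disk format. Case-insensitive matching, which a real dos-like file
system would need, is ignored.

```python
import json
import os

TABLE_NAME = "wgetmap.tbl"  # hypothetical per-directory table file


def load_table(directory):
    """Read the url-part -> short-name table kept in this directory."""
    path = os.path.join(directory, TABLE_NAME)
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_table(directory, table):
    """Write the table back next to the files it describes."""
    path = os.path.join(directory, TABLE_NAME)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(table, f, indent=1)


def short_name(url_part, table, max_base=8, max_ext=3):
    """Return a stable 8.3-style name for url_part, recording it in table."""
    if url_part in table:
        return table[url_part]          # same answer on every later run
    base, ext = os.path.splitext(url_part)
    ext = ext[1:1 + max_ext]
    dot_ext = "." + ext if ext else ""
    used = set(table.values()) | {TABLE_NAME}
    candidate = base[:max_base] + dot_ext
    n = 1
    while candidate in used:            # name taken: try ~1, ~2, ...
        suffix = "~" + str(n)
        candidate = base[:max_base - len(suffix)] + suffix + dot_ext
        n += 1
    table[url_part] = candidate
    return candidate


table = {}
print(short_name("01234567.htm", table))    # 01234567.htm
print(short_name("0123456789.htm", table))  # 012345~1.htm
print(short_name("012345~1.htm", table))    # 012345~2.htm
```

The third call shows the "later another file could need exactly that
name" case: the name is already taken by an earlier mapping, so the
newcomer is renamed and the table records both.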

d) "presets"
As you said there's always the odd combination (save as vfat from linux,
save as iso9660 from whatever os, etc.). Users should not be required to
know exactly what the requirements are (at least for the more usual file
systems - generic "longnames" unix, vfat, fat, ntfs, iso9660, vms, minix
should cover most cases) - they are users, not admins.
Besides the possibility of specifying an exact, manual, detailed setup
(the command line is probably too complex; I'd say a rule file specified
from the command line or .wgetrc), there should be some presets included
for those usual cases mentioned above. Possibly the above+iso9660, too.
This could be as easy as some ruleset files included in the sources,
mentioned in the docs and installed by default (/usr/local/lib/wget or
wherever), or even compiled in, although compiling in any ruleset
different than the default is probably not worth it (to avoid binary
bloating, we need to be able to load external rules anyway for the
user-provided ones).
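
A rough Python sketch of what such presets, and a combination like
"the above+iso9660", might look like. The Ruleset type, the preset
names and the combine function are all hypothetical, and the character
sets and length limits are illustrative assumptions only (the 63 for
joliet is simply the figure quoted in this message); they would need
checking against each file system's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ruleset:
    illegal: frozenset  # characters that must be escaped in a name
    max_len: int        # longest allowed file or directory name


# Illustrative values only, not verified limits.
PRESETS = {
    "unix":   Ruleset(frozenset("\0"), 255),
    "vfat":   Ruleset(frozenset('\\:*?"<>|'), 255),
    "joliet": Ruleset(frozenset("\\:*?;"), 63),
}


def combine(*names):
    """Build the strictest ruleset that satisfies every named preset."""
    illegal = frozenset().union(*(PRESETS[n].illegal for n in names))
    max_len = min(PRESETS[n].max_len for n in names)
    return Ruleset(illegal, max_len)


both = combine("vfat", "joliet")
print(sorted(both.illegal), both.max_len)
```

Combining takes the union of the illegal characters and the smallest
length limit, so a name valid under the combined ruleset is valid
under each preset it was built from.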

e) minimum possibility preset ? default ?
Another problem arises with downloaded directories moved from one
filesystem to another. I don't know how often this happens, but I
definitely ran into problems rather often when moving files
downloaded on solaris to windows (vfat) machines, mainly whenever some
file contained ':'. These days moving files from whatever to iso9660 (cd
burning) could happen rather a lot, too. Then there's samba and so on. OH
BTW I never checked how samba handles the file renaming, could be
worth a look.
Anyway what I mean is, there should be a "minimal" preset, with rules
generating files without problems on *every* filesystem (we thought of).
Well, say one like that (really minimal) and one working on every file
system with a decent minimal filename length - say iso9660+joliet (63
chars I believe), this would exclude the length limitations of pure dos,
minix, possibly some other ancient unixes.
Anyway the sum of the character limitations of all those filesystems
(except the limitations present on vms, if any - no idea) should not be
too painful to handle; and (in my experience) 63 chars are enough for a
good 99+% of cases. So a ruleset like this could possibly be a candidate
for the default. Why ?

So somebody could argue this should be the DEFAULT in order to avoid
problems for whoever didn't think before downloading or simply didn't
know problems could arise. In other words, pure users - avoid the "Argh
I have 500MB, a week worth of download on my slow link, now I can't burn
them on CD. OK, I'll make a cpio/tar/zip and burn that one. Argh now I
can't extract it on the destination machine because there are bad chars.
Let's go and rename 1000 files and edit 100 htmls referencing them,
sigh"

Somebody else could argue whoever didn't think or know should damn well
learn to think or just learn, and bad experiences ("Argh!") are the best
way to learn. Possibly the majority of wget users are capable enough...
but should wget be a nice tool for sysadmins and the like, or a tool for
users (who'd like an out-of-the-box working tool, not a
sometimes-a-nightmare-because-I-forgot) and sysadmins (who are capable
of changing some defaults) ?
Please note, there'll be no problems for people who, say, mirror a
website hosted on unix on a unix box - they just need to select the
"unix" ruleset instead of the default "minimal-with-longish-filenames",
and everything is ok.

Personally I would feel ok with either default, as will be everybody who
does read the documentation of a tool before using it.

Heiko

-- 
-- PREVINET S.p.A.            [EMAIL PROTECTED]
-- Via Ferretto, 1            ph  x39-041-5907073
-- I-31021 Mogliano V.to (TV) fax x39-041-5907087
-- ITALY

> -----Original Message-----
> From: Ian Abbott [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, January 15, 2002 1:00 PM
> To: Wget Development
> Subject: Mapping URLs to filenames
> 
> 
> This is an initial proposal for naming the files and directories
> that Wget creates, based on the URLs of the retrieved documents.
> 
> At the moment there are many complaints about Wget failing to save
> documents which have '?' in their URLs when running under Windows,
> for example. In general, the set of illegal characters in
> file-names depends on the operating system and the file-system
> in use. Wget can be compiled for different operating systems, but
> doesn't know which file-system is being used - you may get the
> oddball who wants to save files to a vfat file-system from Linux
> for example! Therefore, there should be some way to override or
> augment the set of illegal filename characters using a wgetrc
> command, for example.
> 
> File-names used within the internals of Wget need to be converted
> to an external form which deals with illegal characters or illegal
> sequences of characters in the file-name. The internal filename
> consists of directory separators ('/'), illegal characters, a
> nominated 'escape' character and other (legal) characters.
> 
> Illegal characters in the internal file-name can be mapped to an
> escape sequence in the external file-name, consisting of the escape
> character followed by two hex digits (it is assumed that both the
> escape character and the hex digits are legal file-name characters
> for the operating system and file-system in use!). Escape
> characters in the internal file-name can be mapped to an escape
> sequence in the same way.
> 
> The directory separator character ('/') in the internal file-name
> is usually mapped to the directory hierarchy on the file-system, but
> if the internal file-name contains two or more consecutive
> directory separator characters, some of these will need to be
> escaped to avoid trying to create directories with null names. (An
> alternate solution is to create a directory whose name consists
> solely of a single escape character.)
> 
> The external file-names are easily reversible back to the internal
> form when necessary.
> 
> The obvious candidate for the escape character is the '%'
> character, although the escape mechanism for file-names is
> logically distinct from the escape mechanism for HTTP. The current
> version of Wget for Windows remaps all '%' characters to '@', so
> perhaps '@' is a better candidate for the escape character for
> Windows. (I'm not sure why Wget does this, as '%' seems to be a
> legal file-name character for Windows and MS-DOS. Perhaps it is
> for usability reasons due to the command shell's variable
> interpolation of '%name%' sequences.) The escape character can be
> made operating system dependent, and perhaps could be overridden
> with a wgetrc command.
> 
> That's my initial proposal anyway. I'm not sure how things such
> as UTF-8 should be handled, or if that's an issue at all.
> 

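The internal-to-external mapping described in the proposal above, and
its reversal, can be sketched in Python as follows. This is only an
illustration under stated assumptions: '@' as the escape character, a
Windows-like illegal set, and the function names are all hypothetical.
Empty path components (from consecutive '/' separators) use the
"directory named by a single escape character" alternative mentioned
in the proposal.

```python
import re

ESC = "@"               # assumed escape character
ILLEGAL = '\\:*?"<>|'   # assumed illegal set (Windows-like, ASCII only)


def escape_component(comp):
    """Map one internal path component to its external form."""
    if comp == "":
        return ESC      # null name becomes a lone escape character
    return "".join(
        f"{ESC}{ord(c):02X}" if c == ESC or c in ILLEGAL else c
        for c in comp
    )


def unescape_component(comp):
    """Reverse escape_component."""
    if comp == ESC:
        return ""
    pattern = re.escape(ESC) + r"([0-9A-F]{2})"
    return re.sub(pattern, lambda m: chr(int(m.group(1), 16)), comp)


def to_external(internal):
    return "/".join(escape_component(c) for c in internal.split("/"))


def to_internal(external):
    return "/".join(unescape_component(c) for c in external.split("/"))


name = "dir//index.html?user=a@b"
ext = to_external(name)
print(ext)                       # dir/@/index.html@3Fuser=a@40b
print(to_internal(ext) == name)  # True
```

Because the escape character itself is always escaped, a lone '@'
component in the external form can only mean an empty internal
component, which is what keeps the mapping reversible.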