Thanks averyone for the contributions.

Ultimately, our purpose is to process documents from the site into our
search database, so probably the most important thing is to limit the
number of files being processed.  The case of  the URLs in the html
probably wouldn't cause us much concern, but I could see that it might
be useful to "convert" a site for mirroring from a non-case sensetive
(windows) environment to a case sensetive (li|u)nix one - this would
need to include translation of urls in content as well as filenames on
disk.

In the meantime - does anyone know of a proxy server that could
translate urls from mixed case to lower case.  I thought that if we
downloaded using wget via such a proxy server we might get the
appropriate result.  

The other alternative we were thinking of was to post process the files
with symlinks for all mixed case versions of files and directories (I
think someone already suggested this - greate minds and all that...). I
assume that wget would correctly use the symlink to determine the
time/date stamp of the file for determining if it requires updating (or
would it use the time/date stamp of the symlink?). I also assume that if
wget downloaded the file it would overwrite the symlink and we would
have to run our "convert files to" symlinks process again.

Just to put it in perspective, the actual site is approximately 45gb
(that's what the administrator said) and wget downloaded > 100gb
(463,000 files) when I did the first process.

Cheers
Allan

-----Original Message-----
From: Micah Cowan [mailto:[EMAIL PROTECTED] 
Sent: Saturday, 14 June 2008 7:30 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensetivity and URLs


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> Unfortunately, nothing really comes to mind. If you'd like, you could

>> file a feature request at 
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option

>> asking Wget to treat URLs case-insensitively.
> 
> To have the effect that Allan seeks, I think the option would have to 
> convert all URIs to lower case at an appropriate point in the process.

> I think you probably want to send the original case to the server 
> (just in case it really does matter to the server). If you're going to

> treat different case URIs as matching then the lower-case version will

> have to be stored in the hash. The most important part (from the 
> perspective that Allan voices) is that the versions written to disk 
> use lower case characters.

Well, that really depends. If it's doing a straight recursive download,
without preexisting local files, then all that's really necessary is to
do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it.
Another solution would be to scan directory contents for the first name
that matches case insensitively. That's obviously much less efficient,
but has the advantage that the file will match at least one of the
"real" cases from the server.

As Matthias points out, your lower-case normalization solution could be
achieved in a more general manner with a hook. Which is something I was
planning on introducing perhaps in 1.13 anyway (so you could, say, run
sed on the filenames before Wget uses them), so that's probably the
approach I'd take. But probably not before 1.13, even if someone
provides a patch for it in time for 1.12 (too many other things to focus
on, and I'd like to introduce the "external command" hooks as a suite,
if possible).

OTOH, case normalization in the blacklists would still be useful, in
addition to that mechanism. Could make another good addition for 1.13
(because it'll be more useful in combination with the rename hooks).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIUua+7M8hyUobTrERAr0tAJ98A/WCfPNhTOQ3Xcfx2eWP2stofgCcDUUQ
nVYivipui+0TRmmK04kD2JE=
=OMsD
-----END PGP SIGNATURE-----

Reply via email to