Thanks everyone for the contributions. Ultimately, our purpose is to process documents from the site into our search database, so probably the most important thing is to limit the number of files being processed. The case of the URLs in the HTML probably wouldn't cause us much concern, but I could see that it might be useful to "convert" a site for mirroring from a non-case-sensitive (Windows) environment to a case-sensitive (li|u)nix one - this would need to include translation of URLs in content as well as filenames on disk.
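As a rough illustration of limiting the number of files processed, a mirror that has already been downloaded could be grouped by lower-cased relative path, so that only one file per group is passed on to the indexer. This is only a sketch: the function names `group_case_variants` and `walk_relative_paths` are made up for the example, and it assumes that two paths differing only in case really are the same document on the server.

```python
import os
from collections import defaultdict


def group_case_variants(relative_paths):
    """Group paths that differ only by letter case.

    Returns a dict mapping the lower-cased path to the sorted list of
    variants, keeping only groups that contain more than one variant.
    """
    groups = defaultdict(set)
    for path in relative_paths:
        groups[path.lower()].add(path)
    return {key: sorted(variants)
            for key, variants in groups.items()
            if len(variants) > 1}


def walk_relative_paths(root):
    """Yield every file under root as a path relative to root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)


if __name__ == "__main__":
    # Hypothetical mirror location; replace with the real download directory.
    duplicates = group_case_variants(walk_relative_paths("mirror"))
    for key, variants in sorted(duplicates.items()):
        print(key, "->", ", ".join(variants))
```

The grouping is kept separate from the directory walk so that it can be checked on a plain list of paths, without depending on whether the local filesystem is case sensitive.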
In the meantime - does anyone know of a proxy server that could translate URLs from mixed case to lower case? I thought that if we downloaded using wget via such a proxy server we might get the appropriate result.

The other alternative we were thinking of was to post-process the files with symlinks for all mixed-case versions of files and directories (I think someone already suggested this - great minds and all that...). I assume that wget would correctly use the symlink to determine the time/date stamp of the file for determining if it requires updating (or would it use the time/date stamp of the symlink?). I also assume that if wget downloaded the file it would overwrite the symlink and we would have to run our "convert files to symlinks" process again.

Just to put it in perspective, the actual site is approximately 45 GB (that's what the administrator said) and wget downloaded > 100 GB (463,000 files) when I did the first process.

Cheers
Allan

-----Original Message-----
From: Micah Cowan [mailto:[EMAIL PROTECTED]
Sent: Saturday, 14 June 2008 7:30 AM
To: Tony Lewis
Cc: Coombe, Allan David (DPS); 'Wget'
Subject: Re: Wget 1.11.3 - case sensetivity and URLs

Tony Lewis wrote:
> Micah Cowan wrote:
>
>> Unfortunately, nothing really comes to mind. If you'd like, you could
>> file a feature request at
>> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
>> asking Wget to treat URLs case-insensitively.
>
> To have the effect that Allan seeks, I think the option would have to
> convert all URIs to lower case at an appropriate point in the process.
> I think you probably want to send the original case to the server
> (just in case it really does matter to the server). If you're going to
> treat different case URIs as matching then the lower-case version will
> have to be stored in the hash. The most important part (from the
> perspective that Allan voices) is that the versions written to disk
> use lower case characters.

Well, that really depends. If it's doing a straight recursive download, without preexisting local files, then all that's really necessary is to do lookups/stores in the blacklist in a case-normalized manner.

If preexisting files matter, then yes, your solution would fix it. Another solution would be to scan directory contents for the first name that matches case-insensitively. That's obviously much less efficient, but has the advantage that the file will match at least one of the "real" cases from the server.

As Matthias points out, your lower-case normalization solution could be achieved in a more general manner with a hook. That is something I was planning on introducing perhaps in 1.13 anyway (so you could, say, run sed on the filenames before Wget uses them), so that's probably the approach I'd take. But probably not before 1.13, even if someone provides a patch for it in time for 1.12 (too many other things to focus on, and I'd like to introduce the "external command" hooks as a suite, if possible).

OTOH, case normalization in the blacklists would still be useful, in addition to that mechanism. It could make another good addition for 1.13 (because it'll be more useful in combination with the rename hooks).

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer,
and GNU Wget Project Maintainer.
http://micah.cowan.name/
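
The two ideas from the reply above (case-normalized lookups/stores in the blacklist, and scanning a directory for the first name that matches case-insensitively) could be sketched roughly as follows. This is illustrative Python, not Wget's actual C implementation; the class `CaseInsensitiveBlacklist`, the function `find_existing_name`, and the example.com URLs are all invented for the example. Lower-casing the whole URL is the coarsest possible normalization and is an assumption here; the original case is kept so that it can still be sent to the server, as Tony suggests.

```python
import os


class CaseInsensitiveBlacklist:
    """Set of already-seen URLs, keyed by their lower-cased form."""

    def __init__(self):
        # lower-cased URL -> URL exactly as first seen
        self._seen = {}

    def add(self, url):
        """Record url. Return True if it is new, False if the same URL
        (ignoring case) has already been recorded."""
        key = url.lower()
        if key in self._seen:
            return False
        self._seen[key] = url
        return True

    def original(self, url):
        """Return the first-seen spelling of url, or None if unseen."""
        return self._seen.get(url.lower())


def find_existing_name(directory, name):
    """Return the first directory entry matching name case-insensitively,
    or None. Entries are sorted so that "first" is deterministic."""
    wanted = name.lower()
    for entry in sorted(os.listdir(directory)):
        if entry.lower() == wanted:
            return entry
    return None


if __name__ == "__main__":
    seen = CaseInsensitiveBlacklist()
    for url in ("http://example.com/Docs/Index.html",
                "http://example.com/docs/index.html"):
        status = "fetch" if seen.add(url) else "skip (case variant)"
        print(url, "->", status)
```

As noted in the reply, the directory scan is linear in the number of entries for every lookup, so it is much less efficient than the hash lookup, but the name it returns is one that the server actually used.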