Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> If you mean that you want Wget to find any file that matches that
>> wildcard, well no: Wget can do that for FTP, which supports directory
>> listings; it can't do that for HTTP, which has no means for listing
>> files in a "directory" (unless it has been extended, for example with
>> WebDAV, to do so).
> 
> Seems to me that is a big "unless" because we've all seen lots of websites
> that have http directory listings. Apache will do it out of the box (and by
> default) if there is no index.htm[l] file in the directory.

No it won't. What I was talking about was a standard way to list
directory contents. What Apache and other web servers do by default is
not quite the same thing: it's not particularly easy to distinguish
between when the server is sending us a "web page" and when it's sending
a "directory listing". WebDAV's PROPFIND, OTOH, is always a dependable,
standardized way to get a listing (but requires that the relatively rare
WebDAV extensions to HTTP be enabled on the server).

I suppose that we could parse out the contents of the hierarchical
component just prior to the wildcarded component, and look for relative
URLs at the same hierarchy level, which could be a decent heuristic for
seeing if it's a generated index; but that seems to me too much trouble
to be worth it for such an undependable method. The same method will
work at some locations on a web server, and not at others, which could
be pretty confusing to users.
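For concreteness, a rough sketch of that heuristic in Python (rather
than wget's C); the names and structure here are mine, purely for
illustration, and it only handles the easy cases:

```python
# Sketch of the heuristic described above: given the HTML served for the
# hierarchy level just before the wildcarded component, collect relative
# links at that same level and match them against the wildcard.
from fnmatch import fnmatch
from html.parser import HTMLParser
from urllib.parse import urlsplit

class LinkCollector(HTMLParser):
    """Collects href attributes from <a> tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def match_index_links(index_html, pattern):
    """Return hrefs that look like same-level files matching pattern."""
    collector = LinkCollector()
    collector.feed(index_html)
    matches = []
    for href in collector.hrefs:
        parts = urlsplit(href)
        # Keep only relative links at this hierarchy level, with no
        # query string (a bare query suggests a sort link, not a file).
        if parts.scheme or parts.netloc or parts.query or "/" in parts.path:
            continue
        if fnmatch(parts.path, pattern):
            matches.append(parts.path)
    return matches
```

Even this much shows the problem: it silently assumes the page really is
a generated index, which is exactly the guess we can't make reliably.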

One could argue that we are already forced to use heuristics to
interpret FTP listings, but the truth is that, for those, (1) we always
know the results are a listing, and not potentially something else;
(2) most of the world seems to have standardized on Unix-style listing
output AFAICT, meaning there's a common de facto standard; and (3) there
is now a more rigidly-defined standard for obtaining directory listings
in FTP, which I'm hoping may catch on.
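(The rigid standard in (3) is presumably MLSD from RFC 3659, whose
machine-readable "fact" lines leave essentially nothing to guess at. A
quick sketch of parsing one such line, with an invented example
filename:

```python
# An MLSD response line (RFC 3659) is a series of "fact=value;" pairs,
# then a single space, then the pathname, e.g.:
#   type=file;size=1024;modify=20070905120000; mc01.gif
def parse_mlsd_line(line):
    facts_part, _, name = line.partition(" ")
    facts = {}
    for fact in facts_part.split(";"):
        if "=" in fact:
            key, _, value = fact.partition("=")
            facts[key.lower()] = value
    return name, facts
```

For what it's worth, Python's ftplib exposes this as FTP.mlsd(); the
point of the sketch is just how little guessing the format requires,
compared to scraping HTML.)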

For HTTP, we would have to analyze the file contents to even figure out
whether we think it's a directory listing or not.

> Perhaps we could have a feature to grab all or some of the files in a HTTP
> directory listing. Maybe something like this could be made to work:
> 
> wget http://www.exelana.com/images/mc*.gif
> 
> Perhaps we would need an option such as --http-directory (the first thing
> that came to mind, but not necessarily the most intuitive name for the
> option) to explicitly tell wget how it is expected to behave. Or perhaps it
> can just try stripping the filename when doing an http request and wildcards
> are specified.

How many different ways could wget behave to try to get an
"http-directory" from an index output? Stripping would seem the best
option (but I'm still against this idea).

> At any rate (with or without the command line option), wget would retrieve
> http://www.exelana.com/images/ and then retrieve any links where the target
> matches mc*.gif.
> 
> If wget is going to explicitly support http directory listings, it probably
> needs to be intelligent enough to ignore the sorting options. In the case of
> Apache, that would be things like <A HREF="?N=D">Name</A>.

IMO, that's potentially dangerous, if we end up guessing wrong about
this being a simple directory listing.
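For what it's worth, if one did go down this road, Apache's sort links
are at least easy to recognize: they are hrefs consisting of nothing but
a query string. A minimal sketch in Python (the N/M/S/D column keys are
Apache's defaults, mentioned only as background; the check itself
doesn't need them):

```python
from urllib.parse import urlsplit

def is_sort_link(href):
    """True for hrefs like "?N=D" that Apache's generated indexes use
    for column sorting: an empty path with a non-empty query string."""
    parts = urlsplit(href)
    return parts.path == "" and parts.query != ""
```

But recognizing the sort links doesn't help with the real risk, which is
misidentifying an ordinary page as an index in the first place.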

As far as I'm concerned, wildcarding plain HTTP isn't really worth the
effort, as plain HTTP just wasn't designed to do listings (or copying,
or any of the other things WebDAV was introduced for). And we can
accomplish the same thing by specifying the "directory" URL and giving
-A.gif (in fact -A accepts wildcard patterns as well as suffixes, so
-A 'mc*.gif' would work too).

(Yes, I'm aware that -A.gif will also get gifs at lower directory
levels; but, as we've been discussing recently, it is likely that we
will at the very least be providing an option not to fetch the HTML
files when they were not explicitly specified, to facilitate migration
to more generic MIME-type processing. And since we also plan on making
options like -A restrictable to specific URLs, it seems we'll have
sufficient handling for this sort of thing through that avenue, without
having to make assumptions about whether we're looking at a "directory
listing" or not.)

Note that I'm not saying we should implement wildcarding for servers
that do support WebDAV, either: AFAICT DAV is still a fairly rare thing
to encounter, and not worth implementing for. But I probably wouldn't
refuse a patch for that if someone _else_ wants to put the work in,
whereas I'm not really convinced yet in the case of "generated indexes"
in plain HTTP.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
