-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Tony Lewis wrote:
> Micah Cowan wrote:
>
>> If you mean that you want Wget to find any file that matches that
>> wildcard, well no: Wget can do that for FTP, which supports directory
>> listings; it can't do that for HTTP, which has no means for listing
>> files in a "directory" (unless it has been extended, for example with
>> WebDAV, to do so).
>
> Seems to me that is a big "unless" because we've all seen lots of websites
> that have http directory listings. Apache will do it out of the box (and
> by default) if there is no index.htm[l] file in the directory.
No it won't. What I was talking about was a standard way to list directory
contents. What Apache and other web servers do by default is not quite the
same thing: it's not particularly easy to distinguish between the server
sending us a "web page" and the server sending us a "directory listing".
WebDAV's PROPFIND, OTOH, is always a dependable, standardized way to get a
listing (but it requires that the relatively rare WebDAV extensions to
HTTP be enabled at the server).

I suppose we could parse out the contents of the hierarchical component
just prior to the wildcarded component, and look for relative URLs at the
same hierarchy level, which could be a decent heuristic for detecting a
generated index; but that seems to me too much trouble to be worth it for
such an undependable method. The same method would work at some locations
on a web server and not at others, which could be pretty confusing to
users.

One could argue that we are already forced to use heuristics to interpret
FTP listings, but the truth is that for FTP: (1) we always know the
results are a listing, and not potentially something else; (2) most of the
world seems to have standardized on Unix-style listings AFAICT, meaning
there's a common de facto standard; and (3) there is now a more
rigidly-defined standard for obtaining directory listings in FTP, which
I'm hoping may catch on. For HTTP, we would have to analyze the file
contents even to figure out whether we think it's a directory listing or
not.

> Perhaps we could have a feature to grab all or some of the files in an
> HTTP directory listing. Maybe something like this could be made to work:
>
> wget http://www.exelana.com/images/mc*.gif
>
> Perhaps we would need an option such as --http-directory (the first thing
> that came to mind, but not necessarily the most intuitive name for the
> option) to explicitly tell wget how it is expected to behave.
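For what it's worth, the "decent heuristic" I described above might look
something like the following rough Python sketch: scrape the hrefs out of a
page we merely *suspect* is a generated index, keep only the relative links
sitting at the same hierarchy level as the requested "directory", skip
query-only links (such as Apache's column-sort headers), and match the rest
against the wildcard. The function name, the regex-based href scraping, and
the sample HTML are all illustrative assumptions of mine, not anything that
exists in Wget:

```python
import fnmatch
import re
from urllib.parse import urljoin

def match_index_links(base_url, html, pattern):
    """Heuristic sketch: collect links from a suspected generated-index
    page that sit at the same hierarchy level as base_url and match the
    shell-style wildcard pattern."""
    matches = []
    for href in re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE):
        if href.startswith('?'):
            # Apache's column-sort links (e.g. ?N=D) are queries on the
            # index itself, not files; skip them outright.
            continue
        resolved = urljoin(base_url, href)
        if not resolved.startswith(base_url):
            continue  # different host, or a parent-directory link
        leaf = resolved[len(base_url):]
        if not leaf or '/' in leaf or '?' in leaf:
            continue  # the index itself, a subdirectory, or a query link
        if fnmatch.fnmatch(leaf, pattern):
            matches.append(resolved)
    return matches

listing = ('<a href="?N=D">Name</a> <a href="mc01.gif">mc01.gif</a> '
           '<a href="mc02.gif">mc02.gif</a> <a href="sub/">sub/</a> '
           '<a href="photo.jpg">photo.jpg</a>')
print(match_index_links("http://www.exelana.com/images/", listing, "mc*.gif"))
# ['http://www.exelana.com/images/mc01.gif',
#  'http://www.exelana.com/images/mc02.gif']
```

Which only underscores the problem: nothing here tells us the page really
was a listing rather than an ordinary page that happens to link to its
siblings.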
> Or perhaps it can just try stripping the filename when doing an http
> request and wildcards are specified.

How many different ways could wget behave to try to get an
"http-directory" from an index output? Stripping would seem the best
option (but I'm still against this idea).

> At any rate (with or without the command line option), wget would
> retrieve http://www.exelana.com/images/ and then retrieve any links
> where the target matches mc*.gif.
>
> If wget is going to explicitly support http directory listings, it
> probably needs to be intelligent enough to ignore the sorting options.
> In the case of Apache, that would be things like <A HREF="?N=D">Name</A>.

IMO, that's potentially dangerous, if we end up guessing wrong about this
being a simple directory listing. As far as I'm concerned, wildcarding
plain HTTP isn't really worth the effort, as plain HTTP just wasn't
designed to do listings (or copying, or any of the other things WebDAV was
introduced for). And we can accomplish the same thing by specifying the
"directory" URL and giving -A.gif.

(Yes, I'm aware that -A.gif will also get gifs at lower directory levels;
but as we've been discussing recently, it is likely that we will at the
very least be providing an option not to fetch HTML files when they were
not specified, to facilitate migration to more generic MIME-type
processing. And, since we also plan on making options like -A restrictable
to specific URLs, it seems we'll have sufficient handling for this sort of
thing through that avenue, without having to make assumptions about
whether we're looking at a "directory listing" or not.)

Note that I'm not saying we should implement wildcarding for servers that
do support WebDAV, either: AFAICT DAV is still a fairly rare thing to
encounter, and not worth implementing. But I probably wouldn't refuse a
patch for that if someone _else_ wants to put the work in, whereas I'm not
really convinced yet for "generated indexes" in plain HTTP.

- --
Micah J.
Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD4DBQFG4C9z7M8hyUobTrERCNo1AKCFSFGtEm7xbFXEUEt6f90xcLLpPwCYmZzK
s74UH0Do7pPux94sQ7eEew==
=0qNO
-----END PGP SIGNATURE-----