Micah Cowan wrote:
> Yeah, that was the original thinking. But I still hate it. For one
> thing, there are no longer any guarantees that recurse-able HTML files
> end in ".html"

There are a bunch of suffixes that are actively used for HTML, and there is no reason that a file has to include a suffix at all. Furthermore, the existence of a .html suffix is no guarantee that the file really contains HTML.

> It's better to let you explicitly specify what files to download

I think an option that says "spider the site and save any PDF files that you find" is useful. It's a matter of figuring out a meaningful way to implement "spider the site" for this scenario. I wonder if it would make more sense to look at the Content-Type header and only parse "text/html" files. By sending a HEAD request, you can quickly ignore files that don't need to be parsed.

Tony
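
P.S. Here is a rough sketch of the Content-Type idea, written in Python rather than wget's C just to show the shape of the check. The names media_type, head_content_type and should_parse_for_links are made up for illustration and are not wget code. It assumes the server answers HEAD with the same Content-Type it would send for GET, which is not guaranteed in practice.

```python
import urllib.request


def media_type(header_value):
    """Return the bare, lowercased media type from a Content-Type value.

    Parameters such as "; charset=UTF-8" are dropped. A missing header
    yields an empty string.
    """
    if not header_value:
        return ""
    return header_value.split(";", 1)[0].strip().lower()


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the raw Content-Type header (or None)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")


def should_parse_for_links(url, fetch_content_type=head_content_type):
    """Decide whether a URL is worth downloading and parsing for links.

    Only "text/html" counts here, ignoring the file suffix entirely.
    fetch_content_type is injectable so the decision can be exercised
    without touching the network.
    """
    return media_type(fetch_content_type(url)) == "text/html"
```

A spider built this way would call should_parse_for_links() on each candidate URL and fetch the body only when it returns True, or when the URL matches what the user asked to save (PDF files, say). A real implementation would probably want a fallback for servers that reject or mishandle HEAD.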