Micah Cowan wrote:

> Yeah, that was the original thinking. But I still hate it. For one
> thing, there are no longer any guarantees that recurse-able HTML files
> end in ".html"

Several suffixes are actively used for HTML, and nothing requires a file to
have a suffix at all. Furthermore, a .html suffix is no guarantee that the
file really contains HTML.

> It's better to let you explicitly specify what files to download

I think an option that says "spider the site and save any PDF files that you
find" is useful. It's a matter of figuring out a meaningful way to implement
"spider the site" for this scenario.

I wonder if it would make more sense to look at the Content-Type header and
only parse "text/html" files. By sending a HEAD request first, you can
quickly skip files that don't need to be parsed without downloading them.
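
To make the idea concrete, here is a rough Python sketch. It is not wget
code; the names should_parse, head_content_type and PARSEABLE_TYPES are made
up for illustration, and it assumes the server answers HEAD requests and
reports an honest Content-Type.

```python
import urllib.request

# Assumption: only "text/html" is parsed for links, as suggested above.
PARSEABLE_TYPES = {"text/html"}


def should_parse(content_type_header):
    """Return True if a Content-Type header value names a parseable type."""
    if not content_type_header:
        # No header at all: there is nothing to go on, so do not parse.
        return False
    # Drop parameters such as "; charset=UTF-8" and compare case-insensitively.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in PARSEABLE_TYPES


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

A spider could call should_parse(head_content_type(url)) before deciding
whether to fetch a page and follow its links. Anything else, such as a PDF,
would be saved or skipped without being parsed.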

Tony


