Gavin Sherlock wrote:
> i.e. the image is generated on the fly from a script, which then
> essentially prints the image back to the browser with the correct
> mime type. While this is a non-standard way to include an image on a
> page, the --page-requisites are not fulfilled when retrieving this
> web page.

I don't think you can consider this a "non-standard way". I'm sure there's a whole paragraph in an RFC, or in the HTML 4.01 spec, about properly dealing with URIs, linked resources and MIME types.

For example, if a PNG image is generated by a PHP script named "gen_png_image.php", I think wget should be able to download it when the "--page-requisites" option is used, because it is part of the page and not an external resource. It should then get its MIME type, "image/png", and the "--convert-links" option should also rename the script-image to "gen_png_image.png".

I tried the "--page-requisites" option and got my test page, at http://jmmolina.free.fr/t_39638/, perfectly archived. The original names are kept and the page is 100% browsable offline. The script name is still "gen_png_image.php". Then I used the "--convert-links" option to see whether the script would be renamed to a PNG image; it wasn't.

To compare this behaviour with HTTrack, I archived the same page with it. By default it converted the PHP script to an HTML page. That's logical, because HTTrack has some default extension/MIME mappings. So I removed the ".php to text/html" mapping and got a nice PNG image instead. I don't really know how to force it not to rename the script, but it doesn't really matter.

As I don't know anything about the wget sources, I can't tell how it works internally, but I guess it doesn't check the MIME types of resources linked from the "src" attribute of "img" elements. And that would be a bug... And I think some kind of RFC or spec should confirm it.

JM.
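
The renaming described above could be sketched roughly as follows. This is only an illustration in Python, not how wget or HTTrack actually work: the `MIME_TO_EXT` table, the `local_name` function and the example.org URL are all made up for the example, and a real tool would need a much longer mapping.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical, deliberately tiny MIME-type -> extension table.
MIME_TO_EXT = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "text/html": ".html",
}


def local_name(url, content_type):
    """Pick a local file name for `url` based on its Content-Type header."""
    # Last path segment of the URL; the query string is ignored.
    name = posixpath.basename(urlsplit(url).path) or "index"
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    ext = MIME_TO_EXT.get(mime)
    if ext is None:
        # Unknown type: keep the original name untouched.
        return name
    stem, _old_ext = posixpath.splitext(name)
    return stem + ext


# The script that serves a PNG gets saved under a ".png" name.
print(local_name("http://example.org/gen_png_image.php", "image/png"))
```

The point of the sketch is that the name is chosen from the type the server reports, not from the extension in the URL, so a script that returns "image/png" ends up saved as an image.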