Re: bug retrieving embedded images with --page-requisites

Jean-Marc MOLINA Wed, 09 Nov 2005 07:30:22 -0800

Gavin Sherlock wrote:
> i.e. the image is generated on the fly from a script, which then
> essentially prints the image back to the browser with the correct
> mime type.  While this is a non-standard way to include an image on a
> page, the --page-requisites are not fulfilled when retrieving this
> web page.


I don't think you can consider this a "non-standard way". I'm sure there's a
whole paragraph in a spec (HTML 4.01, or one of the RFCs) about properly
dealing with URIs, linked resources and MIME types. For example, if a PNG
image is generated by a "gen_png_image.php" PHP script, I think wget should
download it when the "--page-requisites" option is used, because it is part
of the page and not an external resource. It should also get its MIME type,
"image/png", and the "--convert-links" option should then rename the
script-image to "gen_png_image.png".
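
The renaming idea above can be sketched in a few lines of Python. This is
only an illustration of the expected behaviour, not how wget or HTTrack are
implemented; the helper "local_name" and the "PREFERRED" table are
hypothetical names made up for this example.

```python
import mimetypes
import posixpath

# Hypothetical mapping for illustration; not taken from wget or HTTrack.
PREFERRED = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "text/html": ".html",
}


def local_name(url_path, content_type):
    """Return a local file name whose extension matches the MIME type."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";", 1)[0].strip().lower()
    ext = PREFERRED.get(mime) or mimetypes.guess_extension(mime)
    name = posixpath.basename(url_path)
    if not ext:
        return name  # unknown type: keep the original name
    stem, _old_ext = posixpath.splitext(name)
    return stem + ext


print(local_name("gen_png_image.php", "image/png"))  # gen_png_image.png
```

The extension in the URL is ignored on purpose: only the Content-Type
reported by the server decides the local extension.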

I tried the "--page-requisites" option and got my test page, at
http://jmmolina.free.fr/t_39638/, perfectly archived. The original names are
kept and the page is 100% browsable offline. The script name is still
"gen_png_image.php". Then I used the "--convert-links" option to see whether
the script would be renamed to a PNG image; it wasn't.

To compare this behaviour with HTTrack, I tried to archive the same page
with it. By default it converted the PHP script to an HTML page. That's
logical because HTTrack has some default ext/MIME mappings. So I removed the
".php to text/html" mapping and got a nice PNG image instead. I don't really
know how to force it not to rename the script, but it doesn't really matter.

As I don't know anything about the wget sources, I can't tell how it works
internally, but I guess it doesn't check the MIME types of resources linked
from the "src" attribute of "img" elements. That would be a bug, and I think
some RFC or spec should confirm it.
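
To make the point concrete, here is a minimal Python sketch of how a page
requisite can be found from the "src" attribute alone, whatever extension
the URL carries. It says nothing about how wget does it internally; the
class name "ImgSrcCollector" and the sample markup are assumptions made up
for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the "src" value of every "img" element in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


# Hypothetical markup standing in for the test page.
page = '<html><body><img src="gen_png_image.php" alt="generated"></body></html>'
collector = ImgSrcCollector()
collector.feed(page)
print(collector.sources)  # ['gen_png_image.php']
```

The ".php" name is picked up like any other image; its real type only
becomes known from the Content-Type header once the resource is fetched.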

JM.
