Re: bug retrieving embedded images with --page-requisites
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes: > Hrvoje Niksic wrote: >> More precisely, it doesn't use the file name advertised by the >> Content-Disposition header. That is because Wget decides on the file >> name it will use based on the URL used, *before* the headers are >> downloaded. This unfortunate design decision is the cause of all >> these problems, and will take some work to be undone. > > Implementing the "Content-Disposition" header is on the TODO list : > > * Honor `Content-Disposition: XXX; filename="FILE"' when creating the > file name. If possible, try not to break `-nc' and friends when > doing that. It is, indeed -- I wrote that entry. :-) The problem is that implementing this is not as easy or straightforward as it sounds. This is shared by most TODO list items.
Re: bug retrieving embedded images with --page-requisites
Tony Lewis wrote:

> The --convert-links option changes the website path to a local file
> system path. That is, it changes the directory, not the file name.

Thanks, I didn't understand it that way.

> IMO, your suggestion has merit, but it would require wget to maintain
> a list of MIME types and corresponding renaming rules.

Well, it seems that handling the "Content-Type" header has been planned for a long time, and there are two items about it in the "TODO" document of the wget distribution. Maintaining a list of MIME types is not an issue, as there are already lists around :

* "File suffixes and MIME types" at Duke University : http://www.duke.edu/websrv/file-extensions.html
* "MIME Types" category at Google : http://www.google.com/Top/Computers/Data_Formats/MIME_Types
* ...

Just a word about how HTTrack handles MIME types and extensions. It has a powerful "--assume" option that allows users to assign a MIME type to extensions. For example : "All .php files are PNG images". Everything is explained on the "Option panel : MIME Types" page at http://www.httrack.com/html/step9_opt11.html. I think wget could use such an option.

JM.
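To make the idea of user-assigned types concrete, here is a minimal Python sketch of an "assume"-style rule table. The rule syntax ("php=image/png,cgi=text/html") and the names parse_assume and assumed_type are invented for this illustration; they are not taken from HTTrack or wget.

```python
def parse_assume(spec):
    """Parse 'php=image/png,cgi=text/html' into {'.php': 'image/png', ...}.

    The rule syntax is hypothetical, chosen only for this sketch.
    """
    rules = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        ext, sep, mime = item.partition("=")
        if not sep or not ext.strip() or not mime.strip():
            raise ValueError(f"bad rule: {item!r}")
        # Normalise to a lower-case extension with exactly one leading dot.
        rules["." + ext.strip().lstrip(".").lower()] = mime.strip().lower()
    return rules


def assumed_type(file_name, rules, default=None):
    """Return the MIME type the user assigned to this file's extension."""
    dot = file_name.rfind(".")
    ext = file_name[dot:].lower() if dot > 0 else ""
    return rules.get(ext, default)


rules = parse_assume("php=image/png, cgi=text/html")
print(assumed_type("gen_png_image.php", rules))
```

With a table like this, the "All .php files are PNG images" case becomes a single user-supplied rule rather than something the tool has to guess.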
Re: bug retrieving embedded images with --page-requisites
Hrvoje Niksic wrote:

> More precisely, it doesn't use the file name advertised by the
> Content-Disposition header. That is because Wget decides on the file
> name it will use based on the URL used, *before* the headers are
> downloaded. This unfortunate design decision is the cause of all
> these problems, and will take some work to be undone.

Implementing the "Content-Disposition" header is on the TODO list :

* Honor `Content-Disposition: XXX; filename="FILE"' when creating the
  file name. If possible, try not to break `-nc' and friends when
  doing that.

JM.
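The ordering problem in the quoted text can be sketched in a few lines of Python. This is only an illustration, not Wget's code: name_from_url and name_from_headers are hypothetical helpers, and the "index.html" fallback is an assumption made for the sketch. The header is parsed with the standard library's email.message.Message.

```python
from email.message import Message
from posixpath import basename
from urllib.parse import urlsplit


def name_from_url(url):
    """File name chosen from the URL alone, before any headers are seen."""
    # Fallback name for directory-like URLs is an assumption of this sketch.
    return basename(urlsplit(url).path) or "index.html"


def name_from_headers(url, content_disposition=None):
    """Prefer the name advertised by Content-Disposition, if there is one."""
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        advertised = msg.get_filename()
        if advertised:
            # Drop any directory part so the server cannot pick the location.
            return basename(advertised)
    return name_from_url(url)


url = "http://example.com/gen_png_image.php?id=3"
print(name_from_url(url))
print(name_from_headers(url, 'inline; filename="chart.png"'))
```

The second function can only run once the response headers have arrived, which is why choosing the name up front from the URL rules it out.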
RE: bug retrieving embedded images with --page-requisites
Jean-Marc MOLINA wrote:

> For example if a PNG image is generated using a "gen_png_image.php" PHP
> script, I think wget should be able to download it if the option
> "--page-requisites" is used, because it's part of the page and it's not
> an external resource, get its MIME type, "image/png", and using the
> option "--convert-links" should also rename the script-image to
> "gen_png_image.png".

The --convert-links option changes the website path to a local file system path. That is, it changes the directory, not the file name.

IMO, your suggestion has merit, but it would require wget to maintain a list of MIME types and corresponding renaming rules.

Tony
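As a rough picture of what "a list of MIME types and corresponding renaming rules" might look like, here is a small Python sketch. The table EXTENSION_FOR_TYPE and the function local_name are hypothetical, and the table is deliberately tiny; nothing here reflects how wget behaves.

```python
from posixpath import splitext

# Hypothetical, deliberately tiny table; a real one would be much longer.
EXTENSION_FOR_TYPE = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "text/html": ".html",
}


def local_name(url_name, content_type):
    """Pick a local file name whose extension matches the Content-Type."""
    # Ignore parameters such as "; charset=utf-8".
    mime = content_type.split(";", 1)[0].strip().lower()
    wanted = EXTENSION_FOR_TYPE.get(mime)
    stem, ext = splitext(url_name)
    if wanted is None or ext.lower() == wanted:
        # Unknown type, or the name already fits: leave it alone.
        return url_name
    return stem + wanted


print(local_name("gen_png_image.php", "image/png"))
```

Any such renaming would also have to be reflected in the links rewritten inside the saved pages, otherwise the local copy would point at names that no longer exist.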
Re: bug retrieving embedded images with --page-requisites
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes: > As I don't know anything about wget sources, I can't tell how it > innerworks but I guess it doesn't check the MIME types of resources > linked from the "src" attribute of a "img" elements. And that would > be a bug... And I think some kind of RFC or spec should confirm it. More precisely, it doesn't use the file name advertised by the Content-Disposition header. That is because Wget decides on the file name it will use based on the URL used, *before* the headers are downloaded. This unfortunate design decision is the cause of all these problems, and will take some work to be undone.
Re: bug retrieving embedded images with --page-requisites
Gavin Sherlock wrote:

> i.e. the image is generated on the fly from a script, which then
> essentially prints the image back to the browser with the correct
> mime type. While this is a non-standard way to include an image on a
> page, the --page-requisites are not fulfilled when retrieving this
> web page.

I don't think you can consider this a "non-standard way". I'm sure there's a whole paragraph in an RFC (HTML 4.01 spec) about properly dealing with URIs, linked resources and MIME types.

For example, if a PNG image is generated using a "gen_png_image.php" PHP script, I think wget should be able to download it if the option "--page-requisites" is used, because it's part of the page and not an external resource. It should also get its MIME type, "image/png", and using the option "--convert-links" should rename the script-image to "gen_png_image.png".

I tried the "--page-requisites" option and got my test page, at http://jmmolina.free.fr/t_39638/, perfectly archived. The original names are kept and the page is 100% browsable offline. The script name is still "gen_png_image.php". Then I used the "--convert-links" option to see if the script was renamed to a PNG image; it wasn't.

To compare this behaviour with HTTrack, I tried to archive the same page with it. By default it converted the PHP script to an HTML page. That's logical because HTTrack has some default ext/MIME mappings. So I removed the ".php to text/html" mapping and got a nice PNG image instead. I don't really know how to force it not to rename the script, but it doesn't really matter.

As I don't know anything about the wget sources, I can't tell how it works internally, but I guess it doesn't check the MIME types of resources linked from the "src" attribute of "img" elements. And that would be a bug... And I think some kind of RFC or spec should confirm it.

JM.
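For readers who want to see what "resources linked from the src attribute of img elements" amounts to, here is a minimal Python sketch that collects them from a page. ImageCollector and image_requisites are names made up for this example, and the URL is a placeholder; the point is only that a script URL such as "gen_png_image.php" is found like any other image, whatever its extension.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs from the src attribute of img elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page URL.
                self.images.append(urljoin(self.base_url, src))


def image_requisites(html, base_url):
    collector = ImageCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.images


page = '<p><img src="gen_png_image.php" alt=""></p>'
print(image_requisites(page, "http://example.com/t/"))
```

The MIME type is not visible at this stage; it only becomes known from the Content-Type header once each collected URL is actually fetched.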