Re: bug retrieving embedded images with --page-requisites
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes: > Hrvoje Niksic wrote: >> More precisely, it doesn't use the file name advertised by the >> Content-Disposition header. That is because Wget decides on the file >> name it will use based on the URL used, *before* the headers are >> downloaded. This unfortunate design decision is the cause of all >> these problems, and will take some work to be undone. > > Implementing the "Content-Disposition" header is on the TODO list : > > * Honor `Content-Disposition: XXX; filename="FILE"' when creating the > file name. If possible, try not to break `-nc' and friends when > doing that. It is, indeed -- I wrote that entry. :-) The problem is that implementing this is not as easy or straightforward as it sounds. This is shared by most TODO list items.
Re: bug retrieving embedded images with --page-requisites
Tony Lewis wrote:

> The --convert-links option changes the website path to a local file
> system path. That is, it changes the directory, not the file name.

Thanks, I didn't understand it that way.

> IMO, your suggestion has merit, but it would require wget to maintain
> a list of MIME types and corresponding renaming rules.

Well, it seems that handling the "Content-Type" header has been planned for a long time: there are two items about it in the "TODO" document of the wget distribution. Maintaining a list of MIME types is not an issue, as there are already lists around:

* "File suffixes and MIME types" at Duke University: http://www.duke.edu/websrv/file-extensions.html
* "MIME Types" category at Google: http://www.google.com/Top/Computers/Data_Formats/MIME_Types
* ...

Just a word about how HTTrack handles MIME types and extensions. It has a powerful "--assume" option that allows users to assign a MIME type to extensions. For example: "All .php files are PNG images". Everything is explained on the "Option panel : MIME Types" page at http://www.httrack.com/html/step9_opt11.html. I think wget could use such an option.

JM.
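To make the "--assume" idea concrete, here is a minimal Python sketch of a user-supplied extension-to-MIME-type table. It is not HTTrack's or wget's code: the function names `parse_assume` and `assumed_type` are hypothetical, and the rule syntax (`ext1,ext2=type;ext3=type`) is an assumption made for the sketch, so check the HTTrack page linked above for the real syntax.

```python
def parse_assume(rules):
    """Parse a rule string such as 'php,cgi=image/png;dat=application/zip'
    into a dict mapping each extension to its assigned MIME type."""
    mapping = {}
    for rule in rules.split(";"):
        rule = rule.strip()
        if not rule:
            continue
        extensions, sep, mime = rule.partition("=")
        if not sep or not mime.strip():
            raise ValueError(f"malformed rule: {rule!r}")
        for ext in extensions.split(","):
            # Accept 'php', '.php' and 'PHP' alike.
            ext = ext.strip().lstrip(".").lower()
            if ext:
                mapping[ext] = mime.strip().lower()
    return mapping


def assumed_type(filename, mapping, default=None):
    """Return the user-assigned MIME type for a file name, if any."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return default  # no extension at all
    return mapping.get(ext.lower(), default)
```

With the rule "php=image/png", `assumed_type("gen_png_image.php", ...)` returns "image/png", which is the "all .php files are PNG images" case described above.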
Re: bug retrieving embedded images with --page-requisites
Hrvoje Niksic wrote:

> More precisely, it doesn't use the file name advertised by the
> Content-Disposition header. That is because Wget decides on the file
> name it will use based on the URL used, *before* the headers are
> downloaded. This unfortunate design decision is the cause of all
> these problems, and will take some work to be undone.

Implementing the "Content-Disposition" header is on the TODO list :

* Honor `Content-Disposition: XXX; filename="FILE"' when creating the
  file name. If possible, try not to break `-nc' and friends when
  doing that.

JM.
RE: bug retrieving embedded images with --page-requisites
Jean-Marc MOLINA wrote:

> For example if a PNG image is generated using a "gen_png_image.php" PHP
> script, I think wget should be able to download it if the option
> "--page-requisites" is used, because it's part of the page and it's not
> an external resource, get its MIME type, "image/png", and using the
> option "--convert-links" should also rename the script-image to
> "gen_png_image.png".

The --convert-links option changes the website path to a local file system path. That is, it changes the directory, not the file name.

IMO, your suggestion has merit, but it would require wget to maintain a list of MIME types and corresponding renaming rules.

Tony
Re: bug retrieving embedded images with --page-requisites
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes: > As I don't know anything about wget sources, I can't tell how it > innerworks but I guess it doesn't check the MIME types of resources > linked from the "src" attribute of a "img" elements. And that would > be a bug... And I think some kind of RFC or spec should confirm it. More precisely, it doesn't use the file name advertised by the Content-Disposition header. That is because Wget decides on the file name it will use based on the URL used, *before* the headers are downloaded. This unfortunate design decision is the cause of all these problems, and will take some work to be undone.
Re: bug retrieving embedded images with --page-requisites
Gavin Sherlock wrote:

> i.e. the image is generated on the fly from a script, which then
> essentially prints the image back to the browser with the correct
> mime type. While this is a non-standard way to include an image on a
> page, the --page-requisites are not fulfilled when retrieving this
> web page.

I don't think you can consider this a "non-standard way". I'm sure there's a whole paragraph in an RFC (HTML 4.01 spec) about properly dealing with URIs, linked resources and MIME types.

For example, if a PNG image is generated using a "gen_png_image.php" PHP script, I think wget should be able to download it when the "--page-requisites" option is used, because it's part of the page and not an external resource. It should also get its MIME type, "image/png", and the "--convert-links" option should rename the script-image to "gen_png_image.png".

I tried the "--page-requisites" option and got my test page, at http://jmmolina.free.fr/t_39638/, perfectly archived. The original names are kept and the page is 100% browsable offline. The script name is still "gen_png_image.php". Then I used the "--convert-links" option to see if the script was renamed to a PNG image. It wasn't.

To compare this behaviour with HTTrack, I tried to archive the same page with it. By default it converted the PHP script to an HTML page. That's logical, because HTTrack has some default extension/MIME mappings. So I removed the ".php to text/html" mapping and got a nice PNG image instead. I don't really know how to force it not to rename the script, but it doesn't really matter.

As I don't know anything about the wget sources, I can't tell how it works internally, but I guess it doesn't check the MIME types of resources linked from the "src" attribute of an "img" element. And that would be a bug... And I think some kind of RFC or spec should confirm it.

JM.
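The renaming proposed above could look roughly like the following Python sketch. It is only an illustration, not wget or HTTrack behaviour: the function `rename_for_type` and the small `EXTENSION_FOR_TYPE` table are hypothetical, and a real tool would need a much fuller list of types.

```python
import posixpath

# Hypothetical, deliberately tiny table of MIME type -> preferred extension.
EXTENSION_FOR_TYPE = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "text/html": ".html",
}


def rename_for_type(filename, content_type):
    """Swap the file extension to match the served Content-Type, if known."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    extension = EXTENSION_FOR_TYPE.get(mime)
    if extension is None:
        return filename  # unknown type: leave the name alone
    stem, _old_extension = posixpath.splitext(filename)
    return stem + extension
```

Called as `rename_for_type("gen_png_image.php", "image/png")`, this yields "gen_png_image.png". Links in the saved page would have to be rewritten to the new name as well, or the archived copy would no longer be browsable offline.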
bug retrieving embedded images with --page-requisites
Hi,

The following seems to not be expected behavior:

wget --page-requisites --no-clobber --no-directories --no-host-directories --convert-links http://www.candidagenome.org/cgi-bin/locus.pl?locus=HWP1

Two of the images on that page do not get downloaded, and then the links within the page get converted to local viewing of them. The html that includes the images is actually somewhat tricky, e.g.:

i.e. the image is generated on the fly from a script, which then essentially prints the image back to the browser with the correct mime type. While this is a non-standard way to include an image on a page, the --page-requisites are not fulfilled when retrieving this web page.

hypha 111 > wget -V
GNU Wget 1.10.1

Copyright (C) 2005 Free Software Foundation, Inc.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

Originally written by Hrvoje Niksic <[EMAIL PROTECTED]>.

hypha 112 > uname -a
SunOS hypha 5.9 Generic_118558-06 sun4u sparc SUNW,Sun-Fire-V440

Many thanks for wget - it's an excellent tool,

Cheers,
Gavin

Gavin Sherlock
Dept. of Genetics
Center for Clinical Sciences Research
269 Campus Drive, Room 2255b, Stanford, CA 94305-5120
Tel: 650 498 6012
Fax: 650 724 3701