Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Hrvoje Niksic
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>> More precisely, it doesn't use the file name advertised by the
>> Content-Disposition header.  That is because Wget decides on the file
>> name it will use based on the URL used, *before* the headers are
>> downloaded.  This unfortunate design decision is the cause of all
>> these problems, and will take some work to be undone.
>
> Implementing the "Content-Disposition" header is on the TODO list :
>
> * Honor `Content-Disposition: XXX; filename="FILE"' when creating the
>   file name.  If possible, try not to break `-nc' and friends when
>   doing that.

It is, indeed -- I wrote that entry.  :-)  The problem is that
implementing this is not as easy or straightforward as it sounds.
That is true of most TODO list items.


Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Jean-Marc MOLINA
Tony Lewis wrote:
> The --convert-links option changes the website path to a local file
> system path. That is, it changes the directory, not the file name.

Thanks, I didn't understand it that way.

> IMO, your suggestion has merit, but it would require wget to maintain
> a list of MIME types and corresponding renaming rules.

Well, it seems that support for the "Content-Type" header has been planned
for a long time, and there are two items about it in the "TODO" document of
the wget distribution.

Maintaining a list of MIME types is not an issue as there are already lists
around :
* "File suffixes and MIME types" at Duke University :
http://www.duke.edu/websrv/file-extensions.html
* "MIME Types" category at Google :
http://www.google.com/Top/Computers/Data_Formats/MIME_Types
* ...

Just a word about how HTTrack handles MIME types and extensions. It has a
powerful "--assume" option that allows users to assign a MIME type to
extensions. For example : "All .php files are PNG images". Everything is
explained on the "Option panel : MIME Types" page at
http://www.httrack.com/html/step9_opt11.html. I think wget could use such an
option.
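
To make the idea concrete, here is a minimal sketch of what such an option
could look like. It is not HTTrack's or wget's actual code: the
"ext=mime/type" rule syntax and the names parse_assume_rules and
assumed_type are assumptions made up for this example.

```python
def parse_assume_rules(spec):
    """Parse a comma-separated 'ext=mime/type' list into a dict.

    The rule syntax is hypothetical; it only mirrors the idea of letting
    the user say "all .php files are PNG images".
    """
    rules = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        ext, sep, mime = item.partition("=")
        if not sep or not ext.strip() or not mime.strip():
            raise ValueError(f"bad rule: {item!r}")
        # Store extensions without the leading dot, lower-cased.
        rules[ext.strip().lstrip(".").lower()] = mime.strip().lower()
    return rules


def assumed_type(filename, rules, default=None):
    """Return the MIME type assumed for filename, or default if no rule matches."""
    _, dot, ext = filename.rpartition(".")
    if not dot:
        return default  # no extension at all
    return rules.get(ext.lower(), default)
```

With the rule "php=image/png", assumed_type("gen_png_image.php", rules)
gives "image/png", while a name without an extension falls back to the
default.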

JM.





Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Jean-Marc MOLINA
Hrvoje Niksic wrote:
> More precisely, it doesn't use the file name advertised by the
> Content-Disposition header.  That is because Wget decides on the file
> name it will use based on the URL used, *before* the headers are
> downloaded.  This unfortunate design decision is the cause of all
> these problems, and will take some work to be undone.

Implementing the "Content-Disposition" header is on the TODO list :

* Honor `Content-Disposition: XXX; filename="FILE"' when creating the
  file name.  If possible, try not to break `-nc' and friends when
  doing that.

JM.





RE: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Tony Lewis
Jean-Marc MOLINA wrote:

> For example if a PNG image is generated using a "gen_png_image.php" PHP
> script, I think wget should be able to download it if the option
> "--page-requisites" is used, because it's part of the page and it's not
> an external resource, get its MIME type, "image/png", and using the
> option "--convert-links" should also rename the script-image to
> "gen_png_image.png".

The --convert-links option changes the website path to a local file system
path. That is, it changes the directory, not the file name. IMO, your
suggestion has merit, but it would require wget to maintain a list of MIME
types and corresponding renaming rules.
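
As a rough illustration of what "a list of MIME types and corresponding
renaming rules" could mean in practice, here is a small sketch. The table
and the function name rename_for_type are assumptions for this example
only; wget has no such function, and a real list would be much longer.

```python
import posixpath

# Small illustrative table; a real tool would need a much fuller list.
EXTENSION_FOR_TYPE = {
    "image/png": ".png",
    "image/jpeg": ".jpg",
    "image/gif": ".gif",
    "text/html": ".html",
}


def rename_for_type(filename, content_type):
    """Swap the extension of filename for the one matching content_type."""
    # Drop parameters such as '; charset=utf-8' and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    new_ext = EXTENSION_FOR_TYPE.get(mime)
    if new_ext is None:
        return filename  # unknown type: leave the name alone
    stem, _old_ext = posixpath.splitext(filename)
    return stem + new_ext
```

Leaving unknown types untouched keeps the behaviour predictable: a file is
only renamed when the table has an explicit rule for its type.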

Tony




Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Hrvoje Niksic
"Jean-Marc MOLINA" <[EMAIL PROTECTED]> writes:

> As I don't know anything about wget sources, I can't tell how it
> works internally, but I guess it doesn't check the MIME types of
> resources linked from the "src" attribute of "img" elements. And that
> would be a bug... And I think some kind of RFC or spec should confirm it.

More precisely, it doesn't use the file name advertised by the
Content-Disposition header.  That is because Wget decides on the file
name it will use based on the URL used, *before* the headers are
downloaded.  This unfortunate design decision is the cause of all
these problems, and will take some work to be undone.
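
The ordering problem can be sketched as follows. This is only an
illustration of the two moments at which a name can be chosen, not Wget's
actual code: the function names and the example URL are assumptions, and
the URL-based naming is a simplified stand-in for what Wget really does.

```python
import posixpath
from email.message import Message
from urllib.parse import urlsplit


def name_from_url(url):
    """Name derived from the URL alone, available before any request is made."""
    name = posixpath.basename(urlsplit(url).path)
    return name or "index.html"


def name_from_headers(url, content_disposition=None):
    """Name chosen only after the response headers have arrived."""
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        advertised = msg.get_filename()
        if advertised:
            # Keep only the last component so a server cannot pick a directory.
            advertised = posixpath.basename(advertised.replace("\\", "/"))
            if advertised not in ("", ".", ".."):
                return advertised
    # No usable header: fall back to the URL-based name.
    return name_from_url(url)
```

For a URL ending in "gen_png_image.php?id=1", name_from_url can only say
"gen_png_image.php"; a name such as "chart.png" becomes known only once a
header like `Content-Disposition: inline; filename="chart.png"` has been
read, which is why the decision has to move after the header download.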


Re: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Jean-Marc MOLINA
Gavin Sherlock wrote:
> i.e. the image is generated on the fly from a script, which then
> essentially prints the image back to the browser with the correct
> mime type.  While this is a non-standard way to include an image on a
> page, the --page-requisites are not fulfilled when retrieving this
> web page.

I don't think you can consider this a "non-standard way". I'm sure there's a
whole paragraph in an RFC or in the HTML 4.01 spec about properly dealing with
URIs, linked resources and MIME types. For example, if a PNG image is generated
using a "gen_png_image.php" PHP script, I think wget should be able to
download it if the option "--page-requisites" is used, because it's part of
the page and it's not an external resource, get its MIME type, "image/png",
and using the option "--convert-links" should also rename the script-image
to "gen_png_image.png".

I tried the "--page-requisites" option and got my test page, at
http://jmmolina.free.fr/t_39638/, perfectly archived. The original names are
kept and the page is 100% browsable offline. The script name is still
"gen_png_image.php". Then I used the "--convert-links" option to see whether
the script would be renamed to a PNG image; it wasn't.

To compare this behaviour with HTTrack, I tried to archive the same page
with it. By default it converted the PHP script to an HTML page. That is
logical, because HTTrack has some default ext/MIME mappings. So I removed the
".php to text/html" mapping and got a nice PNG image instead. I don't really
know how to force it not to rename the script, but it doesn't really matter.

As I don't know anything about wget sources, I can't tell how it works
internally, but I guess it doesn't check the MIME types of resources linked
from the "src" attribute of "img" elements. And that would be a bug... And I
think some kind of RFC or spec should confirm it.
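
For reference, collecting the resources named by "src" attributes is the
easy part; the URL alone says nothing reliable about the type of what comes
back. A minimal sketch, assuming nothing about wget's own parser (the class
and function names are made up for this example):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgSrcCollector(HTMLParser):
    """Collect the absolute URLs found in the src attribute of img elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "img":
            return
        for name, value in attrs:
            if name == "src" and value:
                self.sources.append(urljoin(self.base_url, value))


def image_requisites(html, base_url):
    """Return the image URLs a page needs, in document order."""
    collector = ImgSrcCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.sources
```

A URL such as ".../gen_png_image.php?id=1" ends up in the list like any
other image; whether it really is "image/png" can only be learned from the
Content-Type header of the response.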

JM.





bug retrieving embedded images with --page-requisites

2005-10-05 Thread Gavin Sherlock

Hi,

The following does not seem to be the expected behavior:

wget --page-requisites --no-clobber --no-directories --no-host-directories --convert-links http://www.candidagenome.org/cgi-bin/locus.pl?locus=HWP1


Two of the images on that page do not get downloaded, and then the  
links within the page get converted to local viewing of them.  The  
html that includes the images is actually somewhat tricky, e.g.:





i.e. the image is generated on the fly from a script, which then  
essentially prints the image back to the browser with the correct  
mime type.  While this is a non-standard way to include an image on a  
page, the --page-requisites are not fulfilled when retrieving this  
web page.


hypha 111 > wget -V
GNU Wget 1.10.1

Copyright (C) 2005 Free Software Foundation, Inc.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Originally written by Hrvoje Niksic <[EMAIL PROTECTED]>.

hypha 112 > uname -a
SunOS hypha 5.9 Generic_118558-06 sun4u sparc SUNW,Sun-Fire-V440

Many thanks for wget - it's an excellent tool,

Cheers,
Gavin


Gavin Sherlock
Dept. of Genetics
Center for Clinical Sciences Research
269 Campus Drive,
Room 2255b,
Stanford,
CA 94305-5120

Tel: 650 498 6012
Fax: 650 724 3701