SSL site mirroring

2001-12-29 Thread Thomas Reinke

OK, either I've completely misread wget, or it has a problem
mirroring SSL sites: it appears to decide that the https://
scheme is "not to be followed".

For those interested, the offending code appears to be three lines
in recur.c which, if changed, treat the HTTPS scheme the same way
as the HTTP scheme with respect to following links and in-line
content:

  Line 440: change to
     if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS

  Line 449: change to
     if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)

  Line 537: change to
     if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))

For those wanting a patch, one that can be applied to the 1.8.1
source distribution is attached.

Thomas

--- src/recur.c Wed Dec 19 09:27:29 2001
+++ ../wget-1.8.1.esoft/src/recur.c Sat Dec 29 16:17:40 2001
@@ -437,7 +437,7 @@
          the list.  */
 
       /* 1. Schemes other than HTTP are normally not recursed into. */
-      if (u->scheme != SCHEME_HTTP
+      if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS
           && !(u->scheme == SCHEME_FTP && opt.follow_ftp))
         {
           DEBUGP (("Not following non-HTTP schemes.\n"));
@@ -446,7 +446,7 @@
 
       /* 2. If it is an absolute link and they are not followed, throw it
          out.  */
-      if (u->scheme == SCHEME_HTTP)
+      if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)
         if (opt.relative_only && !upos->link_relative_p)
           {
             DEBUGP (("It doesn't really look like a relative link.\n"));
@@ -534,7 +534,7 @@
           }
 
       /* 8. */
-      if (opt.use_robots && u->scheme == SCHEME_HTTP)
+      if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))
         {
           struct robot_specs *specs = res_get_specs (u->host, u->port);
           if (!specs)
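
For anyone applying the change by hand, a patch like the one above
would typically be applied from the top of the 1.8.1 source tree
roughly as follows (a sketch; the file name wget-https.patch is a
placeholder, and the path in the +++ header may need adjusting to
match your tree):

   cd wget-1.8.1
   patch -p0 < wget-https.patch      # updates src/recur.c
   ./configure --with-ssl && make    # rebuild with SSL support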


2001-12-29  Thomas Reinke  <[EMAIL PROTECTED]>

* recur.c: Fixed scheme handling for HTTPS to allow proper
following of links.





Re: How to save what I see?

2001-12-29 Thread Stefan Bender

Hello,

try using the
"-p, --page-requisites  get all images, etc. needed to display HTML page"
option; it requires wget >= 1.6.
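
For example, something like this might work for the chart page
quoted below (a sketch: -P replaces -O because --page-requisites
saves several files, and -H is added because the chart image lives
on a different host, chart.bigcharts.com):

   wget -p -H -nH -q -P /QoI/working/CHARTS/$myday \
     'http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=%24OEX&sid=0&o_symb=%24OEX&x=60&y=15&freq=9&time=1'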

Bye


On Sat, 29 Dec 2001, Robin B. Lake wrote:

> I'm using wget to save a "tick chart" of a stock index each night.
> 
> wget -nH -q -O /QoI/working/CHARTS/$myday+OEX.html \
>   'http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=%24OEX&sid=0&o_symb=%24OEX&x=60&y=15&freq=9&time=1'
> 
> The Web site returns an image, whose HTML is:
> <IMG SRC="http://chart.bigcharts.com/bc3/quickchart/chart.asp?symb=GE&compidx=a%3A0&ma=0&maval=9&uf=0&lf=1&lf2=0&lf3=0&type=2&size=2&state=8&sid=^C48
> &style=320&time=1&freq=9&nosettings=1&rand=6148&mocktick=1&rand=4692" BORDER="0"
> WIDTH="579" HEIGHT="335">
> 
> (I had to break the line for my e-mail editor).
> 
> What Wget saves is the HTML, so when I go to view the saved image,
> say a week from now, what I get is the image for that later day,
> not for the day I saved it!
> 
> Is there a way to get Wget to save the image?
> 
> Thanks,
> Robin Lake
> [EMAIL PROTECTED]
> 
> 




How to save what I see?

2001-12-29 Thread Robin B. Lake

I'm using wget to save a "tick chart" of a stock index each night.

wget -nH -q -O /QoI/working/CHARTS/$myday+OEX.html \
  'http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=%24OEX&sid=0&o_symb=%24OEX&x=60&y=15&freq=9&time=1'

The Web site returns an image, whose HTML is:
<IMG SRC="http://chart.bigcharts.com/bc3/quickchart/chart.asp?symb=GE&compidx=a%3A0&ma=0&maval=9&uf=0&lf=1&lf2=0&lf3=0&type=2&size=2&state=8&sid=^C48
&style=320&time=1&freq=9&nosettings=1&rand=6148&mocktick=1&rand=4692" BORDER="0"
WIDTH="579" HEIGHT="335">

(I had to break the line for my e-mail editor).

What Wget saves is the HTML, so when I go to view the saved image,
say a week from now, what I get is the image for that later day,
not for the day I saved it!

Is there a way to get Wget to save the image?

Thanks,
Robin Lake
[EMAIL PROTECTED]




Re: Assertion failure in wget 1.8, recur.c:753

2001-12-29 Thread Hrvoje Niksic

Thomas Reinke <[EMAIL PROTECTED]> writes:

> Neat... not sure that I really know enough to start digging in and
> easily figure out what went wrong, but it can be reproduced by
> running the following:
> 
> $ wget -d -r -l 5 -t 1 -T 30 -o x.lg -p -s -P dir -Q 500 
> --limit-rate=256000 -R mpg,mpeg http://www.netcraft.co.uk
> 
> wget: recur.c:753: register_download: Assertion `!hash_table_contains 
> (dl_url_file_map, url)' failed. Aborted (core dumped)

Please try version 1.8.1; I believe this bug has been fixed.



Re: recursive ftp via proxy problem in wget 1.8.1

2001-12-29 Thread Hrvoje Niksic

"Jiang Wei" <[EMAIL PROTECTED]> writes:

> I tried to download a whole directory on an FTP site using the
> `-r -np' options, going through a firewall via http_proxy/ftp_proxy.
> But it failed: wget-1.8.1 retrieved only the first FTP file listing
> and stopped, while wget-1.5.3 can download all the files with the
> same options.
> 
> I read some of the wget-1.8.1 code. Starting at line 806 of
> src/main.c, wget determines the URL scheme and calls retrieve_tree()
> for HTTP but retrieve_url() for FTP, without considering that
> proxied FTP goes over the HTTP scheme.

You're right.  We'll have to either move the decision whether to call
retrieve_tree into retrieve_url, or move the proxy logic out of
retrieve_url.

I'll try to fix this when I have the time.
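
For reference, the reported behaviour should be reproducible with
something like the following (a sketch; the proxy and FTP hosts are
placeholders):

   export http_proxy=http://proxy.example.com:8080/
   export ftp_proxy=http://proxy.example.com:8080/
   wget -r -np ftp://ftp.example.com/pub/somedir/
   # 1.8.1 fetches only the first directory listing; 1.5.3 recurses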



Re: Bug if current folder doesn't exist

2001-12-29 Thread Hrvoje Niksic

Jean-Edouard BABIN <[EMAIL PROTECTED]> writes:

> I found a little bug when downloading from a deleted directory:
[...]

Thanks for the report.

I wouldn't consider it a real bug.  Downloading things into a deleted
directory is bound to produce all kinds of problems.

The diagnostic message could perhaps be improved, but I don't consider
the case of downloading into deleted directories to be all that
frequent.  The IO code is always hard, and diagnostics will never be
completely in sync with reality.



Re: Just a Question

2001-12-29 Thread Hrvoje Niksic

Edward Manukovsky <[EMAIL PROTECTED]> writes:

> Excuse me, please, but I've got a question: I cannot set a retry
> timeout of 30 seconds by doing:
> wget -w30 -T600 -c -b -t0 -S -alist.log -iurl_list

For me, Wget waits for 30 seconds between each retrieval.  What
version are you using?



Re: [Wget]: Bug submission

2001-12-29 Thread Hrvoje Niksic

[ Please mail bug reports to <[EMAIL PROTECTED]>, not to me directly. ]

Nuno Ponte <[EMAIL PROTECTED]> writes:

> I get a segmentation fault when invoking:
> 
> wget -r
> http://java.sun.com/docs/books/performance/1st_edition/html/JPTOC.fm.html
> 
> My Wget version is 1.7-3, the one which is bundled with RedHat
> 7.2. I attached my .wgetrc.

Wget 1.7 is fairly old -- it was followed by a bugfix 1.7.1 release,
and then 1.8 and 1.8.1.  Please try upgrading to the latest version,
1.8.1, and see if the bug repeats.  I couldn't repeat it with 1.8.1.



subscribe please

2001-12-29 Thread Alexander Auinger