SSL site mirroring
Ok, either I've completely misread wget, or it has a problem mirroring SSL sites. It appears that it is deciding that the https:// scheme is something that is "not to be followed". For those interested, the offending code appears to be 3 lines in recur.c which, if changed, treat the HTTPS scheme the same way that the HTTP scheme is treated with respect to following links and in-line content:

Line 440: change to
    if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS

Line 449: change to
    if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)

Line 537: change to
    if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))

For those wanting a patch, one that can be applied to the 1.8.1 source distribution is attached.

Thomas

--- src/recur.c	Wed Dec 19 09:27:29 2001
+++ ../wget-1.8.1.esoft/src/recur.c	Sat Dec 29 16:17:40 2001
@@ -437,7 +437,7 @@
      the list.  */

   /* 1. Schemes other than HTTP are normally not recursed into. */
-  if (u->scheme != SCHEME_HTTP
+  if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS
       && !(u->scheme == SCHEME_FTP && opt.follow_ftp))
     {
       DEBUGP (("Not following non-HTTP schemes.\n"));
@@ -446,7 +446,7 @@

   /* 2. If it is an absolute link and they are not followed, throw it
      out.  */
-  if (u->scheme == SCHEME_HTTP)
+  if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)
     if (opt.relative_only && !upos->link_relative_p)
       {
         DEBUGP (("It doesn't really look like a relative link.\n"));
@@ -534,7 +534,7 @@
     }

   /* 8. */
-  if (opt.use_robots && u->scheme == SCHEME_HTTP)
+  if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))
     {
       struct robot_specs *specs = res_get_specs (u->host, u->port);
       if (!specs)

2001-12-29  Thomas Reinke  <[EMAIL PROTECTED]>

	* recur.c: Fixed scheme handling for https to allow proper
	following of links.
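For anyone wanting to try the change locally, the usual sequence would be roughly the one below. The patch file name is just a placeholder for the attachment, the host is only an example, and this assumes wget is configured with SSL support:

    $ patch -p0 < wget-https-recurse.patch
    $ ./configure --with-ssl && make
    $ ./src/wget -r https://secure.example.com/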
SSL sites fail to be crawled
It seems that SSL sites aren't crawled properly, because wget decides that the scheme is not to be followed. The offending code appears to be limited to only 3 lines located in recur.c (version 1.8.1):

Line 440: change to
    if (u->scheme != SCHEME_HTTP && u->scheme != SCHEME_HTTPS

Line 449: change to
    if (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS)

Line 537: change to
    if (opt.use_robots && (u->scheme == SCHEME_HTTP || u->scheme == SCHEME_HTTPS))

Thomas
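A quick way to see the symptom (the hostname below is only a placeholder) is to run a recursive fetch with debugging enabled and watch for the message printed by the first of those checks whenever an https:// link is refused:

    $ wget -d -r https://secure.example.com/ 2>&1 | grep "Not following non-HTTP schemes"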
Re: How to save what I see?
Hello ...

... try using the "-p, --page-requisites    get all images, etc. needed to display HTML page" option (and wget >= 1.6).

Bye

On Sat, 29 Dec 2001, Robin B. Lake wrote:

> I'm using wget to save a "tick chart" of a stock index each night.
>
>   wget -nH -q -O /QoI/working/CHARTS/$myday+OEX.html 'http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=%24OEX&sid=0&o_symb=%24OEX&x=60&y=15&freq=9&time=1'
>
> The Web site returns an image, whose HTML is:
>
>   <IMG SRC="http://chart.bigcharts.com/bc3/quickchart/chart.asp?symb=GE&compidx=a%3A0&ma=0&maval=9&uf=0&lf=1&lf2=0&lf3=0&type=2&size=2&state=8&sid=^C48&style=320&time=1&freq=9&nosettings=1&rand=6148&mocktick=1&rand=4692" BORDER="0" WIDTH="579" HEIGHT="335">
>
> (I had to break the line for my e-mail editor).
>
> What is saved by Wget is the HTML, so that when I go to get the saved image, say a week from now, what I get is the image for that future day, not for the day I saved!
>
> Is there a way to get Wget to save the image?
>
> Thanks,
> Robin Lake
> [EMAIL PROTECTED]
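In other words, drop -O and let wget pull the page together with its inline images into a directory. A possible invocation along those lines, adding -H because the <IMG> above points at a different host (chart.bigcharts.com) than the page itself, and using a per-day directory as just one way to organize the output:

    wget -p -H -nH -q -P /QoI/working/CHARTS/$myday 'http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=%24OEX&sid=0&o_symb=%24OEX&x=60&y=15&freq=9&time=1'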
How to save what I see?
I'm using wget to save a "tick chart" of a stock index each night.

  wget -nH -q -O /QoI/working/CHARTS/$myday+OEX.html 'http://bigcharts.marketwatch.com/quickchart/quickchart.asp?symb=%24OEX&sid=0&o_symb=%24OEX&x=60&y=15&freq=9&time=1'

The Web site returns an image, whose HTML is:

  <IMG SRC="http://chart.bigcharts.com/bc3/quickchart/chart.asp?symb=GE&compidx=a%3A0&ma=0&maval=9&uf=0&lf=1&lf2=0&lf3=0&type=2&size=2&state=8&sid=^C48&style=320&time=1&freq=9&nosettings=1&rand=6148&mocktick=1&rand=4692" BORDER="0" WIDTH="579" HEIGHT="335">

(I had to break the line for my e-mail editor).

What is saved by Wget is the HTML, so that when I go to get the saved image, say a week from now, what I get is the image for that future day, not for the day I saved!

Is there a way to get Wget to save the image?

Thanks,
Robin Lake
[EMAIL PROTECTED]
Re: Assertion failure in wget 1.8, recur.c:753
Thomas Reinke <[EMAIL PROTECTED]> writes:

> Neat... not sure that I really know enough to start digging to easily
> figure out what went wrong, but it can be reproduced by running the
> following:
>
> $ wget -d -r -l 5 -t 1 -T 30 -o x.lg -p -s -P dir -Q 500 --limit-rate=256000 -R mpg,mpeg http://www.netcraft.co.uk
>
> wget: recur.c:753: register_download: Assertion `!hash_table_contains
> (dl_url_file_map, url)' failed.
> Aborted (core dumped)

Please try version 1.8.1; I believe this bug has been fixed.
Re: recursive ftp via proxy problem in wget 1.8.1
"Jiang Wei" <[EMAIL PROTECTED]> writes:

> I tried to download a whole directory in a FTP site by using `-r -np'
> options, and I have to go through some firewall via http_proxy/ftp_proxy.
> But I failed: wget-1.8.1 only retrieved the first indexed ftp file list
> and stopped working, while wget-1.5.3 can download all files with the
> same options.
>
> I read some code of wget-1.8.1. Starting from line 806 of src/main.c,
> wget determines the url scheme and calls retrieve_tree() for HTTP but
> retrieve_url() for FTP, without consideration for proxied FTP's HTTP
> scheme.

You're right. We'll have to either move the decision whether to call retrieve_tree into retrieve_url, or move the proxy logic out of retrieve_url. I'll try to fix this when I have the time.
Re: Bug if current folder doesn't exist
Jean-Edouard BABIN <[EMAIL PROTECTED]> writes:

> I found a little bug when we download from a deleted directory:

[...]

Thanks for the report. I wouldn't consider it a real bug. Downloading things into a deleted directory is bound to produce all kinds of problems. The diagnostic message could perhaps be improved, but I don't consider the case of downloading into deleted directories to be all that frequent. The I/O code is always hard, and diagnostics will never be completely in sync with reality.
Re: Just a Question
Edward Manukovsky <[EMAIL PROTECTED]> writes:

> Excuse me, please, but I've got a question. I cannot set the retry
> timeout to 30 seconds by doing:
>
>   wget -w30 -T600 -c -b -t0 -S -alist.log -iurl_list

For me, Wget waits for 30 seconds between each retrieval. What version are you using?
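One possible source of the confusion, offered as a guess rather than a diagnosis: -w/--wait (what -w30 sets) is the pause between successive retrievals, while the pause between retries of a single failed download is controlled separately, via --waitretry in versions that support it. Something like the following, assuming such a version:

    wget --wait=30 --waitretry=30 -T600 -c -b -t0 -S -alist.log -iurl_list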
Re: [Wget]: Bug submission
[ Please mail bug reports to <[EMAIL PROTECTED]>, not to me directly. ]

Nuno Ponte <[EMAIL PROTECTED]> writes:

> I get a segmentation fault when invoking:
>
>   wget -r http://java.sun.com/docs/books/performance/1st_edition/html/JPTOC.fm.html
>
> My Wget version is 1.7-3, the one which is bundled with RedHat 7.2.
> I attached my .wgetrc.

Wget 1.7 is fairly old -- it was followed by a bugfix 1.7.1 release, and then 1.8 and 1.8.1. Please try upgrading to the latest version, 1.8.1, and see if the bug repeats. I couldn't repeat it with 1.8.1.