russian codepage in wget 1.10.1
I'll be brief:

-( |ver )-
Microsoft Windows XP [Version 5.1.2600]
-( |ver )-

-( |mode CON: CP /STATUS )-
Status for device CON:
--
Code page: 866
-( |mode CON: CP /STATUS )-

-( |"C:\Program Files\GnuWin32\bin\wget.exe" )-
wget: эх търчрэю URL
┬шъюЁшёЄрээ : wget [╧└╨└╠┼?╨]... [URL]...
?яЁюсєщЄх `wget --help' фы юЄЁшьрээ фхЄры?эю┐ │эЇюЁьрI│┐.
-( |"C:\Program Files\GnuWin32\bin\wget.exe" )-

-- SteelRat
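The output above looks like Cyrillic message-catalog text rendered under the wrong console codepage: the console reports CP866, while the messages appear to be encoded for a different Cyrillic codepage (likely CP1251). A workaround sketch rather than a fix in wget itself; whether this particular GnuWin32 build honors the gettext environment variables is an assumption:

    REM check the active codepage, then switch it to match the message catalog
    REM (a TrueType console font such as Lucida Console may be required for 1251)
    chcp
    chcp 1251
    "C:\Program Files\GnuWin32\bin\wget.exe" --help

    REM or drop localization for this session and fall back to English messages
    set LC_ALL=C
    "C:\Program Files\GnuWin32\bin\wget.exe" --help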
Re: I got one bug on Mac OS X
Thanks, then I am sure that is a Mac OS X Tiger specific problem.

On Jul 15, 2006, at 7:48 PM, Steven P. Ulrick wrote:

On Sat, 15 Jul 2006 16:36:54 -0700 "Tony Lewis" <[EMAIL PROTECTED]> wrote:

I don't think that's valid HTML. According to RFC 1866: "An HTML user agent should treat end of line in any of its variations as a word space in all contexts except preformatted text." I don't see any provision for an end of line within the HREF attribute of an A tag.

Tony

From: HUAZHANG GUO [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 11, 2006 7:48 AM
To: [EMAIL PROTECTED]
Subject: I got one bug on Mac OS X

Dear Sir/Madam,

While I was trying to download using the command:

wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/

I got most of the files, but lost some of them. I think I know where the problem is: if the link is broken into two lines in the index.html:

Lecture 1 (Jan 17): Exploring Conformational Space for Biomolecules
http://dasher.wustl.edu/bio5476/lectures
/lecture-01.pdf">[PDF]

I get the following error message:

--09:13:16-- http://dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf
=> `/Users/hguo/mywww//dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf'
Connecting to dasher.wustl.edu[128.252.208.48]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
09:13:16 ERROR 404: Not Found.

Please note that wget adds a special character '%0A' to the URL. Maybe the Windows newline has an extra character which is not recognized by Mac wget. I am using Mac OS X Tiger (Darwin).

Hello,

I tested the following command: "wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/" on Fedora Core 5, using wget-1.10.2-3.2.1. I don't know if I got every file or not (since I know nothing about the link that I downloaded), but I did get the file referred to in your original post: lecture-01.pdf. Here is a link to the full output of wget: http://www.afolkey2.net/wget.txt and here is the output for the file that you mentioned as an example:

--19:32:16-- http://dasher.wustl.edu/bio5476/lectures/lecture-01.pdf
Reusing existing connection to dasher.wustl.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 1755327 (1.7M) [application/pdf]
Saving to: `dasher.wustl.edu/bio5476/lectures/lecture-01.pdf'
1700K .. 100% 462K=3.9s
19:32:20 (438 KB/s) - `dasher.wustl.edu/bio5476/lectures/lecture-01.pdf' saved [1755327/1755327]

For everyone's information, I saw that the link was split into two lines just like the OP described. The difference between his experience and mine, though, was that the file with a split URL that he used as an example was downloaded just fine when I tried it. It appears that every PDF with "lecture-" at the beginning of its name has a multi-line URL on the original index.html. In my experiment, wget downloaded 25 PDF files that had split (multi-line) URLs, which appears to be all of them that are linked on the index.html page.

Steven P. Ulrick
--
19:28:50 up 12 days, 23:26, 2 users, load average: 0.84, 0.86, 0.79
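A minimal way to reproduce what the original poster describes, assuming Python 3 (for a throwaway local web server) and the wget build under test; the file names, port and paths below are made up for the experiment:

    mkdir -p lectures && echo dummy > lectures/lecture-01.pdf
    printf '<a href="lectures\n/lecture-01.pdf">[PDF]</a>\n' > index.html
    python3 -m http.server 8080 &
    wget -r -np http://localhost:8080/index.html

An affected build percent-encodes the embedded newline and requests /lectures%0A/lecture-01.pdf, which the server answers with a 404; the Fedora wget 1.10.2 build in the thread evidently strips the line break and fetches lectures/lecture-01.pdf.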
Re: Using --spider to check for dead links?
Hi,

First of all, thanks for the quick answer! :)

On 18.07.2006 17:34, Mauro Tortonesi wrote:

Stefan Melbinger wrote:
I need to check whole websites for dead links, with output that is easy to parse for lists of dead links, statistics, etc... Does anybody have experience with that problem, or has maybe used the --spider mode for this before (as suggested by some pages)?

> historically, wget never really supported recursive --spider mode. fortunately, this has been fixed in 1.11-alpha-1:

How will wget react when started in recursive --spider mode? It will have to download, parse and delete/forget HTML pages in order to know where to go, but what happens with images and large files like videos, for example? Will wget check whether they exist?

Thanks a lot,
Stefan

PS: The background for my question is that my company wants to check large websites for dead links (without using any commercial software). Hours of Google searching left me with wget, which seems to have the best foundation for doing this...
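For the dead-link use case, a rough post-processing sketch once a recursive --spider run is possible (1.11-alpha-1 or later); the log is just wget's normal progress output, so grepping for 404s is an approximation rather than an official interface, and the URL is a placeholder:

    wget --spider -r -np -o spider.log http://www.example.com/
    grep -B 3 'ERROR 404' spider.log    # show each failing request with a little context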
Re: Using --spider to check for dead links?
Stefan Melbinger wrote:

Hello, I need to check whole websites for dead links, with output that is easy to parse for lists of dead links, statistics, etc... Does anybody have experience with that problem or has maybe used the --spider mode for this before (as suggested by some pages)? For this to work, all HTML pages would have to be parsed completely, while pictures and other files should only be HEAD-checked for existence (in order to save bandwidth)... Using --spider and --spider -r was not the right way to do this, I fear. Any help is appreciated, thanks in advance!

hi stefan,

historically, wget never really supported recursive --spider mode. fortunately, this has been fixed in 1.11-alpha-1:

http://www.mail-archive.com/wget@sunsite.dk/msg09071.html

so, it will be included in the upcoming 1.11 release.

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
Using --spider to check for dead links?
Hello,

I need to check whole websites for dead links, with output that is easy to parse for lists of dead links, statistics, etc... Does anybody have experience with that problem, or has maybe used the --spider mode for this before (as suggested by some pages)?

For this to work, all HTML pages would have to be parsed completely, while pictures and other files should only be HEAD-checked for existence (in order to save bandwidth)... Using --spider and --spider -r was not the right way to do this, I fear.

Any help is appreciated, thanks in advance!

Greets,
Stefan Melbinger
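For a single URL over HTTP, --spider already behaves roughly as described: it issues a HEAD request and reports whether the resource exists without downloading the body. A small sketch (the URL is a placeholder):

    wget --spider http://www.example.com/videos/lecture.mp4
    echo $?    # typically 0 when the file is reported to exist, non-zero on a 404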
Re: wget 1.11 alpha1 - content disposition filename
Quoting Hrvoje Niksic <[EMAIL PROTECTED]>:

> Jochen Roderburg <[EMAIL PROTECTED]> writes:
>
> > E.g., a file which was supposed to have the name B&W.txt came with the header:
> > Content-Disposition: attachment; filename=B&W.txt;
> > All programs I tried (the new wget and several browsers and my own script ;-)
> > seemed to stop parsing at the first semicolon and produced the filename B&.
>
> Unfortunately, if it doesn't work in web browsers, how can it be
> expected to work in Wget? The server-side software should be fixed.

I mainly wanted to hear from some "HTTP/HTML experts" that I was correct in my assumption that the problem here is on the server side ;-)

Thank you, Mauro and Hrvoje, for confirming that.

Regards,
J. Roderburg
Re: "login incorrect"
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>> Gisle Vanem <[EMAIL PROTECTED]> writes:
>>
>>> Kinda misleading that wget prints "login incorrect" here. Why
>>> couldn't it just print the 530 message?
>>
>> You're completely right. It was an ancient design decision made by me
>> when I wasn't thinking enough (or was thinking the wrong thing).
>
> hrvoje, are you suggesting to extend ftp_login in order to return
> both an error code and an error message?

I didn't have an implementation strategy in mind, but extending ftp_login sounds like a good idea.
Re: "login incorrect"
Hrvoje Niksic wrote:

Gisle Vanem <[EMAIL PROTECTED]> writes:
Kinda misleading that wget prints "login incorrect" here. Why couldn't it just print the 530 message?

You're completely right. It was an ancient design decision made by me when I wasn't thinking enough (or was thinking the wrong thing).

hrvoje, are you suggesting to extend ftp_login in order to return both an error code and an error message?

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
Re: wget 1.11 alpha1 - content disposition filename
Jochen Roderburg <[EMAIL PROTECTED]> writes:

> E.g., a file which was supposed to have the name B&W.txt came with the header:
> Content-Disposition: attachment; filename=B&W.txt;
> All programs I tried (the new wget and several browsers and my own script ;-)
> seemed to stop parsing at the first semicolon and produced the filename B&.

Unfortunately, if it doesn't work in web browsers, how can it be expected to work in Wget? The server-side software should be fixed.
Re: Wishlist: support the file:/// protocol
David wrote:

In replies to the post requesting support of the “file://” scheme, requests were made for someone to provide a compelling reason to want to do this. Perhaps the following is such a reason.

hi david,

thank you for your interesting example. support for the “file://” scheme will very likely be introduced in wget 1.12.

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
Re: wget 1.11 alpha1 - content disposition filename
Jochen Roderburg wrote:

Hi,

I was happy to see that a long-missed feature has now been implemented in this alpha, namely the interpretation of the filename in the Content-Disposition header. Just recently I had hacked a little script together to achieve this, when I wanted to download a larger number of files where this was used ;-)

I had a few cases, however, which did not come out as expected, but I think the error is this time in the sending web application and not in wget. E.g., a file which was supposed to have the name B&W.txt came with the header:

Content-Disposition: attachment; filename=B&W.txt;

the error is definitely in the web application. the correct header would be:

Content-Disposition: attachment; filename="B&W.txt";

All programs I tried (the new wget and several browsers and my own script ;-) seemed to stop parsing at the first semicolon and produced the filename B&. Any thoughts?

i think that the filename parsing heuristics currently implemented in wget are fine. you really can't do much better in this case.

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
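One quick way to see exactly what Content-Disposition header a misbehaving server sends is to ask wget for the raw response headers; a sketch with a placeholder URL (some servers only attach the header to GET responses, in which case --spider has to be dropped):

    wget -S --spider 'http://www.example.com/download.php?id=42' 2>&1 | grep -i 'Content-Disposition'

A well-formed header quotes any filename containing ';', '&' or spaces, i.e. the filename="B&W.txt" form shown above.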
Re: Documentation (manpage) "bug"
Linda Walsh wrote:

FYI: On the manpage, where it talks about "no-proxy", the manpage says:

--no-proxy
    Don't use proxies, even if the appropriate *_proxy environment
    variable is defined. For more information about the use of proxies
    with Wget,
              ^
-Q quota

Note -- the sentence referring to "more information about the use of proxies" stops in the middle of saying anything and starts with "-Q quota".

fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
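For reference, the behaviour the truncated paragraph describes: a proxy picked up from the environment is bypassed for a single run when --no-proxy is given. A small usage sketch; the proxy host and URL are placeholders:

    export http_proxy=http://proxy.example.com:3128/
    wget --no-proxy http://intranet.example.com/report.pdf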
Re: Excluding directories
Post, Mark K wrote:

I'm trying to download parts of the SUSE Linux 10.1 tree. I'm going after things below http://suse.mirrors.tds.net/pub/suse/update/10.1/, but I want to exclude several directories in http://suse.mirrors.tds.net/pub/suse/update/10.1/rpm/. In that directory are the following subdirectories:

i586/ i686/ noarch/ ppc/ ppc64/ src/ x86_64/

I only want the i586, i686, and noarch directories. I tried using the -X parameter, but it only seems to work if I specify "-X /pub/suse/update/10.1/rpm/ppc,/pub/suse/update/10.1/rpm/ppc64,/pub/suse/update/10.1/rpm/src,/pub/suse/update/10.1/rpm/x86_64". Is this the only way it's supposed to work?

yes.

I was hoping to get away with something along the lines of -X rpm/ppc,rpm/src or -X ppc,src and so on.

unfortunately, you'll have to wait until 1.12, which will include advanced URL filtering.

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
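For readability, here is the invocation that works today, written out with line continuations; the recursion flags other than -X are illustrative and the exclusion list is the one from the thread:

    wget -r -np \
         -X /pub/suse/update/10.1/rpm/ppc,/pub/suse/update/10.1/rpm/ppc64,/pub/suse/update/10.1/rpm/src,/pub/suse/update/10.1/rpm/x86_64 \
         http://suse.mirrors.tds.net/pub/suse/update/10.1/

Turning the filter around with -I (include-directories) and listing only the i586, i686 and noarch paths may also be worth trying.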
Re: Bug in wget 1.10.2 makefile
Daniel Richard G. wrote:

Hello, the MAKEDEFS value in the top-level Makefile.in also needs to include DESTDIR='$(DESTDIR)'.

fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi                          http://www.tortonesi.com
University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
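For anyone hitting this: the practical symptom is that a staged installation only works if DESTDIR reaches every sub-make, which is what passing it through MAKEDEFS ensures. A sketch of the kind of invocation that exposes the problem (the staging path is arbitrary, and with GNU make the command-line override may already propagate on its own):

    ./configure --prefix=/usr/local
    make
    make install DESTDIR=/tmp/wget-stage
    # if DESTDIR is not handed down to the sub-makes, files land directly
    # under /usr/local instead of /tmp/wget-stage/usr/local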