russian codepage in wget 1.10.1

2006-07-17 Thread steelrat
I'll be brief:

-( |ver )-

Microsoft Windows XP [Version 5.1.2600]

-( |ver )-

-( |mode CON: CP /STATUS )-

Status for device CON:
--
Code page:  866

-( |mode CON: CP /STATUS )-


-( |"C:\Program Files\GnuWin32\bin\wget.exe" )-

wget: эх търчрэю URL
┬шъюЁшёЄрээ : wget [╧└╨└╠┼?╨]... [URL]...

?яЁюсєщЄх `wget --help' фы  юЄЁшьрээ  фхЄры?эю┐ │эЇюЁьрI│┐.

-( |"C:\Program Files\GnuWin32\bin\wget.exe" )-

-- 
SteelRat


Re: I got one bug on Mac OS X

2006-07-17 Thread HUAZHANG GUO

Thanks; then I am sure this is a Mac OS X Tiger-specific problem.



On Jul 15, 2006, at 7:48 PM, Steven P. Ulrick wrote:


On Sat, 15 Jul 2006 16:36:54 -0700
"Tony Lewis" <[EMAIL PROTECTED]> wrote:


I don't think that's valid HTML. According to RFC 1866: An HTML user
agent should treat end of line in any of its variations as a word
space in all contexts except preformatted text.

I don't see any provision for end of line within the HREF attribute
of an A tag.

Tony
  _

From: HUAZHANG GUO [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 11, 2006 7:48 AM
To: [EMAIL PROTECTED]
Subject: I got one bug on Mac OS X


Dear Sir/Madam,

while I was trying to download using the command:

wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/

I got most of the files, but lost some of them.

I think I know where the problem is:

if the link is broken into two lines in the index.html:

Lecture 1 (Jan 17): Exploring Conformational Space for Biomolecules

<a href="http://dasher.wustl.edu/bio5476/lectures
/lecture-01.pdf">[PDF]</a>


I will get the following error message:


--09:13:16--  http://dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf
           => `/Users/hguo/mywww//dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf'

Connecting to dasher.wustl.edu[128.252.208.48]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
09:13:16 ERROR 404: Not Found.

Please note that wget adds a special character, '%0A', to the URL.
Maybe the Windows newline has an extra character which is not
recognized by wget on the Mac.

I am using Mac OS X Tiger (Darwin).


Hello
I tested the following command:
"wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/";
on Fedora Core 5, using wget-1.10.2-3.2.1

I don't know if I got every file or not (since I know nothing about
the link that I downloaded), but I did get the file referred to in your
original post: lecture-01.pdf

Here is a link to the full output of wget:
http://www.afolkey2.net/wget.txt

and here is the output for the file that you mentioned as an example:
--19:32:16--  http://dasher.wustl.edu/bio5476/lectures/lecture-01.pdf
Reusing existing connection to dasher.wustl.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 1755327 (1.7M) [application/pdf]
Saving to: `dasher.wustl.edu/bio5476/lectures/lecture-01.pdf'
 1700K ..    100%  462K=3.9s

19:32:20 (438 KB/s) - `dasher.wustl.edu/bio5476/lectures/lecture-01.pdf' saved [1755327/1755327]

For everyone's information, I saw that the link was split into two
lines just like the OP described.  The difference between his
experience and mine, though, was that the file with a split URL that he
used as an example was downloaded just fine when I tried it.  It
appears that every PDF that has "lecture-" at the beginning of the name
has a multi-line URL on the original index.html.  In my experiment,
wget downloaded 25 PDF files that had split (multi-line) URLs.  That
appears to be all of the ones linked to on the index.html page.

Steven P. Ulrick
--
 19:28:50 up 12 days, 23:26,  2 users,  load average: 0.84, 0.86, 0.79




Re: Using --spider to check for dead links?

2006-07-17 Thread Stefan Melbinger

Hi,

First of all, thanks for the quick answer! :)

On 18.07.2006 17:34, Mauro Tortonesi wrote:

Stefan Melbinger wrote:
I need to check whole websites for dead links, with output that is easy to
parse for lists of dead links, statistics, etc. Does anybody have experience
with that problem, or has perhaps used the --spider mode for this before
(as suggested by some pages)?

historically, wget never really supported recursive --spider mode. 
fortunately, this has been fixed in 1.11-alpha-1:


How will wget react when started in recursive --spider mode? It will 
have to download, parse and delete/forget HTML pages in order to know 
where to go, but what happens with images and large files like videos, 
for example? Will wget check whether they exist?


Thanks a lot,
  Stefan

PS: The background for my question is that my company wants to check 
large websites for dead links (without using any commercial software). 
Hours of Google-searching left me with wget, which seems to have the 
best fundamentals to do this...


Re: Using --spider to check for dead links?

2006-07-17 Thread Mauro Tortonesi

Stefan Melbinger wrote:

Hello,

I need to check whole websites for dead links, with output that is easy to
parse for lists of dead links, statistics, etc. Does anybody have experience
with that problem, or has perhaps used the --spider mode for this before
(as suggested by some pages)?

For this to work, all HTML pages would have to be parsed completely,
while pictures and other files should only be HEAD-checked for existence
(in order to save bandwidth)...

Using --spider and --spider -r did not seem to be the right way to do this, I fear.

Any help is appreciated, thanks in advance!


hi stefan,

historically, wget never really supported recursive --spider mode. 
fortunately, this has been fixed in 1.11-alpha-1:


http://www.mail-archive.com/wget@sunsite.dk/msg09071.html

so, it will be included in the upcoming 1.11 release.
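for reference, here is a minimal sketch of how such a dead-link check could look with a recursive spider run. the example URL and the log file name are placeholders, and the grep step is just one way of post-processing the log, not something wget does for you:

  # crawl recursively without saving anything, keeping the full log
  wget --spider -r -l inf -o spider.log http://www.example.com/

  # list the request lines that immediately preceded a 404 response in the log
  grep -B 10 '404 Not Found' spider.log | grep '^--[0-9]'

HTML pages still have to be fetched and parsed so that the links they contain can be followed; the grep pattern depends on wget's log format and may need adjusting for other versions.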

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Using --spider to check for dead links?

2006-07-17 Thread Stefan Melbinger

Hello,

I need to check whole websites for dead links, with output that is easy to
parse for lists of dead links, statistics, etc. Does anybody have experience
with that problem, or has perhaps used the --spider mode for this before
(as suggested by some pages)?

For this to work, all HTML pages would have to be parsed completely,
while pictures and other files should only be HEAD-checked for existence
(in order to save bandwidth)...

Using --spider and --spider -r did not seem to be the right way to do this, I fear.

Any help is appreciated, thanks in advance!

Greets,
  Stefan Melbinger


Re: wget 1.11 alpha1 - content disposition filename

2006-07-17 Thread Jochen Roderburg
Quoting Hrvoje Niksic <[EMAIL PROTECTED]>:

> Jochen Roderburg <[EMAIL PROTECTED]> writes:
>
> > E.g, a file which was supposed to have the name B&W.txt came with the
> header:
> > Content-Disposition: attachment; filename=B&W.txt;
> > All programs I tried (the new wget and several browsers and my own script
> ;-)
> > seemed to stop parsing at the first semicolon and produced the filename
> B&.
>
> Unfortunately, if it doesn't work in web browsers, how can it be
> expected to work in Wget?  The server-side software should be fixed.
>

I mainly wanted to hear from some "HTTP/HTML experts" that I was correct in my
assumption that the problem here is on the server side  ;-)
Thank you, Mauro and Hrvoje, for confirming that.

Regards, J.Roderburg




Re: "login incorrect"

2006-07-17 Thread Hrvoje Niksic
Mauro Tortonesi <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>> Gisle Vanem <[EMAIL PROTECTED]> writes:
>>
>>> Kinda misleading that wget prints "login incorrect" here. Why
>>> couldn't it just print the 530 message?
>> You're completely right. It was an ancient design decision made by me
>> when I wasn't thinking enough (or was thinking the wrong thing).
>
> hrvoje, are you suggesting to extend ftp_login in order to return
> both an error code and an error message?

I didn't have an implementation strategy in mind, but extending
ftp_login sounds like a good idea.


Re: "login incorrect"

2006-07-17 Thread Mauro Tortonesi

Hrvoje Niksic wrote:

Gisle Vanem <[EMAIL PROTECTED]> writes:


Kinda misleading that wget prints "login incorrect" here. Why
couldn't it just print the 530 message?


You're completely right. It was an ancient design decision made by me
when I wasn't thinking enough (or was thinking the wrong thing).


hrvoje, are you suggesting to extend ftp_login in order to return both 
an error code and an error message?


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: wget 1.11 alpha1 - content disposition filename

2006-07-17 Thread Hrvoje Niksic
Jochen Roderburg <[EMAIL PROTECTED]> writes:

> E.g, a file which was supposed to have the name B&W.txt came with the header:
> Content-Disposition: attachment; filename=B&W.txt;
> All programs I tried (the new wget and several browsers and my own script ;-)
> seemed to stop parsing at the first semicolon and produced the filename B&.

Unfortunately, if it doesn't work in web browsers, how can it be
expected to work in Wget?  The server-side software should be fixed.


Re: Wishlist: support the file:/// protocol

2006-07-17 Thread Mauro Tortonesi

David wrote:


In replies to the post requesting support of the “file://” scheme, requests
were made for someone to provide a compelling reason to want to do this. 
Perhaps the following is such a reason.


hi david,

thank you for your interesting example. support for the “file://” scheme 
will very likely be introduced in wget 1.12.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: wget 1.11 alpha1 - content disposition filename

2006-07-17 Thread Mauro Tortonesi

Jochen Roderburg wrote:

Hi,

I was happy to see that a long-missed feature has now been implemented in this alpha,
namely the interpretation of the filename in the Content-Disposition header.
Just recently I had hacked a little script together to achieve this, when I
wanted to download a larger number of files where this was used  ;-)

I had a few cases, however, which did not come out as expected, but I think
this time the error is in the sending web application and not in wget.

E.g, a file which was supposed to have the name B&W.txt came with the header:
Content-Disposition: attachment; filename=B&W.txt;


the error is definitely in the web application. the correct header would be:

Content-Disposition: attachment; filename="B&W.txt";
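as a quick way to see which of the two forms a server actually sends, you can inspect the response headers directly. the URL below is only a placeholder, and curl is used just for brevity; wget -S --spider against the same URL prints the same response headers:

  # fetch only the headers and pick out the Content-Disposition line
  curl -sI 'http://www.example.com/download?id=42' | grep -i 'content-disposition'

  # with the filename in a quoted string, characters like '&' and ';' inside it are unambiguous:
  #   Content-Disposition: attachment; filename="B&W.txt"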


All programs I tried (the new wget and several browsers and my own script ;-)
seemed to stop parsing at the first semicolon and produced the filename B&.

Any thoughts ??


i think that the filename parsing heuristics currently implemented in 
wget are fine. you really can't do much better in this case.


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: Documentation (manpage) "bug"

2006-07-17 Thread Mauro Tortonesi

Linda Walsh wrote:

FYI:

On the manpage, where it talks about "no-proxy", the manpage
says:
--no-proxy
  Don't use proxies, even if the appropriate *_proxy environment
  variable is defined.

  For more information about the use of proxies with Wget,
   ^
  -Q quota

Note -- the sentence referring to "more information about the use of
proxies" breaks off in mid-sentence and jumps straight to "-Q quota".


fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: Excluding directories

2006-07-17 Thread Mauro Tortonesi

Post, Mark K wrote:

I'm trying to download parts of the SUSE Linux 10.1 tree.  I'm going
after things below http://suse.mirrors.tds.net/pub/suse/update/10.1/,
but I want to exclude several directories in
http://suse.mirrors.tds.net/pub/suse/update/10.1/rpm/

In that directory are the following subdirectories:
i586/
i686/
noarch/
ppc/
ppc64/
src/
x86_64/

I only want the i586, i686, and noarch directories.  I tried using the
-X parameter, but it only seems to work if I specify " -X
/pub/suse/update/10.1/rpm/ppc,/pub/suse/update/10.1/rpm/ppc64,/pub/suse/
update/10.1/rpm/src,/pub/suse/update/10.1/rpm/x86_64"

Is this the only way it's supposed to work? 


yes.

I was hoping to get away with something along the lines of -X rpm/ppc,rpm/src 
or -X ppc,src and so on.


unfortunately, you'll have to wait until 1.12, which will include 
advanced URL filtering.
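until then, the full-path form you already found is indeed what works. spelled out as a single command (the paths are the ones from your message, the other options are just a rough guess at your invocation) it would look like this:

  wget -r -np \
       -X /pub/suse/update/10.1/rpm/ppc,/pub/suse/update/10.1/rpm/ppc64,/pub/suse/update/10.1/rpm/src,/pub/suse/update/10.1/rpm/x86_64 \
       http://suse.mirrors.tds.net/pub/suse/update/10.1/

if everything you need actually lives under rpm/, the inverse approach with -I (--include-directories) listing just the i586, i686 and noarch directories would also work, but it likewise wants full paths.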


--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it


Re: Bug in wget 1.10.2 makefile

2006-07-17 Thread Mauro Tortonesi

Daniel Richard G. wrote:

Hello,

The MAKEDEFS value in the top-level Makefile.in also needs to include 
DESTDIR='$(DESTDIR)'.


fixed, thanks.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi  http://www.tortonesi.com

University of Ferrara - Dept. of Eng.http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linuxhttp://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it