Re: wget overwriting even with -c (bug?)

2004-06-12 Thread Petr Kadlec
Hm, sorry, I have just discovered that it was reported about a week 
ago (http://www.mail-archive.com/wget%40sunsite.dk/msg06527.html). I really 
did try to search for "overwrite", etc. in the archive, honestly. :-) 
But that e-mail does not use the word "overwrite" at all...

Regards,
Petr Kadlec
--
Yield to temptation, it may not pass your way again.
--
Petr Kadlec <[EMAIL PROTECTED]>
ICQ #68196926, http://mormegil.wz.cz/


wget overwriting even with -c (bug?)

2004-06-12 Thread Petr Kadlec
Hi folks!
Sometimes I experience very unpleasant behavior from wget (using a 
not-really-recent CVS version of wget 1.9, under W98SE). I have a partially 
downloaded file (usually a big one; a small file is much less likely to have 
its download interrupted), so I want to finish the download (later, not 
within the same wget run) using something like
wget -c http://www.example.com/foo
After some time, I check the progress and, to my shock, I find that the 
file is being downloaded from the start (so the previously downloaded 
data is lost).
I was able to reproduce the problem (I hope it is the same problem; the 
original problem occurs very rarely): when wget -c is issued and the first 
connection attempt fails, wget does not even send a Range header in the 
further attempts!
I have tried to track down the problem, and I have found one suspect 
(although I am not really familiar with the source, so this may not be 
correct):
In http_loop() (in http.c), there is code like the following:

/* Decide whether or not to restart.  */
hstat.restval = 0;
if (count > 1)
  hstat.restval = hstat.len; /* continue where we left off */
else if (opt.always_rest
&& stat (locf, &st) == 0
&& S_ISREG (st.st_mode))
  hstat.restval = st.st_size;
During the second attempt, hstat.len is still zero if the first connection 
attempt failed, but because count > 1 we use that zero to initialize 
hstat.restval. (And, probably also a mistaken idea: in gethttp(), the file 
open mode is selected using hs->restval ? "ab" : "wb", so that even when 
opt.always_rest is set, this command is able to rewrite the file. It could 
be a good idea to at least assert that we are not overwriting the file if 
opt.always_rest is set.)
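One way the decision could perhaps be rearranged (just a sketch, untested,
and I do not know how it interacts with the rest of http_loop()):

/* Sketch only: prefer the on-disk size whenever --continue was given,
   and fall back to hstat.len only if a previous attempt in this loop
   actually received some data.  */
hstat.restval = 0;
if (opt.always_rest
    && stat (locf, &st) == 0
    && S_ISREG (st.st_mode))
  hstat.restval = st.st_size;   /* resume from what is already on disk */
else if (count > 1 && hstat.len > 0)
  hstat.restval = hstat.len;    /* continue where the last attempt left off */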

With regards,
Petr Kadlec
--
Don't whisper/Don't talk/Don't run if you can walk -- U2
--
Petr Kadlec <[EMAIL PROTECTED]>
ICQ #68196926, http://mormegil.wz.cz/


Re: [BUG] wget 1.9.1 and below can't download >=2G file on 32bits system

2004-05-27 Thread Hrvoje Niksic
Yup; 1.9.1 cannot download large files.  I hope to fix this by the
next release.



[BUG] wget 1.9.1 and below can't download >=2G file on 32bits system

2004-05-24 Thread Zhu, Yi

Hi,

I use wget on an i386 Red Hat 9 box to download a 4 GB DVD image from an FTP
site. The process stops at:

$ wget -c --proxy=off
ftp://redhat.com/pub/fedora/linux/core/2/i386/iso/FC2-i386-DVD.iso
--12:47:24--
ftp://redhat.com/pub/fedora/linux/core/2/i386/iso/FC2-i386-DVD.iso
   => `FC2-i386-DVD.iso'
Resolving redhat.com... done.
Connecting to redhat.com[134.14.16.42]:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.==> PWD ... done.
==> TYPE I ... done.  ==> CWD /pub/fedora/linux/core/2/i386/iso ...
done.
==> SIZE FC2-i386-DVD.iso ... done.
==> PORT ... done.==> REST 2147483647 ... done.
==> RETR FC2-i386-DVD.iso ... done.
Length: 75,673,600 [-2,071,810,047 to go] (unauthoritative)

100%[==>] 2,147,483,647   --.--K/s    ETA --:--
File size limit exceeded

Note that the actual file size should be 4,370,640,896, not 75,673,600 as
shown above.
I read ftp.c and found that all the file size variables are defined as long.
I think this is what causes the problem. When I download the same image from
a 64-bit machine, it works fine.
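
As a quick illustration of the type sizes involved (a sketch only, nothing
from the wget sources; _FILE_OFFSET_BITS is the usual glibc Large File
Support macro):

/* On a 32-bit build, long is 4 bytes and tops out just below 2.1 GB,
   while Large File Support makes off_t 8 bytes.  */
#define _FILE_OFFSET_BITS 64     /* ask libc for a 64-bit off_t */
#include <stdio.h>
#include <sys/types.h>

int main (void)
{
  printf ("sizeof (long)  = %lu\n", (unsigned long) sizeof (long));
  printf ("sizeof (off_t) = %lu\n", (unsigned long) sizeof (off_t));
  return 0;
}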


Thanks,
-
Opinions expressed are those of the author and do not represent Intel
Corp.

- Zhu Yi (Chuyee)

GnuPG v1.0.6 (GNU/Linux)
http://cn.geocities.com/chewie_chuyee/gpg.txt or
$ gpg --keyserver wwwkeys.pgp.net --recv-keys 71C34820
1024D/71C34820 C939 2B0B FBCE 1D51 109A  55E5 8650 DB90 71C3 4820 


Re: Maybe a bug or something else for wget

2004-05-23 Thread Jens Rösner
Hi Ben!

Not a bug as far as I can see.
Use -A to accept only certain files.
Furthermore, the PDF and PPT files are located across various servers, so 
you need to allow wget to span hosts other than the original one with -H 
and then restrict it to only certain ones with -D.

wget -nc -x -r -l2 -p -erobots=off -t10 -w2 --random-wait --waitretry=7 -U
"Mozilla/4.03 [en] (X11; I; SunOS 5.5.1 sun4u)"
--referer="http://devresource.hp.com/drc/topics/utility_comp.jsp" -k -v
-A*.ppt,*.pdf,utility_comp.jsp -H -Dwww.hpl.hp.com,www.nesc.ac.uk  
http://devresource.hp.com/drc/topics/utility_comp.jsp

works for me. It was generated using my gui front-end to wget, so it is not
streamlined ;)

Jens



> Hi,
> How can I download all pdf and ppt file by the following url with command
> line of:
> 
> wget -k -r -l 1 http://devresource.hp.com/drc/topics/utility_comp.jsp
> 
> I am on windows 2000 server sp4 with latest update.
> 
> E:\Release>wget -V
> GNU Wget 1.9.1
> 
> Copyright (C) 2003 Free Software Foundation, Inc.
> This program is distributed in the hope that it will be useful,
> but WITHOUT ANY WARRANTY; without even the implied warranty of
> MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> GNU General Public License for more details.
> 
> Originally written by Hrvoje Niksic <[EMAIL PROTECTED]>.
> 
> 
> Thank you for your nice work.
> 
> Ben
> 




Maybe a bug or something else for wget

2004-05-23 Thread Gao, Ruidong
Hi,
How can I download all PDF and PPT files from the following URL with a
command line of:

wget -k -r -l 1 http://devresource.hp.com/drc/topics/utility_comp.jsp

I am on Windows 2000 Server SP4 with the latest updates.

E:\Release>wget -V
GNU Wget 1.9.1

Copyright (C) 2003 Free Software Foundation, Inc.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Originally written by Hrvoje Niksic <[EMAIL PROTECTED]>.


Thank you for your nice work.

Ben


Bug report: two spaces between filesize and Month

2004-05-03 Thread Iztok Saje
Hello!
I just found a "feature" in embedded system (no source) with ftp server.
In listing, there are two spaces between fileize and month.
As a consequence, wget allways thinks size is 0.
In procedure ftp_parse_unix_ls  it just steps back one blank
before cur.size is calculated.
My quick hack is just to add one more pointer and atoi,
but maybe a nicer sollution can be done.
case from .listing:
-rw-rw-rw-   0 0  0  68065  Apr 16 08:00 A20040416.0745
-rw-rw-rw-   0 0  0781  Apr 20 07:45 A20040420.0730
-rw-rw-rw-   0 0  0  59606  Apr 16 08:15 A20040416.0800
-rw-rw-rw-   0 0  0781  Apr 23 12:15 A20040423.1200
-rw-rw-rw-   0 0  0   2130  Feb  3 12:00 A20040203.1145
-rw-rw-rw-   0 0  0  33440  Apr 14 12:15 A20040414.1200
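Something along these lines might be nicer (only a sketch, not the real
ftp_parse_unix_ls code; `month' is assumed to point at the month name
inside the listing line):

#include <ctype.h>
#include <stdlib.h>

/* Walk back from the month field over any number of blanks and read
   the size field that precedes them.  */
static long
listing_size_before (const char *line, const char *month)
{
  const char *p = month;
  while (p > line && (p[-1] == ' ' || p[-1] == '\t'))
    --p;                                /* skip all blanks, not just one */
  while (p > line && isdigit ((unsigned char) p[-1]))
    --p;                                /* back over the size digits */
  return strtol (p, NULL, 10);          /* 68065, 781, ... */
}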
BR
Iztok


May not be a Bug more a nice-2-have

2004-04-07 Thread Alexander Joerg Herrmann
Dear Reader,
some may not really consider it a bug, so it is maybe
more of a nice-to-have.
When I try to mirror the Internet pages I develop,
http://www.nachttraum.de
http://www.felixfrisch.de
wget complains that Linux complains that the file
name is too long. It is not exactly a bug, as I use CGI
with a very long argument list which cannot be converted
1:1 to a valid filename. It might be nice to have a
wget switch/command-line option asking wget to
use a function like mktmpname to make up a filename of
its own instead of a URI->filename translation.
Regards,
Alexander Joerg Herrmann


=
___
http://www.newart4u.com   [EMAIL PROTECTED]
http://www.trangthailand.com




wget bug: directory overwrite

2004-04-05 Thread Juhana Sadeharju
Hello.

Problem: When downloading all in
   http://udn.epicgames.com/Technical/MyFirstHUD
wget overwrites the downloaded MyFirstHUD file with the
MyFirstHUD directory (which comes later).

GNU Wget 1.9.1
wget -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np -U Mozilla $@

Solution: use the -E option.

Regards,
Juhana


[BUG?] --include option does not use an exact match for directories

2004-03-28 Thread William Bresler
Hello.

I am using wget 1.9.1.  According to the documentation on the --include option:

 `-I' option accepts a comma-separated list of directories included
 in the retrieval.  Any other directories will simply be ignored.
 The directories are absolute paths.
from (wget.info) Directory-Based Limits

Now, this is not a complete explanation of how the option works (for 
example, it does not state that wildcards are acceptable), but a reasonable 
expectation is that a path without wildcards will match only that exact 
path.  That's not what happens.

I am attempting to mirror a portion of an FTP site so I use a command 
similar to this:

wget --mirror -I /pub/mirrors/who/product/release 
ftp://my.domain.com/pub/mirrors/who/product/

There are several unwanted directories in the specified URL so I use the -I 
option to limit them.  What I discovered is that it is possible to get 
unwanted directories as well.

For example, say the directory I specified above has the following entries:

archive
mail
release
release-old
release-ancient
tmp
All that has to happen for the entry to pass the -I filter is that the 
filter match the beginning of the target.  So, all I want is the release 
directory, but what I get is release, release-old, and release-ancient.  In 
fact, I get the same result with a value of pub/mirrors/who/product/rel for 
the -I option.  In other words, it need not actually match *any* path 
completely in order for the directory to be recursively downloaded.

To see why this was happening, I stepped through the code in a debugger.

(ftp.c L1535)ftp_retrieve_dirs() is called since this is a recursive 
get.  For each directory in the list, it calls accdir() at line 1572 to see 
whether the current item is an ACCEPTED directory.  (utils.c L771) accdir() 
strips off any leading '/' since these are supposed to be all absolute 
paths.  It then calls proclist() passing opt.includes as the match list, 
and the candidate directory.  (utils.c L746)proclist() goes through each 
item in the list.  If there are wildcards, it uses fnmatch(), else it uses 
frontcmp() to determine whether the target passes the filter.  My entry 
does not have a wildcard, so it uses frontcmp().  (utils.c L737)frontcmp() 
scans the strings as long as it has not reached the end of either, and the 
current character in each is equal.  If it reaches the end of the first 
string (which is the entry from opt.includes) it returns 1, else 0.

Now, I can understand that the intent was to allow files in deeper 
subdirectories to match the Include filter without needing to isolate the 
path elements further.  For example:

   with -I /pub/mirrors/who/product/release  as before, all files in

/pub/mirrors/who/product/release/foo/
/pub/mirrors/who/product/release/bar/
/pub/mirrors/who/product/release/baz   etc.
will be accepted because they all begin with the given -I value.

But, I would suggest that at least for non-wildcard matches, the prefix 
should 'match' only if it is a path prefix which breaks at a path element 
separator (including the end-of-string signifying an exact match).  Better 
would be to include wildcard matches but that might be harder since it 
needs to have an implicit anchor of '/' or end-of-string which is not 
something the globbing RE engine can handle.  I had a look at the latest 
Single UNIX Specification to see if I could find any words of wisdom there 
about the fnmatch() function's capabilities.  The specification is a bit 
vague in parts, but it does say that a slash must not be matched by either 
a '?' or '*' wildcard, or even by a character class; it must be explicitly 
included in the search pattern.

I tried putting a trailing '/' on the -I value, but the action function of 
cmd_directory_vector() invoked on the value trims any trailing '/'.  So, 
there does not seem to be a way to force the match of a path prefix which 
consists only of full path elements.  Use of frontcmp() makes non-wildcard 
-I values behave the same as if there were a trailing '*' and there is no 
way to retain a trailing '/' in the pattern.

The only possible way to (at least temporarily) achieve the effect I want 
is to enumerate all unwanted paths, where they have a prefix which matches 
any of the -I matches, as values to -X.  This works only as long as nobody 
puts a new entry on the remote site which matches the -I values, but is not 
in the -X values, something which I cannot control.  So, again, I say this 
is a bug.

I see that frontcmp() is also called by (recur.c)download_child_p, which is 
an HTTP function, so any possible patch would probably need to just create 
a new function in utils.c solely for use in FTP directory matching.  It's 
only a two-line function and it's only used once in utils.c, so the impact 
will be small.
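
For example, something like this (the name dir_prefix_match is made up, and
this is untested):

#include <string.h>

/* Like frontcmp(), but the prefix only matches if it ends at a path
   element boundary, so "/pub/mirrors/who/product/release" no longer
   accepts "/pub/mirrors/who/product/release-old".  */
static int
dir_prefix_match (const char *prefix, const char *path)
{
  size_t len = strlen (prefix);
  if (strncmp (prefix, path, len) != 0)
    return 0;                           /* not a prefix at all */
  return path[len] == '\0'              /* exact match, or ...       */
         || path[len] == '/'            /* ... breaks at a separator */
         || (len > 0 && prefix[len - 1] == '/');
}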

wget bug report

2004-03-26 Thread Corey Henderson
I sent this message to [EMAIL PROTECTED] as directed in the wget man page, but it 
bounced and said to try this email address.

This bug report is for GNU Wget 1.8.2 tested on both RedHat Linux 7.3 and 9

rpm -q wget
wget-1.8.2-9

When I use wget with -S to show the HTTP headers, and I use the --spider switch as 
well, I get a 501 error from some servers.

The main example I have found was doing it against a server running ntop.

http://www.ntop.org/

You can find an RPM for it at:

http://rpm.pbone.net/index.php3/stat/4/idpl/586625/com/ntop-2.2-0.dag.rh90.i386.rpm.html

You can search with other parameters at rpm.pbone.net to get ntop for other versions 
of Linux.

So here is the command and output:

wget -S --spider http://SERVER_WITH_NTOP:3000

HTTP request sent, awaiting response...
 1 HTTP/1.0 501 Not Implemented
 2 Date: Sat, 27 Mar 2004 07:08:24 GMT
 3 Cache-Control: no-cache
 4 Expires: 0
 5 Connection: close
 6 Server: ntop/2.2 (Dag Apt RPM Repository) (i686-pc-linux-gnu)
 7 Content-Type: text/html
21:11:56 ERROR 501: Not Implemented.

I get a 501 error. Echoing $? shows an exit status of 1.

When I don't use the spider, I get the following:

wget -S http://SERVER_WITH_NTOP:3000

HTTP request sent, awaiting response...
 1 HTTP/1.0 200 OK
 2 Date: Sat, 27 Mar 2004 07:09:31 GMT
 3 Cache-Control: max-age=3600, must-revalidate, public
 4 Connection: close
 5 Server: ntop/2.2 (Dag Apt RPM Repository) (i686-pc-linux-gnu)
 6 Content-Type: text/html
 7 Last-Modified: Mon, 17 Mar 2003 20:27:49 GMT
 8 Accept-Ranges: bytes
 9 Content-Length: 1214

100%[==>] 1,214  1.16M/s    ETA 00:00

21:13:04 (1.16 MB/s) - `index.html' saved [1214/1214]



The exit status was 0 and the index.html file was downloaded.

If this is a bug, please fix it in your next release of wget. If it is not a bug, I 
would appreciate a brief explanation as to why.

Thank You

Corey Henderson
Chief Programmer
GlobalHost.com

Re: Bug report

2004-03-24 Thread Hrvoje Niksic
Juhana Sadeharju <[EMAIL PROTECTED]> writes:

> Command: "wgetdir http://liarliar.sourceforge.net";.
> Problem: Files are named as
>   content.php?content.2
>   content.php?content.3
>   content.php?content.4
> which are interpreted, e.g., by Nautilus as manual pages and are
> displayed as plain texts. Could the files and the links to them
> renamed as the following?
>   content.php?content.2.html
>   content.php?content.3.html
>   content.php?content.4.html

Use the option `--html-extension' (-E).

> After all, are those pages still php files or generated html files?
> If they are html files produced by the php files, then it could be a
> good idea to add a new extension to the files.

They're the latter -- HTML files produced by the server-side PHP code.

> Command: "wgetdir 
> http://www.newtek.com/products/lightwave/developer/lscript2.6/index.html";
> Problem: Images are not downloaded. Perhaps because the image links
> are the following:
>   

I've never seen this tag, but it seems to be the same as IMG.  Mozilla
seems to grok it and its DOM inspector thinks it has seen IMG.  Is
this tag documented anywhere?  Does IE understand it too?



Bug report

2004-03-24 Thread Juhana Sadeharju
Hello. This is a report on some wget bugs. My wgetdir command looks like
the following (wget 1.9.1):
wget -k --proxy=off -e robots=off --passive-ftp -q -r -l 0 -np -U Mozilla $@

Bugs:

Command: "wgetdir http://www.directfb.org";.
Problem: In file "www.directfb.org/index.html" the hrefs of type
  "/screenshots/index.xml" was not converted to relative
  with "-k" option.

Command: "wgetdir http://threedom.sourceforge.net";.
Problem: In file "threedom.sourceforge.net/index.html" the
hrefs were not converted to relative with "-k" option.

Command: "wgetdir http://liarliar.sourceforge.net";.
Problem: Files are named as
  content.php?content.2
  content.php?content.3
  content.php?content.4
which are interpreted, e.g., by Nautilus as manual pages and are
displayed as plain texts. Could the files and the links to them
renamed as the following?
  content.php?content.2.html
  content.php?content.3.html
  content.php?content.4.html
After all, are those pages still php files or generated html files?
If they are html files produced by the php files, then it could
be a good idea to add a new extension to the files.

Command: "wgetdir 
http://www.newtek.com/products/lightwave/developer/lscript2.6/index.html";
Problem: Images are not downloaded. Perhaps because the image links
are the following:
  

Regards,
Juhana


wget bug in retrieving large files > 2 gig

2004-03-09 Thread Eduard Boer
Hi,

While downloading a file of about 3,234,550,172 bytes with "wget 
http://foo/foo.mpg" I get an error:

HTTP request sent, awaiting response... 200 OK
Length: unspecified [video/mpeg]
    [ <=> ] -1,060,417,124   13.10M/s

wget: retr.c:292: calc_rate: Assertion `bytes >= 0' failed.
Aborted
The md5sum of the downloaded and the original file is the same! So there should 
not be an error.
The amount of 'bytes downloaded' shown during the transfer is not correct either: 
it becomes negative over 2 GB.

greetings from the Netherlands,
Eduard



Re: Bug in wget: cannot request urls with double-slash in the query string

2004-03-05 Thread Hrvoje Niksic
D Richard Felker III <[EMAIL PROTECTED]> writes:

>> The request log shows that the slashes are apparently respected.
>
> I retried a test case and found the same thing -- the slashes were
> respected.

OK.

> Then I remembered that I was using -i. Wget seems to work fine with
> the url on the command line; the bug only happens when the url is
> passed in with:
>
> cat < http://...
> EOF

But I cannot repeat that, either.  As long as the consecutive slashes
are in the query string, they're not stripped.

> Using this method is necessary since it is the ONLY secure way I
> know of to do a password-protected http request from a shell script.

Yes, that is the best way to do it.



Re: Bug in wget: cannot request urls with double-slash in the query string

2004-03-04 Thread D Richard Felker III
On Mon, Mar 01, 2004 at 07:25:52PM +0100, Hrvoje Niksic wrote:
> >> > Removing the offending code fixes the problem, but I'm not sure if
> >> > this is the correct solution. I expect it would be more correct to
> >> > remove multiple slashes only before the first occurrance of ?, but
> >> > not afterwards.
> >> 
> >> That's exactly what should happen.  Please give us more details, if
> >> possible accompanied by `-d' output.
> >
> > If you'd still like details now that you know the version I was
> > using, let me know and I'll be happy to do some tests.
> 
> Yes please.  For example, this is how it works for me:
> 
> $ /usr/bin/wget -d "http://www.xemacs.org/something?redirect=http://www.cnn.com"
> DEBUG output created by Wget 1.8.2 on linux-gnu.
> 
> --19:23:02--  http://www.xemacs.org/something?redirect=http://www.cnn.com
>=> `something?redirect=http:%2F%2Fwww.cnn.com'
> Resolving www.xemacs.org... done.
> Caching www.xemacs.org => 199.184.165.136
> Connecting to www.xemacs.org[199.184.165.136]:80... connected.
> Created socket 3.
> Releasing 0x8080b40 (new refcount 1).
> ---request begin---
> GET /something?redirect=http://www.cnn.com HTTP/1.0
> User-Agent: Wget/1.8.2
> Host: www.xemacs.org
> Accept: */*
> Connection: Keep-Alive
> 
> ---request end---
> HTTP request sent, awaiting response...
> ...
> 
> The request log shows that the slashes are apparently respected.

I retried a test case and found the same thing -- the slashes were
respected. Then I remembered that I was using -i. Wget seems to work
fine with the url on the command line; the bug only happens when the
url is passed in with:

cat <

Re: bug in use index.html

2004-03-04 Thread Dražen Kačar
Hrvoje Niksic wrote:
> The whole matter of conversion of "/" to "/index.html" on the file
> system is a hack.  But I really don't know how to better represent
> empty trailing file name on the file system.

Another, for now rather limited, hack: on file systems which support some
sort of file attributes you can mark index.html as an unwanted child of an
empty trailing file name. AFAIK, that should work at least on Solaris and
Linux. Others will join the club one day, I hope.

-- 
 .-.   .-.Yes, I am an agent of Satan, but my duties are largely
(_  \ /  _)   ceremonial.
 |
 |[EMAIL PROTECTED]


Re: bug in use index.html

2004-03-04 Thread Hrvoje Niksic
The whole matter of conversion of "/" to "/index.html" on the file
system is a hack.  But I really don't know how to better represent
empty trailing file name on the file system.



bug in use index.html

2004-03-04 Thread Василевский Сергей
Good day!
I use wget 1.9.1.

By default, wget converts all links to the site root "/" or "somedomain.com/"
into "/index.html" or "somedomain.com/index.html".
But some sites don't use index.html as the default page, and if you use
timestamping and continue downloading the site in more than one session:

1. The first time, wget downloads index.html - all OK - it uses the data from
"/".
2. The second time, wget checks whether index.html needs to be downloaded, but
the site doesn't have that file (the request may return data from the hosting
provider or ...)

I think wget must perform the reverse operation, "/index.html" -> "/",
before trying to download or check it.
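
Something like this, perhaps (only a sketch of the idea; the helper name is
made up, and it does not distinguish an index.html that wget itself added
from one that really exists on the server):

#include <string.h>

/* Before time-stamp checking a local file, strip a trailing "index.html"
   that was appended for a URL ending in "/".  */
static void
strip_added_index (char *url)
{
  size_t len = strlen (url);
  const char suffix[] = "index.html";
  size_t slen = sizeof suffix - 1;
  if (len > slen
      && strcmp (url + len - slen, suffix) == 0
      && url[len - slen - 1] == '/')
    url[len - slen] = '\0';             /* ".../index.html" -> ".../" */
}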

_
Open Contact Ltd, Belarus
[EMAIL PROTECTED]
www.open.by 



Re: Bug in wget: cannot request urls with double-slash in the query string

2004-03-01 Thread Hrvoje Niksic
D Richard Felker III <[EMAIL PROTECTED]> writes:

>> > Think of something like http://foo/bar/redirect.cgi?http://...
>> > wget translates this into: [...]
>> 
>> Which version of Wget are you using?  I think even Wget 1.8.2 didn't
>> collapse multiple slashes in query strings, only in paths.
>
> I was using 1.8.2 and noticed the problem, so I upgraded to 1.9.1
> and it persisted.

OK.

>> > Removing the offending code fixes the problem, but I'm not sure if
>> > this is the correct solution. I expect it would be more correct to
>> > remove multiple slashes only before the first occurrance of ?, but
>> > not afterwards.
>> 
>> That's exactly what should happen.  Please give us more details, if
>> possible accompanied by `-d' output.
>
> If you'd still like details now that you know the version I was
> using, let me know and I'll be happy to do some tests.

Yes please.  For example, this is how it works for me:

$ /usr/bin/wget -d "http://www.xemacs.org/something?redirect=http://www.cnn.com"
DEBUG output created by Wget 1.8.2 on linux-gnu.

--19:23:02--  http://www.xemacs.org/something?redirect=http://www.cnn.com
   => `something?redirect=http:%2F%2Fwww.cnn.com'
Resolving www.xemacs.org... done.
Caching www.xemacs.org => 199.184.165.136
Connecting to www.xemacs.org[199.184.165.136]:80... connected.
Created socket 3.
Releasing 0x8080b40 (new refcount 1).
---request begin---
GET /something?redirect=http://www.cnn.com HTTP/1.0
User-Agent: Wget/1.8.2
Host: www.xemacs.org
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response...
...

The request log shows that the slashes are apparently respected.



Re: Bug in wget: cannot request urls with double-slash in the query string

2004-03-01 Thread D Richard Felker III
On Mon, Mar 01, 2004 at 03:36:55PM +0100, Hrvoje Niksic wrote:
> D Richard Felker III <[EMAIL PROTECTED]> writes:
> 
> > The following code in url.c makes it impossible to request urls that
> > contain multiple slashes in a row in their query string:
> [...]
> 
> That code is removed in CVS, so multiple slashes now work correctly.
> 
> > Think of something like http://foo/bar/redirect.cgi?http://...
> > wget translates this into: [...]
> 
> Which version of Wget are you using?  I think even Wget 1.8.2 didn't
> collapse multiple slashes in query strings, only in paths.

I was using 1.8.2 and noticed the problem, so I upgraded to 1.9.1 and
it persisted.

> > Removing the offending code fixes the problem, but I'm not sure if
> > this is the correct solution. I expect it would be more correct to
> > remove multiple slashes only before the first occurrance of ?, but
> > not afterwards.
> 
> That's exactly what should happen.  Please give us more details, if
> possible accompanied by `-d' output.

If you'd still like details now that you know the version I was using,
let me know and I'll be happy to do some tests.

Rich



Re: Bug in wget: cannot request urls with double-slash in the query string

2004-03-01 Thread Hrvoje Niksic
D Richard Felker III <[EMAIL PROTECTED]> writes:

> The following code in url.c makes it impossible to request urls that
> contain multiple slashes in a row in their query string:
[...]

That code is removed in CVS, so multiple slashes now work correctly.

> Think of something like http://foo/bar/redirect.cgi?http://...
> wget translates this into: [...]

Which version of Wget are you using?  I think even Wget 1.8.2 didn't
collapse multiple slashes in query strings, only in paths.

> Removing the offending code fixes the problem, but I'm not sure if
> this is the correct solution. I expect it would be more correct to
> remove multiple slashes only before the first occurrance of ?, but
> not afterwards.

That's exactly what should happen.  Please give us more details, if
possible accompanied by `-d' output.



Bug in wget: cannot request urls with double-slash in the query string

2004-02-29 Thread D Richard Felker III
The following code in url.c makes it impossible to request urls that
contain multiple slashes in a row in their query string:

  else if (*h == '/')
    {
      /* Ignore empty path elements.  Supporting them well is hard
         (where do you save "http://x.com///y.html"?), and they
         don't bring any practical gain.  Plus, they break our
         filesystem-influenced assumptions: allowing them would
         make "x/y//../z" simplify to "x/y/z", whereas most people
         would expect "x/z".  */
      ++h;
    }

Think of something like http://foo/bar/redirect.cgi?http://...
wget translates this into:

http://foo/bar/redirect.cgi?http:/...

and then the web server of course gives an error. Note that the
problem occurs even if the slashes were url escaped, since wget
unescapes them.

Removing the offending code fixes the problem, but I'm not sure if
this is the correct solution. I expect it would be more correct to
remove multiple slashes only before the first occurrence of ?, but not
afterwards.
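
Something like the following might capture that (a sketch only, operating on
the part of the URL after the host name; it is not meant as a drop-in
replacement for the url.c code):

#include <string.h>

/* Collapse runs of '/' in the path part only, leaving everything from
   the first '?' onwards untouched.  */
static void
collapse_path_slashes (char *path)
{
  char *query = strchr (path, '?');     /* start of the query string, if any */
  char *src = path, *dst = path;
  while (*src)
    {
      *dst++ = *src;
      if (*src == '/' && (!query || src < query))
        while (src[1] == '/')
          ++src;                        /* skip empty path elements */
      ++src;
    }
  *dst = '\0';
}

With this, "/bar//redirect.cgi?http://foo//baz" becomes
"/bar/redirect.cgi?http://foo//baz": the path is cleaned up but the query
string keeps its double slashes.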

Rich



Re: bug in connect.c

2004-02-06 Thread Hrvoje Niksic
Manfred Schwarb <[EMAIL PROTECTED]> writes:

>> Interesting.  Is it really necessary to zero out sockaddr/sockaddr_in
>> before using it?  I see that some sources do it, and some don't.  I
>> was always under the impression that, as long as you fill the relevant
>> members (sin_family, sin_addr, sin_port), other initialization is not
>> necessary.  Was I mistaken, or is this something specific to FreeBSD?
>>
>> Do others have experience with this?
>
> e.g. look at http://cvs.tartarus.org/putty/unix/uxnet.c
>
> putty encountered the very same problem ...

Amazing.  This obviously doesn't show up when binding to remote
addresses, or it would have been noticed ages ago.

Thanks for the pointer.  This patch should fix the problem in the CVS
version:

2004-02-06  Hrvoje Niksic  <[EMAIL PROTECTED]>

* connect.c (sockaddr_set_data): Zero out
sockaddr_in/sockaddr_in6.  Apparently BSD-derived stacks need this
when binding a socket to local address.

Index: src/connect.c
===================================================================
RCS file: /pack/anoncvs/wget/src/connect.c,v
retrieving revision 1.62
diff -u -r1.62 connect.c
--- src/connect.c   2003/12/12 14:14:53 1.62
+++ src/connect.c   2004/02/06 16:59:01
@@ -87,6 +87,7 @@
 case IPV4_ADDRESS:
   {
struct sockaddr_in *sin = (struct sockaddr_in *)sa;
+   xzero (*sin);
sin->sin_family = AF_INET;
sin->sin_port = htons (port);
sin->sin_addr = ADDRESS_IPV4_IN_ADDR (ip);
@@ -96,6 +97,7 @@
 case IPV6_ADDRESS:
   {
struct sockaddr_in6 *sin6 = (struct sockaddr_in6 *)sa;
+   xzero (*sin6);
sin6->sin6_family = AF_INET6;
sin6->sin6_port = htons (port);
sin6->sin6_addr = ADDRESS_IPV6_IN6_ADDR (ip);


Re: bug in connect.c

2004-02-06 Thread Manfred Schwarb
> Interesting.  Is it really necessary to zero out sockaddr/sockaddr_in
> before using it?  I see that some sources do it, and some don't.  I
> was always under the impression that, as long as you fill the relevant
> members (sin_family, sin_addr, sin_port), other initialization is not
> necessary.  Was I mistaken, or is this something specific to FreeBSD?
> Do others have experience with this?


e.g. look at http://cvs.tartarus.org/putty/unix/uxnet.c

putty encountered the very same problem ...

regards
manfred


Re: bug in connect.c

2004-02-04 Thread Hrvoje Niksic
"francois eric" <[EMAIL PROTECTED]> writes:

> after some test:
> bug is when: ftp, with username and password, with bind address specifyed
> bug is not when: http, ftp without username and password
> looks like memory leaks. so i made some modification before bind:
> src/connect.c:
> --
> ...
>   /* Bind the client side to the requested address. */
>   wget_sockaddr bsa;
> //!
>   memset (&bsa,0,sizeof(bsa));
> /!!
>   wget_sockaddr_set_address (&bsa, ip_default_family, 0, &bind_address);
>   if (bind (sock, &bsa.sa, sockaddr_len ()))
> ..
> --
> after it all downloads become sucesfull.
> i think better do memset in wget_sockaddr_set_address, but it is for your
> choose.

Interesting.  Is it really necessary to zero out sockaddr/sockaddr_in
before using it?  I see that some sources do it, and some don't.  I
was always under the impression that, as long as you fill the relevant
members (sin_family, sin_addr, sin_port), other initialization is not
necessary.  Was I mistaken, or is this something specific to FreeBSD?

Do others have experience with this?



bug in connect.c

2004-02-03 Thread francois eric
Problem: wget can't download with the command
wget --bind-address=Your.External.Ip.Address -d -c -v -b -a "logo.txt"
ftp://anonymous:[EMAIL PROTECTED]/incoming/Xenos/knigi/Programming/LinuxUni
x/SHELL/PICTURES/LOGO.GIF
logo.txt contains:
--
DEBUG output created by Wget 1.9.1 on freebsd4.5.
--11:36:30--
ftp://anonymous:[EMAIL PROTECTED]/incoming/Xenos/knigi/Programming/Li
nuxUnix/SHELL/PICTURES/LOGO.GIF
  => `LOGO.GIF'
Connecting to 193.233.88.66:21... Releasing 0x807a0d0 (new refcount 0).
Deleting unused 0x807a0d0.
Closing fd 4
failed: Can't assign requested address.
Releasing 0x807a0b0 (new refcount 0).
Deleting unused 0x807a0b0.
Retrying.
.
--
So the failure is in the bind command.  I tested the same command, but
without --bind-address (logo.gif appeared on my hdd):
--
DEBUG output created by Wget 1.9.1 on freebsd4.5.
--11:39:22--
ftp://anonymous:[EMAIL PROTECTED]/incoming/Xenos/knigi/Programming/Li
nuxUnix/SHELL/PICTURES/LOGO.GIF
  => `LOGO.GIF'
Connecting to 193.233.88.66:21... connected.
Created socket 4.
Releasing 0x807a0a0 (new refcount 0).
Deleting unused 0x807a0a0.
Logging in as anonymous ... 220 diamond.stup.ac.ru FTP server (Version
wu-2.6.2-8) ready.
...
--
After some tests:
the bug appears with: ftp, with username and password, with a bind address specified
the bug does not appear with: http, or ftp without username and password
It looks like an uninitialized-memory problem, so I made a modification before the bind:
src/connect.c:
--
...
  /* Bind the client side to the requested address. */
  wget_sockaddr bsa;
  memset (&bsa, 0, sizeof (bsa));   /* added: zero the sockaddr first */
  wget_sockaddr_set_address (&bsa, ip_default_family, 0, &bind_address);
  if (bind (sock, &bsa.sa, sockaddr_len ()))
..
--
After this, all downloads become successful.
I think it would be better to do the memset in wget_sockaddr_set_address, but that
is for you to choose.
best regards
p.s. sorry for my english 8(



Re: bug report

2004-01-28 Thread Hrvoje Niksic
You are right, it's a bug.  -O is implemented in a weird way, which
makes it work strangely with features such as timestamping and link
conversion.  I plan to fix it when I get around to revamping the file
name generation support for grokking the Content-Disposition header.


BUG : problem of date with wget

2004-01-27 Thread Olivier RAMIARAMANANA (Ste Thales IS)
** High Priority **

Hi

On my AIX server I use wget with this command:
/usr/local/bin/wget http://www.???.?? -O /exploit/log/test.log

but when I look at my file "test.log", its date is January 30 2003 ???
That's incredible.

What's the problem, please?

Regards

olivier


Re: wget bug with ftp/passive

2004-01-22 Thread Hrvoje Niksic
don <[EMAIL PROTECTED]> writes:

> I did not specify the "passive" option, yet it appears to have been used
> anyway Here's a short transcript:
>
> [EMAIL PROTECTED] sim390]$ wget ftp://musicm.mcgill.ca/sim390/sim390dm.zip
> --21:05:21--  ftp://musicm.mcgill.ca/sim390/sim390dm.zip
>=> `sim390dm.zip'
> Resolving musicm.mcgill.ca... done.
> Connecting to musicm.mcgill.ca[132.206.120.4]:21... connected.
> Logging in as anonymous ... Logged in!
> ==> SYST ... done.==> PWD ... done.
> ==> TYPE I ... done.  ==> CWD /sim390 ... done.
> ==> PASV ...
> Cannot initiate PASV transfer.

Are you sure that something else hasn't done it for you?  For example,
a system-wide initialization file `/usr/local/etc/wgetrc' or
`/etc/wgetrc'.


wget bug with ftp/passive

2004-01-21 Thread don
Hello,
I think I've come across a little bug in wget when using it to get a file
via ftp.

I did not specify the "passive" option, yet it appears to have been used
anyway. Here's a short transcript:

[EMAIL PROTECTED] sim390]$ wget ftp://musicm.mcgill.ca/sim390/sim390dm.zip
--21:05:21--  ftp://musicm.mcgill.ca/sim390/sim390dm.zip
   => `sim390dm.zip'
Resolving musicm.mcgill.ca... done.
Connecting to musicm.mcgill.ca[132.206.120.4]:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.==> PWD ... done.
==> TYPE I ... done.  ==> CWD /sim390 ... done.
==> PASV ...
Cannot initiate PASV transfer.
==> PORT ... done.==> RETR sim390dm.zip ... done.

[EMAIL PROTECTED] sim390]$ man wget


As you can see, PASV was attempted (and failed)

I was looking for an option to prevent passive mode.

[EMAIL PROTECTED] sim390]$ wget --version
GNU Wget 1.8.2

Running on Fedora Core 1.

Regards,
Don Russell


Re: wget bug

2004-01-12 Thread Hrvoje Niksic
Kairos <[EMAIL PROTECTED]> writes:

> $ cat wget.exe.stackdump
[...]

What were you doing with Wget when it crashed?  Which version of Wget
are you running?  Was it compiled for Cygwin or natively for Windows?


wget bug

2004-01-06 Thread Kairos
$ cat wget.exe.stackdump
Exception: STATUS_ACCESS_VIOLATION at eip=77F51BAA
eax= ebx= ecx=0700 edx=610CFE18 esi=610CFE08 edi=
ebp=0022F7C0 esp=0022F74C program=C:\nonspc\cygwin\bin\wget.exe
cs=001B ds=0023 es=0023 fs=0038 gs= ss=0023
Stack trace:
Frame Function  Args
0022F7C0  77F51BAA  (000CFE08, 6107C8F1, 610CFE08, )
0022FBA8  77F7561D  (1004D9C0, , 0022FC18, 00423EF8)
0022FBB8  00424ED9  (1004D9C0, 0022FBF0, 0001, 0022FBF0)
0022FC18  00423EF8  (1004A340, 002A, 7865646E, 6D74682E)
0022FD38  0041583B  (1004A340, 0022FD7C, 0022FD80, 100662C8)
0022FD98  00420D93  (10066318, 0022FDEC, 0022FDF0, 100662C8)
0022FE18  0041EB7D  (10021A80, 0041E460, 610CFE40, 0041C2F4)
0022FEF0  0041C47B  (0004, 61600B64, 10020330, 0022FF24)
0022FF40  61005018  (610CFEE0, FFFE, 07E4, 610CFE04)
0022FF90  610052ED  (, , 0001, )
0022FFB0  00426D41  (0041B7D0, 037F0009, 0022FFF0, 77E814C7)
0022FFC0  0040103C  (0001, 001D, 7FFDF000, F6213CF0)
0022FFF0  77E814C7  (00401000, , 78746341, 0020)
End of stack trace


bug report

2003-12-30 Thread Vlada Macek

Hi again,

I found something that could be called a bug.

The command line and the output (shortened):

$ wget -k www.seznam.cz
--14:14:28--  http://www.seznam.cz/
   => `index.html'
Resolving www.seznam.cz... done.
Connecting to www.seznam.cz[212.80.76.18]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

[ <=> ] 19,975 3.17M/s

14:14:28 (3.17 MB/s) - `index.html' saved [19975]

Converting index.html... 5-123
Converted 1 files in 0.01 seconds.

---
That is, the newly created file really is link-converted.

Now I run:

$ wget -k -O myfile www.seznam.cz
--14:16:07--  http://www.seznam.cz/
   => `myfile'
Resolving www.seznam.cz... done.
Connecting to www.seznam.cz[212.80.76.3]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

[ <=> ] 19,980 3.18M/s

14:16:07 (3.18 MB/s) - `myfile' saved [19980]

index.html.1: No such file or directory
Converting index.html.1... nothing to do.
Converted 1 files in 0.00 seconds.

---
Now myfile is created and then wget tries to convert index.html.1, i.e. the 
file it normally *would* create if there was no -O option... 

When I want the content to be sent to stdout (-O -), this postponed 
conversion is again run on index.html.1, which is totally wrong; all the 
content has already been sent to stdout.

Not only is my content not link-converted; isn't there a possibility that
wget could inadvertently garble files on disk that it has nothing to do with?

Vlada Macek




Re: Maybe a bug?

2003-12-28 Thread Jens Rösner
Hi!

Well, the message you got really tells you to have a 
look at the user agreement.
So I did.
http://www.quickmba.com/site/agreement/
clearly explains why your download failed under the point "Acceptable Use"

As long as you have wget identifying itself as wget, 
you probably will not get any files.

CU
Jens

PS: http://www.quickmba.com/robots.txt does not even list wget (yet?)



> I'm playing around with the wget tool and I ran into this website that I
> don't believe the "-e robots=off" works.  http://www.quickmba.com/ any
> idea
> why?
> 
>  
> 
> I've tried a few combinations and I keep on getting this message in the
> response.
> 
>  
> 
> We're sorry, but the way that you have attempted to access this site is
> not
> permitted by the QuickMBA User Agreement. Please contact us via the site
> contact form if you have any questions or believe that you have received
> this message in error.
> 
>  
> 
> Any ideas why?
> 
>  
> 
> 

-- 
+++ GMX - die erste Adresse für Mail, Message, More +++
Neu: Preissenkung für MMS und FreeMMS! http://www.gmx.net




Maybe a bug?

2003-12-28 Thread James Li-Chung Chen
I'm playing around with the wget tool and I ran into a website where I
don't believe "-e robots=off" works: http://www.quickmba.com/. Any idea
why?

 

I've tried a few combinations and I keep on getting this message in the
response.

 

We're sorry, but the way that you have attempted to access this site is not
permitted by the QuickMBA User Agreement. Please contact us via the site
contact form if you have any questions or believe that you have received
this message in error.

 

Any ideas why?

 



isn't it a little bug?

2003-12-23 Thread piotrek
Hi, I've just noticed a weird behavior of wget 1.8.2 while downloading a 
partial file with the command:
wget http://ardownload.adobe.com/pub/adobe/acrobatreader/unix/5.x/
linux-508.tar.gz -c
The connection was very unstable, so it had to reconnect many times. What I 
noticed is not a big thing, just incorrect information, because (as you can 
see below) the file wasn't really fully retrieved.

Piotr Maj


--20:13:50--  http://ardownload.adobe.com/pub/adobe/acrobatreader/unix/5.x/
linux-508.tar.gz
  (try: 6) => `linux-508.tar.gz'
Connecting to ardownload.adobe.com[192.150.18.28]:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 9,285,902 (1,970,591 to go) [application/x-gzip]

79% [>] 7,375,135  9.53K/s    ETA 03:15

20:13:57 (9.53 KB/s) - Connection closed at byte 7375135. Retrying.

--20:14:03--  http://ardownload.adobe.com/pub/adobe/acrobatreader/unix/5.x/
linux-508.tar.gz
  (try: 7) => `linux-508.tar.gz'
Connecting to ardownload.adobe.com[192.150.18.28]:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 9,285,902 (1,910,767 to go) [application/x-gzip]

79% [>] 7,380,618  4.40K/s    ETA 07:02

20:20:46 (4.40 KB/s) - Connection closed at byte 7380618. Retrying.

--20:20:53--  http://ardownload.adobe.com/pub/adobe/acrobatreader/unix/5.x/
linux-508.tar.gz
  (try: 8) => `linux-508.tar.gz'
Connecting to ardownload.adobe.com[192.150.18.28]:80... connected.
HTTP request sent, awaiting response... 503 Service Unavailable

The file is already fully retrieved; nothing to do.



bug? different behavior of "wget" and "lwp-request (GET)"

2003-12-17 Thread Diego Puppin
Dear all,

I have found that some pages give a different reply when queried with
different tools. For instance:

wget http://groups.yahoo.com/group/sammydavisjr/message/56

retrieves a redirection header (HTTP 302), and wget then follows the
redirection. On the other hand,

GET http://groups.yahoo.com/group/sammydavisjr/message/56

retrieves a standard page (HTTP 200).
Is this a bug (of GET, wget?) or a "feature"?

I noticed this problem when testing two different Java programs that
download pages from a URL. One uses a raw Java socket, the other uses Java
URLConnection. Well, **even if the requests and the headers sent are the
same, byte by byte**, the two programs get different results. In
particular, the raw socket communication behaves like "wget", while the one
using Java URLConnection behaves like "GET".

What do you think?
Thank you, guys
Diego




Bug in 1.9.1? ftp not following symlinks

2003-12-09 Thread Manfred Schwarb
Hi,
I tried to download the following:
wget 
ftp://ftp.suse.com/pub/suse/i386/7.3/full-names/src/traceroute-nanog_6.1.1-94.src.rpm

This is a symbolic link.
When downloading just this single file, wget should follow the link, but it
only creates a symbolic link.
excerpt from "man wget", section --retr-symlinks:
   Note that when retrieving a file (not a directory)
   because it was specified on the command-line, rather
   than because it was recursed to, this option has no
   effect.  Symbolic links are always traversed in this
   case.
but:
wget --retr-symlinks 
ftp://ftp.suse.com/pub/suse/i386/7.3/full-names/src/traceroute-nanog_6.1.1-94.src.rpm
does get the file correctly.

This occurs only when using "timestamping = on" in wgetrc; without it, 
everything works OK.

regards
manfred


Re: non-subscribers have to confirm each message to bug-wget

2003-11-18 Thread Hrvoje Niksic
Dan Jacobson <[EMAIL PROTECTED]> writes:

>>> And stop making me have to confirm each and every mail to this list.
>
> Hrvoje> Currently the only way to avoid confirmations is to
> Hrvoje> subscribe to the list.  I'll try to contact the list owners
> Hrvoje> to see if the mechanism can be improved.
>
> subscribe me with the "nomail" option, if it can't be fixed.

I can't (easily) subscribe other people to the list.  The best I can
do is ask the list owners to come up with a whitelisting policy.

> often I come back from a long vacation, only to find my last reply
> is waiting for confirmation, that probably expired.

Sorry about that.


Re: non-subscribers have to confirm each message to bug-wget

2003-11-17 Thread Dan Jacobson
>> And stop making me have to confirm each and every mail to this list.

Hrvoje> Currently the only way to avoid confirmations is to subscribe to the
Hrvoje> list.  I'll try to contact the list owners to see if the mechanism can
Hrvoje> be improved.

subscribe me with the "nomail" option, if it can't be fixed.

Often I come back from a long vacation, only to find my last reply
waiting for a confirmation that has probably expired.


Re: Wget Bug

2003-11-10 Thread Hrvoje Niksic
"Kempston" <[EMAIL PROTECTED]> writes:

> Yeah, I understand that, but lftp handles it fine even without
> specifying any additional option ;)

But then lftp is hammering servers when real unauthorized entry
occurs, no?

> I`m sure you can work something out

Well, I'm satisfied with what Wget does now.  :-)


Re: Wget Bug

2003-11-10 Thread Hrvoje Niksic
The problem is that the server replies with "login incorrect", which
normally means that authorization has failed and that further retries
would be pointless.  Other than having a natural language parser
built-in, Wget cannot know that the authorization is in fact correct,
but that the server happens to be busy.

Maybe Wget should have an option to retry even in the case of (what
looks like) a login incorrect FTP response.


Wget Bug

2003-11-10 Thread Kempston
Here is debug output

:/FTPD# wget ftp://ftp.dcn-asu.ru/pub/windows/update/winxp/xpsp2-1224.exe -d
DEBUG output created by Wget 1.8.1 on linux-gnu.

--13:25:55--  ftp://ftp.dcn-asu.ru/pub/windows/update/winxp/xpsp2-1224.exe
   => `xpsp2-1224.exe'
Resolving ftp.dcn-asu.ru... done.
Caching ftp.dcn-asu.ru => 212.192.20.40
Connecting to ftp.dcn-asu.ru[212.192.20.40]:21... connected.
Created socket 3.
Releasing 0x8073398 (new refcount 1).
Logging in as anonymous ... 220 news FTP server ready.

--> USER anonymous
331 Guest login ok, send your complete e-mail address as password.
--> PASS -wget@
530 Login incorrect.

Login incorrect.
Closing fd 3

Server reply is 

<--- 530-
<--- 530-Sorry! Too many users are logged in.
<--- 530-Try letter, please.
<--- 530-
<--- 530 Login incorrect.
 Server reply matched ftp:retry-530, retrying

But wget won't even try to retry :(
Can you fix that?

old patch to ".netrc quote parsing bug"

2003-10-25 Thread Noèl Köthe
Hello,

upgrading to 1.9 I found an old unapplied patch to fix a parsing problem
with .netrc.

"The cause of this behavior is wget's .netrc parser failing to reset
it's
quote flag after seeing a quote at the end of a token.  That caused problems
for lines with unquoted tokens following quoted ones."

This is the patch:

--- wget-1.8.1/src/netrc.c  Fri Nov 30 03:33:22 2001
+++ wget-fixed/src/netrc.c  Sun Feb 17 17:12:20 2002
@@ -313,9 +313,12 @@
p ++;
  }
 
- /* if field was quoted, squash the trailing quotation mark */
- if (quote)
+ /* if field was quoted, squash the trailing quotation mark
+and reset quote flag */
+ if (quote) {
shift_left(p);
+   quote = 0;
+ }
 
  /* Null-terminate the token, if it isn't already.  */
  if (*p)

this is the bugreport:

--8<--
I have a ~/.netrc with the following content:

machine www.somewhere.com login "foo" password bar

running wget always spams the following to my console:

wget: /home/bilbo/.netrc:1: unknown token "password bar
"
--8<--


-- 
Noèl Köthe 
Debian GNU/Linux, www.debian.org


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Re: Bug: Support of characters like '\', '?', '*', ':' in URLs

2003-10-21 Thread Hrvoje Niksic
"Frank Klemm" <[EMAIL PROTECTED]> writes:

> Wget don't work properly when the URL contains characters which are
> not allowed in file names on the file system which is currently
> used. These are often '\', '?', '*' and ':'.
>
> Affected are at least:
> - Windows and related OS
> - Linux when using FAT or Samba as file system
[...]

Thanks for the report.  This has been fixed in Wget 1.9-beta.  It
doesn't use characters that FAT can't handle by default, and if you
use a mounted FAT filesystem, you can tell Wget to assume behavior as
if it were under Windows.



Bug: Support of characters like '\', '?', '*', ':' in URLs

2003-10-21 Thread Frank Klemm
Wget doesn't work properly when the URL contains characters which are not
allowed in file names on the file system which is currently used. These are
often '\', '?', '*' and ':'.

Affected are at least:
- Windows and related OS
- Linux when using FAT or Samba as file system

A possibility to solve it:

On startup, try to create test files:

mkdir ("wget.tmpfile", 0700);
for (i = 0x20; i <= 0x7E; i++)
  {
    sprintf (name, "wget.tmpfile/%c", i);
    /* test to open for read / test to create: if that works,
       the character is available for file names */
    fp = fopen (name, "w");
    if (fp)
      fclose (fp);
    remove (name);
  }
rmdir ("wget.tmpfile");

Characters which are not available are then escaped as @3F or #3F or something like that.


= Example 
--14:51:11--  http://fan.theonering.net/rolozo/?view=collection
   => `fan.theonering.net/rolozo/index.html?view=collection'
Connecting to kdejenspi01.zeiss.de[10.2.39.56]:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
fan.theonering.net/rolozo/index.html?view=collection: Invalid argument

Cannot write to `fan.theonering.net/rolozo/index.html?view=collection'
(Invalid argument).
Removing fan.theonering.net/rolozo/index.html?view=collection since it
should be rejected.
unlink: Invalid argument
--14:51:12--  http://fan.theonering.net/rolozo/?view=quotes
   => `fan.theonering.net/rolozo/index.html?view=quotes'
Connecting to kdejenspi01.zeiss.de[10.2.39.56]:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
fan.theonering.net/rolozo/index.html?view=quotes: Invalid argument

Cannot write to `fan.theonering.net/rolozo/index.html?view=quotes' (Invalid
argument).
Removing fan.theonering.net/rolozo/index.html?view=quotes since it should
be rejected.
unlink: Invalid argument
--14:51:13--  http://fan.theonering.net/rolozo/?view=about
   => `fan.theonering.net/rolozo/index.html?view=about'
Connecting to kdejenspi01.zeiss.de[10.2.39.56]:8080... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/html]
fan.theonering.net/rolozo/index.html?view=about: Invalid argument


--
Frank Klemm
AIM Advanced Imaging Microscopy
phone: +49 (36 41) 64-27 21
fax: +49 (36 41) 64-31 44
mailto:[EMAIL PROTECTED]





RE: Wget 1.8.2 bug

2003-10-20 Thread Sergey Vasilevsky
Thanks for explain this reasons.

And I have anoter problem:
in .wgetrc I use
reject =
*.[zZ][iI][pP]*,*.[rR][aA][rR]*,*.[gG][iI][fF]*,*.[jJ][pP][gG]*,*.[Ee][xX][E
e]*,*[=]http*
accept =
*.yp*,*.pl*,*.dll*,*.nsf*,*.[hH][tT][mM]*,*.[pPsSjJ][hH][tT][mM]*,*.[pP][hH]
[pP]*,*.[jJ][sS][pP]*,*.[tT][xX][tT],*.[cC][gG][iI]*,*.[cC][sS][pP]*,*.[aA][
sS][pP]*,*[?]*

On the command line I add some more rules with '-R xxx' - I think they are
joined with the previous rules.
And I use a recursive download.

As a result I found *.zip and *.exe ... files!
What am I doing wrong?

> -Original Message-
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
> Sent: Friday, October 17, 2003 7:18 PM
> To: Tony Lewis
> Cc: Wget List
> Subject: Re: Wget 1.8.2 bug
>
>
> "Tony Lewis" <[EMAIL PROTECTED]> writes:
>
> > Hrvoje Niksic wrote:
> >
> >> Incidentally, Wget is not the only browser that has a problem with
> >> that.  For me, Mozilla is simply showing the source of
> >> <http://www.minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set>, because
> >> the returned content-type is text/plain.
> >
> > On the other hand, Internet Explorer will treat lots of content
> > types as HTML if the content starts with "".
>
> I know.  But so far noone has asked for this in Wget.
>
> > Perhaps we can add an option to wget so that it will look for an
> >  tag in plain text files?
>
> If more people clamor for the option, I suppose we could overload
> `--force-html' to perform such detection.
>



Re: Wget 1.8.2 bug

2003-10-17 Thread Hrvoje Niksic
"Tony Lewis" <[EMAIL PROTECTED]> writes:

> Hrvoje Niksic wrote:
>
>> Incidentally, Wget is not the only browser that has a problem with
>> that.  For me, Mozilla is simply showing the source of
>> , because
>> the returned content-type is text/plain.
>
> On the other hand, Internet Explorer will treat lots of content
> types as HTML if the content starts with "".

I know.  But so far no one has asked for this in Wget.

> Perhaps we can add an option to wget so that it will look for an
>  tag in plain text files?

If more people clamor for the option, I suppose we could overload
`--force-html' to perform such detection.


Re: Wget 1.8.2 bug

2003-10-17 Thread Tony Lewis
Hrvoje Niksic wrote:

> Incidentally, Wget is not the only browser that has a problem with
> that.  For me, Mozilla is simply showing the source of
> , because
> the returned content-type is text/plain.

On the other hand, Internet Explorer will treat lots of content types as
HTML if the content starts with "<html>".

To see for yourself, try these links:
http://www.exelana.com/test.cgi
http://www.exelana.com/test.cgi?text/plain
http://www.exelana.com/test.cgi?image/jpeg

Perhaps we can add an option to wget so that it will look for an <html> tag
in plain text files?
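
Something like this sniffing helper, perhaps (purely a sketch; the function
name and the 512-byte window are made up):

#include <ctype.h>
#include <string.h>

/* Report a text/plain body as HTML if an "<html" tag appears near the
   start of the document.  */
static int
looks_like_html (const char *body, size_t len)
{
  size_t i, limit = len < 512 ? len : 512;   /* only peek at the head */
  for (i = 0; i + 5 <= limit; i++)
    if (body[i] == '<'
        && tolower ((unsigned char) body[i + 1]) == 'h'
        && tolower ((unsigned char) body[i + 2]) == 't'
        && tolower ((unsigned char) body[i + 3]) == 'm'
        && tolower ((unsigned char) body[i + 4]) == 'l')
      return 1;
  return 0;
}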

Tony



Re: Wget 1.8.2 bug

2003-10-17 Thread Hrvoje Niksic
"??? ??" <[EMAIL PROTECTED]> writes:

>> I've seen pages that do that kind of redirections, but Wget seems
>> to follow them, for me.  Do you have an example I could try?
>>
> [EMAIL PROTECTED]:~/> /usr/local/bin/wget -U
> "All.by"  -np -r -N -nH --header="Accept-Charset: cp1251, windows-1251, win,
> x-cp1251, cp-1251" --referer=http://minskshop.by  -P /tmp/minskshop.by -D
> minskshop.by http://minskshop.by http://www.minskshop.by
[...]

The problem with these pages lies not in redirection, but in the fact
that the server returns them with the `text/plain' content-type
instead of `text/html', which Wget requires in order to treat a page
as HTML.

Observe:

> --13:05:47--  http://minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set
> Length: ignored [text/plain]
> --13:05:53--  http://minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set
> Length: ignored [text/plain]
> --13:05:59--  http://www.minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set
> Length: ignored [text/plain]
> --13:06:00--  http://www.minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set
> Length: ignored [text/plain]

Incidentally, Wget is not the only browser that has a problem with
that.  For me, Mozilla is simply showing the source of
<http://www.minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set>, because
the returned content-type is text/plain.


Re: bug in 1.8.2 with

2003-10-14 Thread Hrvoje Niksic
You're right -- that code was broken.  Thanks for the patch; I've now
applied it to CVS with the following ChangeLog entry:

2003-10-15  Philip Stadermann  <[EMAIL PROTECTED]>

* ftp.c (ftp_retrieve_glob): Correctly loop through the list whose
elements might have been deleted.




bug in 1.8.2 with

2003-10-14 Thread Noèl Köthe
Hello,

With this download you will get a segfault.

wget --passive-ftp --limit-rate 32k -r -nc -l 50 \
-X */binary-alpha,*/binary-powerpc,*/source,*/incoming \
-R alpha.deb,powerpc.deb,diff.gz,.dsc,.orig.tar.gz \
ftp://ftp.gwdg.de/pub/x11/kde/stable/3.1.4/Debian

Philip Stadermann <[EMAIL PROTECTED]> discovered this problem
and submitted the attached patch.
It's a problem with the linked list.

-- 
Noèl Köthe 
Debian GNU/Linux, www.debian.org
--- ftp.c.orig  2003-10-14 15:37:15.0 +0200
+++ ftp.c   2003-10-14 15:39:28.0 +0200
@@ -1670,22 +1670,21 @@
 static uerr_t
 ftp_retrieve_glob (struct url *u, ccon *con, int action)
 {
-  struct fileinfo *orig, *start;
+  struct fileinfo *start;
   uerr_t res;
   struct fileinfo *f;
  
 
   con->cmd |= LEAVE_PENDING;
 
-  res = ftp_get_listing (u, con, &orig);
+  res = ftp_get_listing (u, con, &start);
   if (res != RETROK)
 return res;
-  start = orig;
   /* First: weed out that do not conform the global rules given in
  opt.accepts and opt.rejects.  */
   if (opt.accepts || opt.rejects)
 {
-   f = orig;
+   f = start;
   while (f)
{
  if (f->type != FT_DIRECTORY && !acceptable (f->name))
@@ -1698,7 +1697,7 @@
}
 }
   /* Remove all files with possible harmful names */
-  f = orig;
+  f = start;
   while (f)
   {
  if (has_invalid_name(f->name))


signature.asc
Description: Dies ist ein digital signierter Nachrichtenteil


Re: Wget 1.8.2 bug

2003-10-14 Thread Hrvoje Niksic
"Sergey Vasilevsky" <[EMAIL PROTECTED]> writes:

> I use wget 1.8.2.  When I try recursive download site site.com where
> site.com/ first page redirect to site.com/xxx.html that have first
> link in the page to site.com/ then Wget download only xxx.html and
> stop.  Other links from xxx.html not followed!

I've seen pages that do that kind of redirections, but Wget seems to
follow them, for me.  Do you have an example I could try?


Wget 1.8.2 bug

2003-10-14 Thread Sergey Vasilevsky
I use wget 1.8.2.
When I try a recursive download of site.com, where the first page of
site.com/ redirects to site.com/xxx.html, which has a first link in
the page back to site.com/,
then Wget downloads only xxx.html and stops.
Other links from xxx.html are not followed!



Re: subtle bug? or opportunity of avoiding multiple nested directories

2003-10-10 Thread Hrvoje Niksic
Stephen Hewitt <[EMAIL PROTECTED]> writes:

> Attempting to mirror a particular web site, with wget 1.8.1, I got
> many nested directories like .../images/images/images/images etc.  For
> example, the log file ended like this:
[...]

Thanks for the detailed report and for taking the time to find the
problem.  I've seen similar problems, but have never had the
inclination to find the cause; `--mirror' is especially susceptible to
this because it implies `-l0'.

> The fundamental problem in this case is that if you ask for a page
> that does not exist from www.can-online.org.uk, the server does not
> respond correctly.  Instead of presenting the 404 not found, it
> serves up some default web page.

Right.  And if that same web page contains a link to a non-existent
image, an infinite loop ensues.

> Now at this point is the opportunity for wget to be more robust.
[...]
> So my suggestion is this.  If wget is following an <img src=...>
> address from a page, and instead of the expected [image/gif] (or
> jpeg or whatever) file type the server gives a [text/html], then
> wget should not follow any links that the text/html file contains.

I agree.  I've attached a patch against the current CVS code that
should fix this problem.  Could you please try it out and let me know
if it works for you?

> Perhaps you could even argue that it should report an error and not
> even save the html file,

That would be going too far.  If the <img src=...> URL returns an HTML
page, so be it.  Let the browsers cope with it as best they can; Wget
will simply not harvest the links from such a page.

2003-10-10  Hrvoje Niksic  <[EMAIL PROTECTED]>

* recur.c (retrieve_tree): Don't descend into documents that are
not expected to contain HTML, regardless of their content-type.

* html-url.c (tag_url_attributes): Record which attributes are
supposed to yield HTML links that can be followed.
(tag_find_urls): Propagate that information to the caller through
struct urlpos.

Index: src/convert.h
===
RCS file: /pack/anoncvs/wget/src/convert.h,v
retrieving revision 1.1
diff -u -r1.1 convert.h
--- src/convert.h   2003/09/21 22:47:13 1.1
+++ src/convert.h   2003/10/10 14:07:41
@@ -56,11 +56,11 @@
 
   /* Information about the original link: */
 
-  unsigned int link_relative_p :1; /* was the link relative? */
-  unsigned int link_complete_p :1; /* was the link complete (with the
-  host name, etc.) */
-  unsigned int link_base_p :1; /* was the link <base href=...> */
-  unsigned int link_inline_p   :1; /* needed to render the page. */
+  unsigned int link_relative_p :1; /* the link was relative */
+  unsigned int link_complete_p :1; /* the link was complete (had host name) */
+  unsigned int link_base_p :1; /* the url came from <base href=...> */
+  unsigned int link_inline_p   :1; /* needed to render the page */
+  unsigned int link_expect_html:1; /* expected to contain HTML */
 
   unsigned int link_refresh_p  :1; /* link was received from
                                       <meta http-equiv=refresh content=...> */
Index: src/html-url.c
===
RCS file: /pack/anoncvs/wget/src/html-url.c,v
retrieving revision 1.33
diff -u -r1.33 html-url.c
--- src/html-url.c  2003/10/10 02:46:09 1.33
+++ src/html-url.c  2003/10/10 14:07:42
@@ -121,11 +121,19 @@
 /* tag_url_attributes documents which attributes of which tags contain
URLs to harvest.  It is used by tag_find_urls.  */
 
-/* Defines for the FLAGS field; currently only one flag is defined. */
+/* Defines for the FLAGS. */
 
-/* This tag points to an external document not necessary for rendering this 
-   document (i.e. it's not an inlined image, stylesheet, etc.). */
-#define TUA_EXTERNAL 1
+/* The link is "inline", i.e. needs to be retrieved for this document
+   to be correctly rendered.  Inline links include inlined images,
+   stylesheets, children frames, etc.  */
+#define ATTR_INLINE1
+
+/* The link is expected to yield HTML contents.  It's important not to
+   try to follow HTML obtained by following e.g. <img src="...">
+   regardless of content-type.  Doing this causes infinite loops for
+   "images" that return non-404 error pages with links to the same
+   image.  */
+#define ATTR_HTML  2
 
 /* For tags handled by tag_find_urls: attributes that contain URLs to
download. */
@@ -134,26 +142,26 @@
   const char *attr_name;
   int flags;
 } tag_url_attributes[] = {
-  { TAG_A, "href", TUA_EXTERNAL },
-  { TAG_APPLET,"code", 0 },
-  { TAG_AREA,  "href", TUA_EXTERNAL },
-  { TAG_BGSOUND,   "src",  0 },
-  { TAG_BODY,  "background",   0 },
-  { TAG_EMBED, "href", TUA_EXTERNAL },
-  { TAG_EMBED, "src",  0 },
-  { TAG_FIG,   "src",  0 },
-  { TAG_FRAME, "src",  0 },
-  { TAG_IFRAME,"src", 

subtle bug? or opportunity of avoiding multiple nested directories

2003-10-09 Thread Stephen Hewitt
Attempting to mirror a particular web site, with wget 1.8.1, I got many
nested directories like .../images/images/images/images etc.  For example,
the log file ended like this:

--08:16:37--
http://www.can-online.org.uk/SE/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/spacer.gif
   =>
`www.can-online.org.uk/SE/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/images/spacer.gif'
Reusing connection to www.can-online.org.uk:80.
HTTP request sent, awaiting response... 414 URL Too Long
08:16:37 ERROR 414: URL Too Long.


FINISHED --08:16:37--
Downloaded: 23,505,670 bytes in 906 files

Similar problems have happened with enough of the websites that I have tried
to mirror to be worth investigating, so this time I have spent a
little time doing this. I'm posting this in case people are interested in
the reasons in this particular case.  I don't think the exact reasons
why these multiple nested directories occur are going to be identical on
every website though.

Firstly I should say that it is basically caused by the website rather
than the excellent wget, but maybe there are ways that wget could be made
even better.  Also it's quite clear that although I think the proposed fix
would work in this case, it won't solve all nested directory problems.

The fundamental problem in this case is that if you ask for a page that
does not exist from www.can-online.org.uk, the server does not respond
correctly.  Instead of presenting the 404 not found, it serves up some
default web page.

This explains why when wget asks for ridiculous things like
http://www.can-online.org.uk/SE/images/images/images/images/images/spacer.gif
it receives what appears to be a valid response.

However it does not explain why wget asks for things like this in the
first place.

The reason for that turns out to be that there is a mistake in some of the
html pages on the webserver.  For example, in
http://www.can-online.org.uk/contact/ one of the included images has been
written as <img src="images/spacer.gif"> instead of pointing at the
image's real location.
 instead of


As a result, wget tried to GET the (non-existent) image
http://www.can-online.org.uk/contact/images/spacer.gif

Now, because of the fundamental problem mentioned above, instead of
returning 404 not found, this web server instead serves up its default
html page, instead of an image.

Now at this point is the opportunity for wget to be more robust.

Because at present, it seems as though when wget sees the html page, it
forgets that it was expecting an image and it parses the html and tries to
follow all the links in the html.

(Note that what is expected must be an image because it is in an <img> construct.)

Now, in the case of www.can-online.org.uk the unfortunate fact is that the
default html page also contains the same mistake in some of the <img>
constructs, so the process repeats ad infinitum with wget trying to get
deeper and deeper levels of .../index/index/index/index

So my suggestion is this.  If wget is following an <img src=...> address
from a page, and instead of the expected [image/gif] (or jpeg or whatever)
file type the server gives a [text/html], then wget should not follow any
links that the text/html file contains.  Perhaps you could even argue that
it should report an error and not even save the html file, because as far
as I can see, it doesn't make any kind of sense to include a [text/html]
file where an image should be.

Of course this will not necessarily solve the nested directory problem
with other websites - because if the mistake in the html was in a link to
an html page rather than an image, for example, this fix wouldn't help -
but at least it would work on this one.

If you are interested in this, I have set up a very simple test case
(using 4 files, of a few lines each) on a webserver to demonstrate what
wget does, will be happy to post details.

Stephen Hewitt




RE: Bug in Windows binary?

2003-10-06 Thread Herold Heiko
> From: Gisle Vanem [mailto:[EMAIL PROTECTED]

> "Jens Rösner" <[EMAIL PROTECTED]> said:
> 
...
 
> I assume Heiko didn't notice it because he doesn't have that function
> in his kernel32.dll. Heiko and Hrvoje, will you correct this ASAP?
> 
> --gv

Probably.
Currently I'm compiling and testing on NT 4.0 only.
Besides that, I'm VERY tight on time at the moment, so testing usually means
"does it run? Does it download one sample http and one https site? Yes?
Put it up for testing!".

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax


Re: Bug in Windows binary?

2003-10-05 Thread Hrvoje Niksic
"Gisle Vanem" <[EMAIL PROTECTED]> writes:

> --- mswindows.c.org Mon Sep 29 11:46:06 2003
> +++ mswindows.c Sun Oct 05 17:34:48 2003
> @@ -306,7 +306,7 @@
>  DWORD set_sleep_mode (DWORD mode)
>  {
>HMODULE mod = LoadLibrary ("kernel32.dll");
> -  DWORD (*_SetThreadExecutionState) (DWORD) = NULL;
> +  DWORD (WINAPI *_SetThreadExecutionState) (DWORD) = NULL;
>DWORD rc = (DWORD)-1;
>
> I assume Heiko didn't notice it because he doesn't have that
> function in his kernel32.dll. Heiko and Hrvoje, will you correct
> this ASAP?

I've now applied the patch, thanks.  I use the following ChangeLog
entry:

2003-10-05  Gisle Vanem  <[EMAIL PROTECTED]>

* mswindows.c (set_sleep_mode): Fix type of
_SetThreadExecutionState.



Re: Bug in Windows binary?

2003-10-05 Thread Gisle Vanem
"Jens Rösner" <[EMAIL PROTECTED]> said:

> I downloaded
> wget 1.9 beta 2003/09/29 from Heiko
> http://xoomer.virgilio.it/hherold/
...
> wget -d http://www.google.com
> DEBUG output created by Wget 1.9-beta on Windows.
>
> set_sleep_mode(): mode 0x8001, rc 0x8000
>
> I disabled my wgetrc as well and the output was exactly the same.
>
> I then tested
> wget 1.9 beta 2003/09/18 (earlier build!)
> from the same place and it works smoothly.
>
> Can anyone reproduce this bug?

Yes, but the MSVC version crashed on my machine.  But I've found
the cause: it was caused by my recent change :(

A "simple" case of wrong calling-convention:

--- mswindows.c.org Mon Sep 29 11:46:06 2003
+++ mswindows.c Sun Oct 05 17:34:48 2003
@@ -306,7 +306,7 @@
 DWORD set_sleep_mode (DWORD mode)
 {
   HMODULE mod = LoadLibrary ("kernel32.dll");
-  DWORD (*_SetThreadExecutionState) (DWORD) = NULL;
+  DWORD (WINAPI *_SetThreadExecutionState) (DWORD) = NULL;
   DWORD rc = (DWORD)-1;

I assume Heiko didn't notice it because he doesn't have that function
in his kernel32.dll. Heiko and Hrvoje, will you correct this ASAP?

--gv




Bug in Windows binary?

2003-10-05 Thread Jens Rösner
Hi!

I downloaded 
wget 1.9 beta 2003/09/29 from Heiko
http://xoomer.virgilio.it/hherold/
along with the SSL binaries.
wget --help 
and 
wget --version 
will work, but 
any downloading like 
wget http://www.google.com
will immediately fail.
The debug output is very brief as well:

wget -d http://www.google.com
DEBUG output created by Wget 1.9-beta on Windows.

set_sleep_mode(): mode 0x8001, rc 0x8000

I disabled my wgetrc as well and the output was exactly the same.

I then tested 
wget 1.9 beta 2003/09/18 (earlier build!)
from the same place and it works smoothly.

Can anyone reproduce this bug?
System is Win2000, latest Service Pack installed.

Thanks for your assistance and sorry if I missed an 
earlier report of this bug, I know a lot has been done over the last weeks 
and I may have missed something.
Jens






Re: BUG in --timeout (exit status)

2003-10-02 Thread Manfred Schwarb
OK, I see.
But I do not agree.
And I don't think it is a good idea to treat the first download specially.

In my opinion, exit status 0 means "everything during the whole 
retrieval went OK".
My preferred solution would be to set the final exit status to the highest
exit status of all individual downloads. Of course, retries which are
triggered by "--tries" should erase the exit status of the previous attempt.
A non-zero exit status does not mean "nothing went OK" but "some individual
downloads failed somehow".
And setting a non-zero exit status does not mean wget has to stop
retrieval immediately; it is OK to continue.
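
For illustration only, a minimal C sketch of the proposed policy
(hypothetical names, not wget's actual status handling): each URL
contributes the status of its final attempt, and the process exits with
the highest status seen.

#include <stdio.h>

static int final_status = 0;   /* what the process would eventually exit with */

/* Record the result of one URL after all of its --tries attempts.
   Only the last attempt for that URL counts; earlier failed attempts
   that were later retried successfully are forgotten. */
static void
record_download (int last_attempt_status)
{
  if (last_attempt_status > final_status)
    final_status = last_attempt_status;
}

int
main (void)
{
  record_download (0);   /* first URL: succeeded */
  record_download (1);   /* second URL: failed even after retries */
  record_download (0);   /* third URL: succeeded */
  printf ("exit status would be %d\n", final_status);   /* prints 1 */
  return final_status;
}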

Again, wget's behaviour is not what the user expects.

And the user always has the possibility to make combinations of
--accept, --reject, --domains, etc. so that in normal cases all
individual downloads succeed, if he needs an exit status of 0.
If he does not care about exit status, there is no problem at all,
of course...


regards
Manfred


Zitat von Hrvoje Niksic <[EMAIL PROTECTED]>:

> This problem is not specific to timeouts, but to recursive download (-r).
> 
> When downloading recursively, Wget expects some of the specified
> downloads to fail and does not propagate that failure to the code that
> sets the exit status.  This unfortunately includes the first download,
> which should probably be an exception.
> 




This message was sent using IMP, the Internet Messaging Program.


Re: BUG in --timeout (exit status)

2003-10-02 Thread Hrvoje Niksic
This problem is not specific to timeouts, but to recursive download (-r).

When downloading recursively, Wget expects some of the specified
downloads to fail and does not propagate that failure to the code that
sets the exit status.  This unfortunately includes the first download,
which should probably be an exception.


BUG in --timeout (exit status)

2003-10-02 Thread Manfred Schwarb
Hi,

doing the following:
# /tmp/wget-1.9-beta3/src/wget -r --timeout=5 --tries=1
http://weather.cod.edu/digatmos/syn/
--11:33:16--  http://weather.cod.edu/digatmos/syn/
   => `weather.cod.edu/digatmos/syn/index.html'
Resolving weather.cod.edu... 192.203.136.228
Connecting to weather.cod.edu[192.203.136.228]:80... failed: Connection timed
out.
Giving up.


FINISHED --11:33:21--
Downloaded: 0 bytes in 0 files

# echo $?
0

If wget aborts because of a timeout (all --*-timeout options), it sets
an exit status of 0, which is not what users are expecting,
and which makes it very difficult to catch such aborts.

Using "--non-verbose" in this example, I get no indication at all that
something might have failed. An abort has nothing to do with being
verbose or not; it should always be reported in some way, IMHO.

Furthermore, the wget man and info pages should document the exit status;
I could not find any documentation about wget's exit status.


In contrast, curl does the right thing (non-zero exit status):
# curl -r --connect-timeout 5 http://weather.cod.edu/digatmos/syn/
curl: (7)
# echo $?
7


regards
Manfred



This message was sent using IMP, the Internet Messaging Program.


Re: dificulty with Debian wget bug 137989 patch

2003-09-30 Thread Hrvoje Niksic
"jayme" <[EMAIL PROTECTED]> writes:
[...]

Before anything else, note that the patch originally written for 1.8.2
will need change for 1.9.  The change is not hard to make, but it's
still needed.

The patch didn't make it to canonical sources because it assumes `long
long', which is not available on many platforms that Wget supports.
The issue will likely be addressed in 1.10.

Having said that:

> I tried the patch from Debian bug report 137989 and it didn't work. Can
> anybody explain:
> 1 - why I have to make two directories for the patch to work: one
> wget-1.8.2.orig and one wget-1.8.2 ?

You don't.  Just enter Wget's source and type `patch -p1 <PATCHFILE'.

> 2 - why after compilation the wget still can't download the file >
> 2GB ?

I suspect you've tried to apply the patch to Wget 1.9-beta, which
doesn't work, as explained above.



dificulty with Debian wget bug 137989 patch

2003-09-29 Thread jayme
I tried the patch from Debian bug report 137989 and it didn't work. Can anybody explain:
1 - why I have to make two directories for the patch to work: one wget-1.8.2.orig and one
wget-1.8.2 ?
2 - why after compilation wget still can't download the file > 2GB ?
note: I cut out the Debian-specific part of the patch (the first diff)
Thank you
Jayme 
[EMAIL PROTECTED]


Re: wget bug

2003-09-26 Thread Hrvoje Niksic
Jack Pavlovsky <[EMAIL PROTECTED]> writes:

> It's probably a bug: bug: when downloading wget -mirror
> ftp://somehost.org/somepath/3acv14~anivcd.mpg, wget saves it as-is,
> but when downloading wget ftp://somehost.org/somepath/3*, wget saves
> the files as 3acv14%7Eanivcd.mpg

Thanks for the report.  The problem here is that Wget tries to be
"helpful" by encoding unsafe characters in file names to %XX, as is
done in URLs.  Your first example works because of an oversight (!) 
that actually made Wget behave as you expected.

The good news is that the "helpfulness" has been rethought for the
next release and is no longer there, at least not for ordinary
characters like "~" and " ".  Try getting the latest CVS sources, they
should work better in this regard.  (http://wget.sunsite.dk/ explains
how to download the source from CVS.)


Re: wget bug

2003-09-26 Thread DervishD
Hi Jack :)

 * Jack Pavlovsky <[EMAIL PROTECTED]> dixit:
> It's probably a bug:
> bug: when downloading 
> wget -mirror ftp://somehost.org/somepath/3acv14~anivcd.mpg, 
>  wget saves it as-is, but when downloading
> wget ftp://somehost.org/somepath/3*, wget saves the files as 
> 3acv14%7Eanivcd.mpg

Yes, it *was* a bug. The latest prerelease has it fixed. Don't
know if the tarball has the latest patches; ask Hrvoje. But if you
are not in a hurry, just wait for 1.9 to be released.

> The human knowledge belongs to the world

True ;))

Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.pleyades.net & http://raul.pleyades.net/


wget bug

2003-09-26 Thread Jack Pavlovsky
It's probably a bug:
bug: when downloading 
wget -mirror ftp://somehost.org/somepath/3acv14~anivcd.mpg, 
 wget saves it as-is, but when downloading
wget ftp://somehost.org/somepath/3*, wget saves the files as 3acv14%7Eanivcd.mpg

--
The human knowledge belongs to the world


RE: bug maybe?

2003-09-23 Thread Matt Pease
how do I get off this list?   I tried a few times before & 
got no response from the server.

thank you-
Matt

> -Original Message-
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, September 23, 2003 8:53 PM
> To: Randy Paries
> Cc: [EMAIL PROTECTED]
> Subject: Re: bug maybe?
> 
> 
> "Randy Paries" <[EMAIL PROTECTED]> writes:
> 
> > Not sure if this is a bug or not.
> 
> I guess it could be called a bug, although it's no simple oversight.
> Wget currently doesn't support large files.
> 


Re: bug maybe?

2003-09-23 Thread Hrvoje Niksic
"Randy Paries" <[EMAIL PROTECTED]> writes:

> Not sure if this is a bug or not.

I guess it could be called a bug, although it's no simple oversight.
Wget currently doesn't support large files.



bug maybe?

2003-09-23 Thread Randy Paries
Not sure if this is a bug or not.
 
i can not get a file over 2GB (i get a MAX file Exceeded error message)
 
this is on a redhat 9 box. GNU Wget 1.8.2,
 
Thanks
Randy


Re: bug in wget 1.8.1/1.8.2

2003-09-16 Thread Hrvoje Niksic
Dieter Drossmann <[EMAIL PROTECTED]> writes:

> I use an extra file with a long list of http entries. I included this
> file with the -i option.  After 154 downloads I got an error
> message: Segmentation fault.
>
> With wget 1.7.1 everything works well.
>
> Is there a new limit of lines?

No, there's no built-in line limit, what you're seeing is a bug.

I cannot see anything wrong inspecting the code, so you'll have to
help by providing a gdb backtrace.  You can get it by doing this:

* Compile Wget with `-g' by running `make CFLAGS=-g' in its source
  directory (after configure, of course.)

* Go to the src/ directory and run that version of Wget the same way
  you normally run it, e.g. ./wget -i FILE.

* When Wget crashes, run `gdb wget core', type `bt' and mail us the
  resulting stack trace.

Thanks for the report.



bug in wget 1.8.1/1.8.2

2003-09-16 Thread Dieter Drossmann
Hello,

I use an extra file with a long list of http entries. I included this
file with the  -i option.
After 154 downloads I got an error message: Segmentation fault.

With wget 1.7.1 everything works well.

Is there a new limit of lines?

Regards,
Dieter Drossmann




Re: possible bug in exit status codes

2003-09-15 Thread Aaron S. Hawley
I can verify this in the cvs version.
it appears to be isolated to the recursive behavior.

/a

On Mon, 15 Sep 2003, Dawid Michalczyk wrote:

> Hello,
>
> I'm having problems getting the exit status code to work correctly in
> the following scenario. The exit code should be 1 yet it is 0


possible bug in exit status codes

2003-09-14 Thread Dawid Michalczyk
Hello,

I'm having problems getting the exit status code to work correctly in the
following scenario. The exit code should be 1 yet it is 0


[EMAIL PROTECTED]:~$ wget -d -t2 -r -l1 -T120 -nd -nH -R 
gif,zip,txt,exe,wmv,htmll,*[1-99]  www.cnn.com/foo.html
DEBUG output created by Wget 1.8.2 on linux-gnu.

Enqueuing http://www.cnn.com/foo.html at depth 0
Queue count 1, maxcount 1.
Dequeuing http://www.cnn.com/foo.html at depth 0
Queue count 0, maxcount 1.
--01:00:11--  http://www.cnn.com/foo.html
   => `foo.html'
Resolving www.cnn.com... done.
Caching www.cnn.com => 64.236.16.52 64.236.16.84 64.236.16.116 64.236.24.4 
64.236.24.12 64.236.24.20 64.236.24.28 64.236.16.20
Connecting to www.cnn.com[64.236.16.52]:80... connected.
Created socket 3.
Releasing 0x80809f8 (new refcount 1).
---request begin---
GET /foo.html HTTP/1.0
User-Agent: Wget/1.8.2
Host: www.cnn.com
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 404 Not found
Server: Netscape-Enterprise/6.1 AOL
Date: Mon, 15 Sep 2003 04:59:58 GMT
Content-type: text/html
Connection: close


Closing fd 3
01:00:11 ERROR 404: Not found.


FINISHED --01:00:11--
Downloaded: 0 bytes in 0 files
[EMAIL PROTECTED]:~$ echo $?
0
[EMAIL PROTECTED]:~$

Dawid Michalczyk


Re: bug in wget - wget break on time msec=0

2003-09-13 Thread Hrvoje Niksic
"Boehn, Gunnar von" <[EMAIL PROTECTED]> writes:

> I think I found a bug in wget.

You did.  But I believe your subject line is slightly incorrect.  Wget
handles 0 length time intervals (see the assert message), but what it
doesn't handle are negative amounts.  And indeed:

> gettimeofday({1063461157, 858103}, NULL) = 0
> gettimeofday({1063461157, 858783}, NULL) = 0
> gettimeofday({1063461157, 880833}, NULL) = 0
> gettimeofday({1063461157, 874729}, NULL) = 0

As you can see, the last gettimeofday returned time *preceding* the
one before it.  Your ntp daemon must have chosen that precise moment
to set back the system clock by ~6 milliseconds, to which Wget reacted
badly.

Even so, Wget shouldn't crash.  The correct fix is to disallow the
timer code from ever returning decreasing or negative time intervals.
Please let me know if this patch fixes the problem:


2003-09-14  Hrvoje Niksic  <[EMAIL PROTECTED]>

* utils.c (wtimer_sys_set): Extracted the code that sets the
current time here.
(wtimer_reset): Call it.
(wtimer_sys_diff): Extracted the code that calculates the
difference between two system times here.
(wtimer_elapsed): Call it.
(wtimer_elapsed): Don't return a value smaller than the previous
one, which could previously happen when system time is set back.
Instead, reset start time to current time and note the elapsed
offset for future calculations.  The returned times are now
guaranteed to be monotonically nondecreasing.

Index: src/utils.c
===
RCS file: /pack/anoncvs/wget/src/utils.c,v
retrieving revision 1.51
diff -u -r1.51 utils.c
--- src/utils.c 2002/05/18 02:16:25 1.51
+++ src/utils.c 2003/09/13 23:09:13
@@ -1532,19 +1532,30 @@
 # endif
 #endif /* not WINDOWS */
 
-struct wget_timer {
 #ifdef TIMER_GETTIMEOFDAY
-  long secs;
-  long usecs;
+typedef struct timeval wget_sys_time;
 #endif
 
 #ifdef TIMER_TIME
-  time_t secs;
+typedef time_t wget_sys_time;
 #endif
 
 #ifdef TIMER_WINDOWS
-  ULARGE_INTEGER wintime;
+typedef ULARGE_INTEGER wget_sys_time;
 #endif
+
+struct wget_timer {
+  /* The starting point in time which, subtracted from the current
+ time, yields elapsed time. */
+  wget_sys_time start;
+
+  /* The most recent elapsed time, calculated by wtimer_elapsed().
+ Measured in milliseconds.  */
+  long elapsed_last;
+
+  /* Approximately, the time elapsed between the true start of the
+ measurement and the time represented by START.  */
+  long elapsed_pre_start;
 };
 
 /* Allocate a timer.  It is not legal to do anything with a freshly
@@ -1577,22 +1588,17 @@
   xfree (wt);
 }
 
-/* Reset timer WT.  This establishes the starting point from which
-   wtimer_elapsed() will return the number of elapsed
-   milliseconds.  It is allowed to reset a previously used timer.  */
+/* Store system time to WST.  */
 
-void
-wtimer_reset (struct wget_timer *wt)
+static void
+wtimer_sys_set (wget_sys_time *wst)
 {
 #ifdef TIMER_GETTIMEOFDAY
-  struct timeval t;
-  gettimeofday (&t, NULL);
-  wt->secs  = t.tv_sec;
-  wt->usecs = t.tv_usec;
+  gettimeofday (wst, NULL);
 #endif
 
 #ifdef TIMER_TIME
-  wt->secs = time (NULL);
+  time (wst);
 #endif
 
 #ifdef TIMER_WINDOWS
@@ -1600,39 +1606,76 @@
   SYSTEMTIME st;
   GetSystemTime (&st);
   SystemTimeToFileTime (&st, &ft);
-  wt->wintime.HighPart = ft.dwHighDateTime;
-  wt->wintime.LowPart  = ft.dwLowDateTime;
+  wst->HighPart = ft.dwHighDateTime;
+  wst->LowPart  = ft.dwLowDateTime;
 #endif
 }
 
-/* Return the number of milliseconds elapsed since the timer was last
-   reset.  It is allowed to call this function more than once to get
-   increasingly higher elapsed values.  */
+/* Reset timer WT.  This establishes the starting point from which
+   wtimer_elapsed() will return the number of elapsed
+   milliseconds.  It is allowed to reset a previously used timer.  */
 
-long
-wtimer_elapsed (struct wget_timer *wt)
+void
+wtimer_reset (struct wget_timer *wt)
 {
+  /* Set the start time to the current time. */
+  wtimer_sys_set (&wt->start);
+  wt->elapsed_last = 0;
+  wt->elapsed_pre_start = 0;
+}
+
+static long
+wtimer_sys_diff (wget_sys_time *wst1, wget_sys_time *wst2)
+{
 #ifdef TIMER_GETTIMEOFDAY
-  struct timeval t;
-  gettimeofday (&t, NULL);
-  return (t.tv_sec - wt->secs) * 1000 + (t.tv_usec - wt->usecs) / 1000;
+  return ((wst1->tv_sec - wst2->tv_sec) * 1000
+ + (wst1->tv_usec - wst2->tv_usec) / 1000);
 #endif
 
 #ifdef TIMER_TIME
-  time_t now = time (NULL);
-  return 1000 * (now - wt->secs);
+  return 1000 * (*wst1 - *wst2);
 #endif
 
 #ifdef WINDOWS
-  FILETIME ft;
-  SYSTEMTIME st;
-  ULARGE_INTEGER uli;
-  GetSystemTime (&st);
-  SystemTimeToFileTime (&st, &ft);
-  uli.HighPart = ft.dwHighDateTime;
-  uli.LowPart = ft.dwLowDa
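
For illustration only, the idea behind the change can be shown with a
small self-contained C sketch (hypothetical names, not the actual
utils.c code): elapsed time is re-anchored when the clock steps
backwards, so the values returned never decrease.

#include <stdio.h>
#include <sys/time.h>

struct mono_timer {
  struct timeval start;     /* current reference point */
  long elapsed_last;        /* last value returned, in milliseconds */
  long elapsed_pre_start;   /* time accumulated before START, in milliseconds */
};

static void
mono_timer_reset (struct mono_timer *t)
{
  gettimeofday (&t->start, NULL);
  t->elapsed_last = 0;
  t->elapsed_pre_start = 0;
}

static long
mono_timer_elapsed (struct mono_timer *t)
{
  struct timeval now;
  long ms;

  gettimeofday (&now, NULL);
  ms = t->elapsed_pre_start
       + (now.tv_sec - t->start.tv_sec) * 1000
       + (now.tv_usec - t->start.tv_usec) / 1000;
  if (ms < t->elapsed_last)
    {
      /* The system clock was stepped backwards (e.g. by ntp).  Re-anchor
         the timer at the current time and carry the previously reported
         elapsed time forward, so returned values never decrease. */
      t->start = now;
      t->elapsed_pre_start = t->elapsed_last;
      ms = t->elapsed_last;
    }
  t->elapsed_last = ms;
  return ms;
}

int
main (void)
{
  struct mono_timer t;
  mono_timer_reset (&t);
  printf ("%ld ms elapsed\n", mono_timer_elapsed (&t));
  return 0;
}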

bug in wget - wget break on time msec=0

2003-09-13 Thread Boehn, Gunnar von
Hello,


I think I found a bug in wget.

My GNU wget version is 1.8.2
My system GNU/Debian unstable


I use wget to replay our apache logfiles to a 
test webserver to try different tuning parameters.


Wget fails to run through the logfile
and gives the error message "msecs >= 0 failed".

This is the command I run
#time wget -q -i replaylog -O /dev/null


Here is the output of strace
#time strace wget -q -i replaylog -O /dev/null

read(4, "HTTP/1.1 200 OK\r\nDate: Sat, 13 S"..., 4096) = 4096
write(3, "\377\330\377\340\0\20JFIF\0\1\1\1\0H\0H\0\0\377\354\0\21"...,
3792) = 3792
gettimeofday({1063461157, 858103}, NULL) = 0
select(5, [4], NULL, [4], {900, 0}) = 1 (in [4], left {900, 0})
read(4, "\377\0\344=\217\355V\\\232\363\16\221\255\336h\227\361"..., 1435) =
1435
write(3, "\377\0\344=\217\355V\\\232\363\16\221\255\336h\227\361"..., 1435)
= 1435
gettimeofday({1063461157, 858783}, NULL) = 0
time(NULL)  = 1063461157
access("390564.jpg?time=1060510404", F_OK) = -1 ENOENT (No such file or
directory)
time(NULL)  = 1063461157
select(5, [4], NULL, NULL, {0, 1})  = 0 (Timeout)
time(NULL)  = 1063461157
select(5, NULL, [4], [4], {900, 0}) = 1 (out [4], left {900, 0})
write(4, "GET /fotos/4/390564.jpg?time=106"..., 244) = 244
select(5, [4], NULL, [4], {900, 0}) = 1 (in [4], left {900, 0})
read(4, "HTTP/1.1 200 OK\r\nDate: Sat, 13 S"..., 4096) = 4096
write(3, "\377\330\377\340\0\20JFIF\0\1\1\1\0H\0H\0\0\377\333\0C"..., 3792)
= 3792
gettimeofday({1063461157, 880833}, NULL) = 0
select(5, [4], NULL, [4], {900, 0}) = 1 (in [4], left {900, 0})
read(4, "\343P\223\36T\4\203Rc\317\257J\4x\2165\303;o\211\256+\222"..., 817)
= 817
write(3, "\343P\223\36T\4\203Rc\317\257J\4x\2165\303;o\211\256+\222"...,
817) = 817
gettimeofday({1063461157, 874729}, NULL) = 0
time(NULL)  = 1063461157
write(2, "wget: retr.c:262: calc_rate: Ass"..., 60wget: retr.c:262:
calc_rate: Assertion `msecs >=
 0' failed.
) = 60
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
getpid()= 7106
kill(7106, SIGABRT) = 0
--- SIGABRT (Aborted) @ 0 (0) ---
+++ killed by SIGABRT +++


I hope that help.
Keep up the good work

Kind regards

Gunnar


Maybe a bug in wget?

2003-09-09 Thread n_fujikawa
Dear Sir;

 We are using wget-1.8.2 and it's very convenient for our routine
program.  By the way, now we have trouble with the return code
from wget when trying to use it with the -r option.  When wget with
the -r option fails on an ftp connection, wget returns a code 0.  If no -r
option, it returns a code 1.  We looked over the source and found
a suspicious line in ftp.c.



 +1699if ((opt.ftp_glob && wild) || opt.recursive ||
opt.timestamping)
 +1700  {
 +1701/* ftp_retrieve_glob is a catch-all function that gets
called
 +1702   if we need globbing, time-stamping or recursion.  Its
 +1703   third argument is just what we really need.  */
 +1704ftp_retrieve_glob (u, &con,
 +1705   (opt.ftp_glob && wild) ? GLOBALL :
GETONE);
 +1706  }
 +1707else
 +1708  res = ftp_loop_internal (u, NULL, &con);

We guess line 1704 should be the following line, in order to return the
error code back to the main function.

 +1704res = ftp_retrieve_glob (u, &con,
 +1705   (opt.ftp_glob && wild) ? GLOBALL :
GETONE);

Is this right?  If we change ftp.c in this way, will any other problems
occur?

Best Regards,
   Norihisa Fujikawa,
   Programming Section in Numerical Prediction
Division,
   Japan Meteorological Agency



Re: *** Workaround found ! *** (was: Hostname bug in wget ...)

2003-09-05 Thread Hrvoje Niksic
[EMAIL PROTECTED] writes:

> I found a workaround for the problem described below.
>
> Using option -nh does the job for me.
>
> As the subdomains mentioned below are on the same IP
> as the "main" domain wget seems not to compare their
> names but the IP only.

I believe newer versions of Wget don't do that anymore.  At the time
Wget was originally written, DNS-based virtual hosting was still in
its infancy.  Nowadays almost everyone does it, so what used to be
`-nh' became the default.

Either way, thanks for the report.


*** Workaround found ! *** (was: Hostname bug in wget ...)

2003-09-05 Thread webmaster
Hi,

I found a workaround for the problem described below.

Using option -nh does the job for me.

As the subdomains mentioned below are on the same IP
as the "main" domain wget seems not to compare their
names but the IP only.

If you need more info please let me know.
Have a nice weekend !

Regards
Klaus
--- Forwarded message follows ---
From:   [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Date sent:  Thu, 4 Sep 2003 12:53:39 +0200
Subject:    Hostname bug in wget ...
Priority:   normal

... or a silly sleepless webmaster !?

Hi,

Version
==
I use the GNU wget version 1.7 which is found on
OpenBSD Release 3.3 CD.
I use it on i386 architecture.


How to reproduce
==
wget -r coolibri.com
(adding the "span hosts" option did not improve)


Problem category
=
There seems to be a problem with prepending wrong hostnames.


Problem more detailed

Between fine GETs there are lots of 404s caused by prepending
wrong hostnames. That website consists of several parts
distributed on several "subdomains".

coolibri.com

cpu-kuehler.coolibri.com
luefter.coolibri.com
etc.


Example:
=
wget tries to get files that are located on "cpu-kuehler.coolibri.com"
but does not prepend "cpu-kuehler.coolibri.com" but "coolibri.com"
only.

Instead of (correct)
http://cpu-kuehler.coolibri.com/80_Kuehler_Grafik_Grafikkarte_/80_kuehler_grafik_grafikkarte_.html

it tries (incorrect)
http://coolibri.com/80_Kuehler_Grafik_Grafikkarte_/80_kuehler_grafik_grafikkarte_.html


Tried my best not to waste your time - but some lack
of sleep during last week was not really helpful ;-)

Best regards

Klaus

--- End of forwarded message ---


Hostname bug in wget ...

2003-09-04 Thread webmaster
... or a silly sleepless webmaster !?

Hi,

Version
==
I use the GNU wget version 1.7 which is found on
OpenBSD Release 3.3 CD.
I use it on i386 architecture.


How to reproduce
==
wget -r coolibri.com
(adding the "span hosts" option did not improve)


Problem category
=
There seems to be a problem with prepending wrong hostnames.


Problem more detailed

Between fine GETs there are lots of 404s caused by prepending
wrong hostnames. That website consists of several parts
distributed on several "subdomains".

coolibri.com

cpu-kuehler.coolibri.com
luefter.coolibri.com
etc.


Example:
=
wget tries to get files that are located on "cpu-kuehler.coolibri.com"
but does not prepend "cpu-kuehler.coolibri.com" but "coolibri.com" only.

Instead of (correct)
http://cpu-kuehler.coolibri.com/80_Kuehler_Grafik_Grafikkarte_/80_kuehler_grafik_grafikkarte_.html

it tries (incorrect)
http://coolibri.com/80_Kuehler_Grafik_Grafikkarte_/80_kuehler_grafik_grafikkarte_.html


Tried my best not to waste your time - but some lack
of sleep during last week was not really helpful ;-)

Best regards

Klaus



recursive & no-parent bug in 1.8.2

2003-09-01 Thread John Wilkes
I recently upgraded to wget 1.8.2 from an unknown earlier version.  In
doing recursive http retrievals, I have noticed inconsistent behavior.
If I specify a directory without the trailing slash in the url, the
"--no-parent" option is ignored, but if the trailing slash is present,
it works as expected.  Seems like others would have noticed this long
before I did; is it a known problem?  I did not have the problem with
the earlier version of wget I was using previously.

Examples:

This will retrieve the pf001014.shnf directory and its entire contents
including subdirectories:

wget --debug --output-file=test2.log -nH -r -np --cut-dirs=1 
http://www.gdlive.com/philshn/pf001014.shnf/

This will retrieve the pf001014.shnf directory, its entire contents,
and all parallel directories and their entire contents.  (And fill up
my partition before it finishes.)

wget --debug --output-file=test1.log -nH -r -np --cut-dirs=1 
http://www.gdlive.com/philshn/pf001014.shnf
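
A plausible reading of the behaviour, illustrated with a minimal C sketch
(hypothetical helper, not wget's actual --no-parent code): the allowed
prefix is taken to be everything up to and including the last '/', so
without the trailing slash the base becomes the parent directory and
parallel directories are no longer excluded.

#include <stdio.h>
#include <string.h>

/* Derive the base directory prefix used for the "no parent" test.
   Illustrative sketch only. */
static void
base_prefix (const char *url, char *out, size_t outsize)
{
  const char *slash = strrchr (url, '/');
  size_t n = slash ? (size_t) (slash - url) + 1 : strlen (url);
  if (n >= outsize)
    n = outsize - 1;
  memcpy (out, url, n);
  out[n] = '\0';
}

int
main (void)
{
  char buf[128];
  base_prefix ("http://www.gdlive.com/philshn/pf001014.shnf", buf, sizeof buf);
  printf ("%s\n", buf);   /* http://www.gdlive.com/philshn/ */
  base_prefix ("http://www.gdlive.com/philshn/pf001014.shnf/", buf, sizeof buf);
  printf ("%s\n", buf);   /* http://www.gdlive.com/philshn/pf001014.shnf/ */
  return 0;
}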


If this is a known problem, is there a patch I can apply?  Any help
would be greatly appreciated.

/jw

-- 
inet: [EMAIL PROTECTED]   | Fascism conceives of the State as an
addr: 321 High School Rd. NE  #367  | absolute, in comparison with which all
city: Bainbridge Island, Washington | individuals or groups are relative,
code: 98110-1697| only to be conceived of in their
icbm: 47 37 48 N / 122 29 52 W  | relation to the State.  - B. Mussolini



RE: Bug in total byte count for large downloads

2003-08-26 Thread Herold Heiko
Wget 1.5.3 is ancient.
You would be well advised to upgrade to the current stable version (1.8.2)
or better the latest development version (1.9beta), even if wget is currently
in development stasis due to lack of a maintainer.
You can find more information how to get the sources at
http://wget.sunsite.dk/
There are about 35 user visible changes mentioned in the "news" file after
1.5.3, so take a look at that before upgrading.
Heiko 

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED]
-- +39-041-5907073 ph
-- +39-041-5907472 fax

> -Original Message-
> From: Stefan Recksiegel 
> [mailto:[EMAIL PROTECTED]
> Sent: Monday, August 25, 2003 6:49 PM
> To: [EMAIL PROTECTED]
> Subject: Bug in total byte count for large downloads
> 
> 
> Hi,
> 
> this may be known, but
> 
> [EMAIL PROTECTED]:/scratch/suse82> wget --help
> GNU Wget 1.5.3, a non-interactive network retriever.
> 
> gave me
> 
> FINISHED --18:32:38--
> Downloaded: -1,713,241,830 bytes in 5879 files
> 
> while
> 
> [EMAIL PROTECTED]:/scratch/suse82> du -c
> 6762560 total
> 
> would be correct.
> 
> Best wishes,  Stefan
> 
> -- 
> 
> * Stefan Recksiegelstefan AT recksiegel.de *
> * Physikdepartment T31 office +49-89-289-14612 *
> * Technische Universität München home +49-89-9547 4277 *
> * D-85747 Garching, Germanymobile +49-179-750 2854 *
> 
> 
> 


Bug in total byte count for large downloads

2003-08-25 Thread Stefan Recksiegel
Hi,

this may be known, but

[EMAIL PROTECTED]:/scratch/suse82> wget --help
GNU Wget 1.5.3, a non-interactive network retriever.
gave me

FINISHED --18:32:38--
Downloaded: -1,713,241,830 bytes in 5879 files
while

[EMAIL PROTECTED]:/scratch/suse82> du -c
6762560 total
would be correct.
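
To illustrate the negative total above, a minimal C sketch (not wget
code): summing more than 2 GiB of byte counts in a signed 32-bit counter
wraps around on typical two's-complement machines, so the FINISHED line
shows a large negative number.

#include <stdio.h>
#include <stdint.h>

int
main (void)
{
  long long true_total = 6762560LL * 1024;   /* ~6.9 GB, as reported by du -c */
  /* Keep only the low 32 bits, then reinterpret as signed, which is what
     an overflowing 32-bit byte counter effectively does. */
  int32_t wrapped = (int32_t) (uint32_t) true_total;

  printf ("actual bytes:  %lld\n", true_total);
  printf ("32-bit total:  %d\n", (int) wrapped);   /* negative */
  return 0;
}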

Best wishes,  Stefan

--

* Stefan Recksiegelstefan AT recksiegel.de *
* Physikdepartment T31 office +49-89-289-14612 *
* Technische Universität München home +49-89-9547 4277 *
* D-85747 Garching, Germanymobile +49-179-750 2854 *




WGET 1.9 bug? it doesn't happen in 1.8.2!!!!!

2003-08-19 Thread jmsbc

   Well, I had replaced 1.8.2 with 1.9 because of the timeout fix, which was nice.  Now
I've come across a problem that does not occur in 1.8.2.  If I give the command...

./wget -T 15 -r -l 15 -D edu http://www.psu.edu

In wget 1.9, it will download stuff for a couple of seconds, and then stop at this
statement...

Loading robots.txt; please ignore errors.
--22:07:35--  https://portal.psu.edu/robots.txt
   => `/har/tmp1/jmsbc/Wgetoutput2/portal.psu.edu/robots.txt'
Connecting to portal.psu.edu[128.118.2.78]:443... connected.
HTTP request sent, awaiting response... 200 OK

And it will sit there for about 15 minutes.  When I ran the same command in 1.8.2 it
ran fine.  Can anyone help please?  Thanks a lot!



bug: no check accept domain when server redirect

2003-08-14 Thread Василевский Сергей
I use wget 1.8.2:
-r -nH -P /usr/file/somehost.com somehost.com http://somehost.com
Bug description:
If some script http://somehost.com/cgi-bin/rd.cgi returns an http header with
status 302 and redirects to http://anotherhost.com, then
the first page, http://anotherhost.com/index.html, is accepted and saved to a
file that replaces http://somehost.com/index.html in the dir
/usr/file/somehost.com/.
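
For illustration only, a minimal C sketch (hypothetical helper, not
wget's actual code) of the missing check being described: after a 302
redirect, the new host should still be matched against the accepted
domain before its pages are saved.

#include <stdio.h>
#include <string.h>

/* Return 1 if HOST is DOMAIN itself or a subdomain of it.
   Illustrative sketch only. */
static int
host_in_domain (const char *host, const char *domain)
{
  size_t hl = strlen (host), dl = strlen (domain);
  if (hl < dl)
    return 0;
  if (strcmp (host + (hl - dl), domain) != 0)
    return 0;
  /* match "somehost.com" itself or any "*.somehost.com" */
  return hl == dl || host[hl - dl - 1] == '.';
}

int
main (void)
{
  printf ("%d\n", host_in_domain ("somehost.com", "somehost.com"));      /* 1 */
  printf ("%d\n", host_in_domain ("www.somehost.com", "somehost.com"));  /* 1 */
  printf ("%d\n", host_in_domain ("anotherhost.com", "somehost.com"));   /* 0 */
  return 0;
}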




Re: bug in --spider option

2003-08-14 Thread Aaron S. Hawley
On Mon, 11 Aug 2003, dEth wrote:

> Hi everyone!
>
> I'm using wget to check if some files are downloadable; I also use it to
> determine the size of the file. Yesterday I noticed that wget
> ignores the --spider option for ftp addresses.
> It should have shown me the filesize and other parameters, but it began to
> download the file :( That's too bad. Can anyone fix it? My only idea
> was to shorten the download time using supported options, so that the
> downloading would be aborted. That's a user's solution; now a
> programmer's one is needed.

http://www.google.com/search?q=wget+spider+ftp

> The other problem is that wget doesn't correctly replace hrefs in
> downloaded pages (it uses the hostname of the local machine to replace
> the remote hostname, and there's no feature to give any other base-url;
> the -B option is for another purpose). If anyone is interested, I can
> describe the problem in more detail. If it won't be fixed, I'll write
> a perl script to replace base urls after wget downloads the pages I need,
> but that's not the best way.

would this option help?:

GNU Wget 1.8.1, a non-interactive network retriever.
Usage: wget [OPTION]... [URL]...
..
  -k,  --convert-links  convert non-relative links to relative.
..


Bug when continuing download and requesting non-existent file over proxy

2003-08-14 Thread Kilian Hagemann
Hi there,

I'm pretty sure that I found a bug in the latest (at least according to the 
FreeBSD ports tree) version of wget.

It occurs upon continuing a partially retrieved file using a proxy. I set my 
ftp_proxy environment variable appropriately.

The FreeBSD ports mechanism, in case you're not familiar with it, checks a 
whole bunch of sites for availability of the zipped source tarball when 
installing a new port. Sometimes downloads of these fail, so it tries again
because I told it to use wget instead of fetch.

However, eventually the ports system issues a command similar to the
following. I actually ran this separately to confirm (and verified that there
was a partial file with that name in the current directory):

wget -c -t 5 
ftp://ftp.cs.uct.ac.za/pub/FreeBSD/distfiles/qt-x11-free-3.1.2.tar.bz2
--21:11:01--  
ftp://ftp.cs.uct.ac.za/pub/FreeBSD/distfiles/qt-x11-free-3.1.2.tar
.bz2
   => `qt-x11-free-3.1.2.tar.bz2'
Resolving cache.uct.ac.za... done.
Connecting to cache.uct.ac.za[137.158.128.107]:8080... connected.
Proxy request sent, awaiting response... 404 Not Found
The file is already fully retrieved; nothing to do.

Now the file requested is not present on the server, but wget seems to 
misinterpret the proxy server's response and then thinks the file download is 
complete.

When I try the above command with ftp_proxy variable set to "" and wget 
doesn't use a proxy, it correctly goes into the right directory and finds out 
the file does not exist and exits. Also, when I delete the partial file and 
repeat the same command, it works as it should.

The proxy cache.uct.ac.za runs squid/2.5.STABLE3 in case you need to know. It 
might well be a bug in squid... :-(

Let me know if you need more info.

Kilian



Bug, feature or my fault?

2003-08-14 Thread DervishD
Hi all :))

After asking in the wget list (with no success), and after having
a look at the sources (a *little* look), I think that this is a bug,
so I've decided to report here.
 
Let's get to the matter: when I download, through FTP, some
hierarchy, the spaces are translated as '%20'. Nothing wrong with
that, of course; it is a safe behaviour. In fact, when files with
spaces in their names are saved, the %20 is translated back to a real
space. But that's not the case for directories, which are created with
%20 treated as three characters, literally the sequence "%20". An
example, server side:

A directory/With a file in it.txt

while downloading we can see:

A%20directory

and

With%20a%20file%20in%20it.txt

and locally they are stored as:

A%20directory/With a file in it.txt

Moreover, if the entire file name is passed to wget in the URL,
the entire filename has the annoying '%20' instead of unquoted
spaces. It seems that only the part that wget 'discovers' when
crawling is unquoted.

This looks like a bug to me... For the details, I'm using wget
version 1.8.2, and the only options passed are '-c', '-r' and '-nH'
(and, of course, the URL ;))), and obviously the wgetrc settings... I
don't think that any option is affecting this. The complete command
line is:

    $ wget -c -r -nH "ftp://user:[EMAIL PROTECTED]/A Directory/*"
 
Any help? Is this a bug? If you want to test I can provide you
with a test account in the server that is causing the problem (I
don't know of any other site that has directories and filenames with
spaces or other quoteable chars), any output from 'wget -d', my
global wgetrc or any other thing you need.

The bug seems to be in the handling of the struct url that is
passed to the ftp module. It seems that the unquoted members (dir and
file) are not used to form the filename that will be stored on disk
:??? Currently I have not much spare time to look for the bug, but if
I can test something you can count on it.
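
For reference, a minimal C sketch (hypothetical helper, not wget's url.c)
of the %XX unquoting under discussion, which evidently happens for file
names but not for the directory components:

#include <stdio.h>
#include <ctype.h>

/* Decode %XX escapes back to literal characters,
   e.g. "A%20directory" -> "A directory".  Illustrative sketch only. */
static void
percent_decode (const char *in, char *out)
{
  while (*in)
    {
      if (in[0] == '%' && isxdigit ((unsigned char) in[1])
          && isxdigit ((unsigned char) in[2]))
        {
          int hi = isdigit ((unsigned char) in[1]) ? in[1] - '0'
                   : tolower ((unsigned char) in[1]) - 'a' + 10;
          int lo = isdigit ((unsigned char) in[2]) ? in[2] - '0'
                   : tolower ((unsigned char) in[2]) - 'a' + 10;
          *out++ = (char) (hi * 16 + lo);
          in += 3;
        }
      else
        *out++ = *in++;
    }
  *out = '\0';
}

int
main (void)
{
  char buf[64];
  percent_decode ("A%20directory/With%20a%20file%20in%20it.txt", buf);
  printf ("%s\n", buf);   /* A directory/With a file in it.txt */
  return 0;
}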

Thanks a lot in advance :)

Raúl Núñez de Arenas Coronado

-- 
Linux Registered User 88736
http://www.pleyades.net & http://raul.pleyades.net/


Wget 1.8.2 timestamping bug

2003-08-11 Thread Angelo Archie Amoruso
Hi All,
I'm using Wget 1.8.2 on a Redhat 9.0 box equipped with 
Athlon 550 MHz cpu, 128 MB Ram.

I've encountered a strange issue, which really seems like a bug, using the
timestamping option.

I'm trying to retrieve the http://www.nic.it/index.html page.
The HEAD HTTP method returns that page is 2474 bytes long
and Last Modified on Wed, 30 Oct 2002.

Using wget I retrieve it (using -N) on /tmp and I get :

(ls -l --time-style=long)

-rw-r--r--1 root root 2474 2002-10-30 15:53 index.html


Then running again wget with -N I get 
"The sizes do not match (local 91941) -- retrieving"

And on /tmp I get again:
-rw-r--r--1 root root 2474 2002-10-30 15:53 index.html

What's happening? Does Wget check the file creation time, which
is obviously:

 -rw-r--r--1 root root 2474 2003-08-05 12:28 index.html



Thanks for your time and cooperation.
Please reply by email.

Below you'll find actual output

= HEAD ==
Trying 193.205.245.10...
Connected to www.nic.it.
Escape character is '^]'.
GET /index.html HTTP/1.0

HTTP/1.1 200 OK
Date: Tue, 05 Aug 2003 10:11:26 GMT
Server: Apache/2.0.45 (Unix) mod_ssl/2.0.45 OpenSSL/0.9.7a
Last-Modified: Wed, 30 Oct 2002 14:53:58 GMT
ETag: "2bc04-9aa-225d2d80"
Accept-Ranges: bytes
Content-Length: 2474
Connection: close
Content-Type: text/html; charset=ISO-8859-1

I run wget with the following parameters:
  wget -N -O /tmp/index.html


== GET OUTPUT 
[EMAIL PROTECTED] celldataweb]# wget -N -O /tmp/index.html 
http://www.nic.it/index.html

--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Resolving www.nic.it... done.
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]
The sizes do not match (local 91941) -- retrieving.
--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]

100%[>] 2,474 38.35K/sETA 
00:00

12:18:32-rw-r--r--1 root root 2474 Oct 30  2002 index.html 
(38.35 KB/s) - 

On /tmp:

-rw-r--r--1 root root 2474 Oct 30  2002 index.html


When I try again:

[EMAIL PROTECTED] celldataweb]# wget -N -O /tmp/index.html 
http://www.nic.it/index.html


=== SECOND GET OUTPUT ===
--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Resolving www.nic.it... done.
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]
The sizes do not match (local 91941) -- retrieving.
--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]

100%[>] 2,474 38.35K/sETA 
00:00

12:18:32 (38.35 KB/s) - `/tmp/index.html' saved [2474/2474]

But on /tmp:

-rw-r--r--1 root root 2474 Oct 30  2002 index.html

What is happening?

-- 
To The Kernel And Beyond!


bug in --spider option

2003-08-11 Thread dEth
Hi everyone!

I'm using wget to check if some files are downloadable, I also use to
determine the size of the file. Yesterday I noticed that wget
ignores --spider option for ftp addresses.
It had to show me the filesize and other parameters, but it began to
download the file :( That's too bad. Can anyone fix it? My only idea
was to shorten the time of work using supported options, so that the
downloading would be aborted. That's a user's solution, now a
programmer's one is needed.

The other problem is that wget dowesn't correctly replace hrefs in
downloaded pages (it uses hostname of the local machine to replace
remote hostname and there's no feature to give any other base-url the
-B option is for another purpose.) If anyone is interested, I can
describe the problem more detailed. If it won't be fixed, I'll write
a perl script to replace base urls after wget downloads pages I need,
but that's not the best way).
-- 
Best regards,
 dEth  mailto:[EMAIL PROTECTED]

PS Is there any list of known bugs?



Wget 1.8.2 timestamping bug

2003-08-10 Thread Angelo Archie Amoruso
Hi All,
I'm using Wget 1.8.2 on a Redhat 9.0 box equipped with
Athlon 550 MHz cpu, 128 MB Ram.

I've encountered a strange issue, which really seems like a bug, using the
timestamping option.

I'm trying to retrieve the http://www.nic.it/index.html page.
The HEAD HTTP method returns that page is 2474 bytes long
and Last Modified on Wed, 30 Oct 2002.

[...]

The issue seems related to the "-O" option.
In fact, retrieving the index.html page by running twice

wget -N http://www.nic.it/index.html 

the index.html page is created only once (in the current directory),

whereas running twice:

wget -N -O index.html http://www.nic.it/index.html 

wget always complains about "size mismatch, local 0" and
retrieves the page both times!!!


--
To The Kernel And Beyond!



Re: Bug, feature or my fault?

2003-08-08 Thread Aaron S. Hawley
On Wed, 6 Aug 2003, DervishD wrote:

> Hi all :))
>
> After asking in the wget list (with no success), and after having
> a look at the sources (a *little* look), I think that this is a bug,
> so I've decided to report here.

note, the bug and the help lists are currently the same list.

[snip]


RE: Wget 1.8.2 timestamping bug

2003-08-06 Thread Post, Mark K
Angelo,

It works for me:
# wget -N http://www.nic.it/index.html
--13:04:39--  http://www.nic.it/index.html
   => `index.html'
Resolving www.nic.it... done.
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]

100%[>] 2,474
142.12K/sETA 00:00

13:04:44 (142.12 KB/s) - `index.html' saved [2474/2474]

[EMAIL PROTECTED]:/tmp# wget -N http://www.nic.it/index.html
--13:04:49--  http://www.nic.it/index.html
   => `index.html'
Resolving www.nic.it... done.
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]
Server file no newer than local file `index.html' -- not retrieving.

[EMAIL PROTECTED]:/tmp# wget -V
GNU Wget 1.8.2


Are you perhaps behind a firewall?  At my work location, I frequently run
into cases where the firewall does not correctly pass date and timestamp
information back to wget.


Mark Post

-Original Message-
From: Angelo Archie Amoruso [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 05, 2003 6:36 AM
To: [EMAIL PROTECTED]
Subject: Wget 1.8.2 timestamping bug


Hi All,
I'm using Wget 1.8.2 on a Redhat 9.0 box equipped with 
Athlon 550 MHz cpu, 128 MB Ram.

I've encountered a strange issue, which really seems like a bug, using the
timestamping option.

I'm trying to retrieve the http://www.nic.it/index.html page.
The HEAD HTTP method returns that page is 2474 bytes long
and Last Modified on Wed, 30 Oct 2002.

Using wget I retrieve it (using -N) on /tmp and I get :

(ls -l --time-style=long)

-rw-r--r--1 root root 2474 2002-10-30 15:53 index.html


Then running again wget with -N I get 
"The sizes do not match (local 91941) -- retrieving"

And on /tmp I get again:
-rw-r--r--1 root root 2474 2002-10-30 15:53 index.html

What's happening? Does Wget check the file creation time, which
is obviously:

 -rw-r--r--1 root root 2474 2003-08-05 12:28 index.html



Thanks for your time and cooperation.
Please reply by email.

Below you'll find actual output

= HEAD ==
Trying 193.205.245.10...
Connected to www.nic.it.
Escape character is '^]'.
GET /index.html HTTP/1.0

HTTP/1.1 200 OK
Date: Tue, 05 Aug 2003 10:11:26 GMT
Server: Apache/2.0.45 (Unix) mod_ssl/2.0.45 OpenSSL/0.9.7a
Last-Modified: Wed, 30 Oct 2002 14:53:58 GMT
ETag: "2bc04-9aa-225d2d80"
Accept-Ranges: bytes
Content-Length: 2474
Connection: close
Content-Type: text/html; charset=ISO-8859-1

I run wget with the following parameters:
  wget -N -O /tmp/index.html


== GET OUTPUT 
[EMAIL PROTECTED] celldataweb]# wget -N -O /tmp/index.html 
http://www.nic.it/index.html

--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Resolving www.nic.it... done.
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]
The sizes do not match (local 91941) -- retrieving.
--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]

100%[>] 2,474 38.35K/sETA 
00:00

12:18:32-rw-r--r--1 root root 2474 Oct 30  2002 index.html 
(38.35 KB/s) - 

On /tmp:

-rw-r--r--1 root root 2474 Oct 30  2002 index.html


When I try again:

[EMAIL PROTECTED] celldataweb]# wget -N -O /tmp/index.html 
http://www.nic.it/index.html


=== SECOND GET OUTPUT ===
--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Resolving www.nic.it... done.
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]
The sizes do not match (local 91941) -- retrieving.
--12:18:31--  http://www.nic.it/index.html
   => `/tmp/index.html'
Connecting to www.nic.it[193.205.245.10]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2,474 [text/html]

100%[>] 2,474 38.35K/sETA 
00:00

12:18:32 (38.35 KB/s) - `/tmp/index.html' saved [2474/2474]

But on /tmp:

-rw-r--r--1 root root 2474 Oct 30  2002 index.html

What is happening?

-- 
To The Kernel And Beyond!


Timeout bug (1.8.2)

2003-08-03 Thread Andrey Sergeev
wget -gON -t3 -N -w60 -T10 -c --passive-ftp ftp://[EMAIL PROTECTED]/lastday/*.*

Wget: BUG: unknown command `timeout', value `10'

Sometimes wget can fall asleep; it would be nice to have a normal timeout.


Re: wget bug: mirror doesn't delete files deleted at the source

2003-08-01 Thread Aaron S. Hawley
On Fri, 1 Aug 2003, Mordechai T. Abzug wrote:

> I'd like to use wget in mirror mode, but I notice that it doesn't
> delete files that have been deleted at the source site.  Ie.:
>
>   First run: the source site contains "foo" and "bar", so the mirror now
>   contains "foo" and "bar".
>
>   Before second run: the source site deletes "bar" and replaces it with
>   "ook", and the mirror is run again.
>
>   After second run: the mirror now contains "foo", "bar", and "ook".
>
> This is not usually the way that mirrors work; wget should delete
> "bar" if it's not at the site.

I don't disagree with your definition of "mirrors", but in Unix (and in GNU)
it's usually customary not to delete files without user permission.

http://www.google.com/search?q=wget+archives+delete+mirror+site%3Ageocrawler.com


wget bug: mirror doesn't delete files deleted at the source

2003-07-31 Thread Mordechai T. Abzug

I'd like to use wget in mirror mode, but I notice that it doesn't
delete files that have been deleted at the source site.  Ie.:

  First run: the source site contains "foo" and "bar", so the mirror now
  contains "foo" and "bar".

  Before second run: the source site deletes "bar" and replaces it with
  "ook", and the mirror is run again.

  After second run: the mirror now contains "foo", "bar", and "ook".

This is not usually the way that mirrors work; wget should delete
"bar" if it's not at the site.

- Morty


wget for win32; small bug

2003-06-21 Thread Mark
Although this is a Windows bug, it affects this program.

When leeching files with a name like "prn" or "com1", e.g. prn.html,
wget will freeze up because Windows will not allow it to save a
file with that name.

A possible solution: saving the file as "prn_.html", for example as sketched below.
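
For illustration only, a minimal C sketch of that suggested workaround
(hypothetical helper, not wget code; on Windows one would use _stricmp
rather than strcasecmp):

#include <stdio.h>
#include <string.h>
#include <strings.h>   /* strcasecmp */

/* Reserved DOS device names that Windows refuses as file names. */
static const char *reserved[] = {
  "con", "prn", "aux", "nul",
  "com1", "com2", "com3", "com4",
  "lpt1", "lpt2", "lpt3", 0
};

/* If the base name (before the extension) is a reserved device name,
   append an underscore so Windows will accept it.  Sketch only. */
static void
safe_name (const char *name, char *out, size_t outsize)
{
  char base[32];
  const char *dot = strchr (name, '.');
  size_t baselen = dot ? (size_t) (dot - name) : strlen (name);
  int i;

  if (baselen < sizeof base)
    {
      memcpy (base, name, baselen);
      base[baselen] = '\0';
      for (i = 0; reserved[i]; i++)
        if (strcasecmp (base, reserved[i]) == 0)
          {
            /* "prn.html" -> "prn_.html" */
            snprintf (out, outsize, "%.*s_%s", (int) baselen, name,
                      dot ? dot : "");
            return;
          }
    }
  snprintf (out, outsize, "%s", name);
}

int
main (void)
{
  char buf[64];
  safe_name ("prn.html", buf, sizeof buf);
  printf ("%s\n", buf);   /* prn_.html */
  return 0;
}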

just a suggestion.

-pionig
