RE: Bug in GNU Wget 1.x (Win32)

2006-06-22 Thread Herold Heiko
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] Behalf Of Þröstur
> Sent: Wednesday, June 21, 2006 4:35 PM

There have been some reports of this in the past, but I don't think they have
been acted upon. One of the problems is that the list of reserved names can be
extended at will (besides the standard COMx, LPTx, CON, PRN). Maybe it is
possible to query the OS for the currently active device names and rename the
output files if necessary?

>   I reproduced the bug with Win32 versions 1.5.x (don't remember the
> exact one), 1.10.1 and 1.10.2. I also tested version 1.6 on Linux but it
> was not affected.

That is because the problem is caused by the DOS/Windows filesystem drivers
(or whatever those should be called): com1 and so on are basically the
equivalent of Unix device files, with the unfortunate difference that they act
in every directory.

> 
> Example URLs that reproduce the bug :
> wget g/nul
> wget http://www.gnu.org/nul
> wget http://www.gnu.org/nul.html
> wget -o loop.end "http://www.gnu.org/nul.html"
> 
>   I know that the bug is associated with names which are
> devices in the Windows console, but I don't understand
> why, since I tried to set the output file to something else.

I think you meant to use -O (output document), not -o (log file).
That doesn't solve the real problem, but it is at least a workaround.
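For example, something along these lines should sidestep the reserved name
(the output file name here is just an arbitrary, non-reserved choice):

  wget -O nul-page.html http://www.gnu.org/nul.html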

Heiko 

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax


Problem when timeout

2006-06-22 Thread Oliver Schulze L.

Hi,
I'm having a problem while downloading from a Microsoft FTP server.

The problem is that the connection times out and is closed while downloading;
wget then retries the download, but it receives a "file not found" error.


Is this a problem with the MS server or with wget?
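For reference, a manual attempt to resume the partial file would look roughly
like this sketch (the URL below is only a placeholder, not the real path):

  wget -c --tries=3 'ftp://user:pass@ftp.example.com/path/to/File01.ppt'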

Here is a log of the error. Thanks,
Oliver
-
--03:11:51--  
ftp://user1:[EMAIL PROTECTED]//long/path/to/file/File01%20de%205%20horas%20%2056%20-%2083.ppt
  => `ftp.example.com/long/path/to/file/File01 de 5 horas  56 - 
83.ppt'

==> CWD /long/path/to/file ... done.
==> PASV ... done.==> REST 2679100 ... done.
==> RETR File01 de 5 horas  56 - 83.ppt ... done.
Length: 51,856,384 (49M), 49,177,284 (47M) remaining

  [ skipping 2600K ]
2600K ,, ,, .. .. ..  5%   
16.95 KB/s
2650K .. .. .. .. ..  5%   
54.81 KB/s
2700K .. .. .. .. ..  5%   
64.26 KB/s
2750K .. .. .. .. ..  5%   
57.79 KB/s
2800K .. .. .. .. ..  5%   
65.73 KB/s

2850K .. .
...
11950K .. .. .. .. .. 23%   
48.25 KB/s
12000K .. .. .. .. .. 23%   
23.22 KB/s
12050K .. ..  23%
2.16 MB/s


03:16:03 (42.86 KB/s) - Data connection: Connection timed out; Control 
connection closed.

Retrying.

--03:16:34--  
ftp://user1:[EMAIL PROTECTED]//long/path/to/file/File01%20de%205%20horas%20%2056%20-%2083.ppt
 (try: 2) => `ftp.example.com/long/path/to/file/File01 de 5 horas  56 - 
83.ppt'

Connecting to ftp.example.com|123.123.123.123|:21... connected.
Logging in as user1 ... Logged in!
==> SYST ... done.==> PWD ... done.
==> TYPE I ... done.  ==> CWD not required.
==> PASV ... done.==> REST 12360360 ... done.
==> RETR File01 de 5 horas  56 - 83.ppt ...
No such file `File01 de 5 horas  56 - 83.ppt'.

The sizes do not match (local 3309820) -- retrieving.

--03:16:41--  
ftp://user1:[EMAIL PROTECTED]//long/path/to/file/File01%20de%205%20horas%20-%2084%20to%20104.ppt
  => `ftp.example.com/long/path/to/file/File01 de 5 horas - 84 
to 104.ppt'

==> CWD /long/path/to/file ... done.
==> PASV ... done.==> REST 3309820 ... done.
==> RETR File01 de 5 horas - 84 to 104.ppt ... done.
Length: 30,419,968 (29M), 27,110,148 (26M) remaining

  [ skipping 3200K ]
3200K ,, ,, ,, ,, .. 10%   
10.75 KB/s
3250K .. .. .. .. .. 11%   
39.65 KB/s

3300K .. .. ..

--
Oliver Schulze L.
<[EMAIL PROTECTED]>



wget - tracking urls/web crawling

2006-06-22 Thread bruce
hi...

i'm testing wget on a test site.. i'm using the recursive function of wget
to crawl through a portion of the site...

it appears that wget is hitting a link within the crawl that's causing it to
begin to crawl through the section of the site again...

i know wget isn't as robust as nutch, but can someone tell me if wget keeps
track of the URLs that it's been through so it doesn't repeat/get stuck in
a never-ending process...

i haven't run across anything in the docs that seems to speak to this
point..

thanks

-bruce




Re: wget - tracking urls/web crawling

2006-06-22 Thread Frank McCown

bruce wrote:

hi...

i'm testing wget on a test site.. i'm using the recursive function of wget
to crawl through a portion of the site...

it appears that wget is hitting a link within the crawl that's causing it to
begin to crawl through the section of the site again...

i know wget isn't as robust as nutch, but can someone tell me if wget keeps
track of the URLs that it's been through so it doesn't repeat/get stuck in
a never-ending process...

i haven't run across anything in the docs that seems to speak to this
point..

thanks

-bruce




Bruce,

Wget does keep a list of URLs that it has visited in order to avoid 
re-visiting them.  The problem could be due to the URL normalization 
scheme.  When wget crawls


http://foo.org/

it puts this URL on the "visited" list. If it later runs into

http://foo.org/default.htm

which is actually the same as

http://foo.org/

then wget is not aware the URLs are the same, so default.htm will be 
crawled again.  But, any URLs extracted from default.htm should be the 
same as the previous crawl, so they should not be crawled again.
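If that duplicate entry point turns out to be the cause, one rough workaround
(just a sketch, assuming the duplicate really is default.htm) is to tell wget
to reject that file name (wget may still fetch it to scan for links, but it
won't keep it):

  wget -r -R default.htm http://foo.org/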


You may want to include a more detailed description of your problem if 
this doesn't help (for example, the command-line arguments, etc.).


Regards,
Frank


RE: wget - tracking urls/web crawling

2006-06-22 Thread bruce
hi frank...

there must be something simple i'm missing...

i'm looking to crawl the site >>>
http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i issue the wget command:
 wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i thought that this would simply get everything under the http://...?20071
URL. however, it appears that wget is also getting 20062, etc., which are the
other semesters...

what i'd really like to do is to simply get 'all depts' for each of the
semesters...

any thoughts/comments/etc...

-bruce



-Original Message-
From: Frank McCown [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 12:12 PM
To: [EMAIL PROTECTED]; wget@sunsite.dk
Subject: Re: wget - tracking urls/web crawling


bruce wrote:
> hi...
>
> i'm testing wget on a test site.. i'm using the recursive function of wget
> to crawl through a portion of the site...
>
> it appears that wget is hitting a link within the crawl that's causing it to
> begin to crawl through the section of the site again...
>
> i know wget isn't as robust as nutch, but can someone tell me if wget keeps
> track of the URLs that it's been through so it doesn't repeat/get stuck in
> a never-ending process...
>
> i haven't run across anything in the docs that seems to speak to this
> point..
>
> thanks
>
> -bruce
>


Bruce,

Wget does keep a list of URLs that it has visited in order to avoid
re-visiting them.  The problem could be due to the URL normalization
scheme.  When wget crawls

http://foo.org/

it puts this URL on the "visited" list. If it later runs into

http://foo.org/default.htm

which is actually the same as

http://foo.org/

then wget is not aware the URLs are the same, so default.htm will be
crawled again.  But, any URLs extracted from default.htm should be the
same as the previous crawl, so they should not be crawled again.

You may want to include a more detailed description of your problem if
this doesn't help (for example, the command-line arguments, etc.).

Regards,
Frank



RE: wget - tracking urls/web crawling

2006-06-22 Thread Post, Mark K
Try using the -np (no parent) parameter.


Mark Post 

-Original Message-
From: bruce [mailto:[EMAIL PROTECTED] 
Sent: Thursday, June 22, 2006 4:15 PM
To: 'Frank McCown'; wget@sunsite.dk
Subject: RE: wget - tracking urls/web crawling

hi frank...

there must be something simple i'm missing...

i'm looking to crawl the site >>>
http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i issue the wget command:
 wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i thought that this would simply get everything under the http://...?20071
URL. however, it appears that wget is also getting 20062, etc., which are the
other semesters...

what i'd really like to do is to simply get 'all depts' for each of the
semesters...

any thoughts/comments/etc...

-bruce




Re: wget - tracking urls/web crawling

2006-06-22 Thread Frank McCown

bruce wrote:

i issue the wget:
 wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071

i thought that this would simply get everything under the http://...?20071.
however, it appears that wget is getting 20062, etc.. which are the other
semesters...


The -np option will keep wget from crawling any URLs that are outside of 
the cgi-bin directory.  That means 20062, etc. *will* be crawled.




what i'd really like to do is to simply get 'all depts' for each of the
semesters...


The problem with the site you are trying to crawl is that its pages are 
hidden behind a web form.  Wget is best at getting pages that are 
directly linked to other pages (e.g., with an <a> tag).


What I'd recommend is creating a list of pages that you want 
crawled.  Maybe you can do this with a script.  Then I'd use the 
--input-file and --page-requisites options (no -r) to crawl just those pages 
and get any images, style sheets, etc. that the pages need to display.
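For example (the list file name here is just a placeholder):

  wget --input-file=pages.txt --page-requisites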



Hope that helps,
Frank


RE: wget - tracking urls/web crawling

2006-06-22 Thread bruce
hey frank...

creating a list of pages to parse doesn't do me any good... i really need to
be able to recurse through the underlying pages.. or at least a section of
the pages...

if there was a way that i could insert/use some form of a regex to exclude
urls+querystring that match, then i'd be ok... the pages i need to exclude
are based on information that's in the query portion of the url...

-bruce



-Original Message-
From: Frank McCown [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 22, 2006 2:34 PM
To: [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: Re: wget - tracking urls/web crawling


bruce wrote:
> i issue the wget:
>  wget -r -np http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071
>
> i thought that this would simply get everything under the http://...?20071
> URL. however, it appears that wget is also getting 20062, etc., which are the
> other semesters...

The -np option will keep wget from crawling any URLs that are outside of
the cgi-bin directory.  That means 20062, etc. *will* be crawled.


> what i'd really like to do is to simply get 'all depts' for each of the
> semesters...

The problem with the site you are trying to crawl is that its pages are
hidden behind a web form.  Wget is best at getting pages that are
directly linked to other pages (e.g., with an <a> tag).

What I'd recommend doing is creating a list of pages that you want
crawled.  Maybe you can do this with a script.  Then I'd use the
--input-file and --page-requisites (no -r) to crawl just those pages and
get any images, style sheets, etc. that the pages need to display.


Hope that helps,
Frank



RE: wget - tracking urls/web crawling

2006-06-22 Thread Tony Lewis
Bruce wrote: 

> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.
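If it ships as a URL regex filter, usage might look something like the sketch
below (the option name and syntax are only a guess at this point):

  wget -r -np --reject-regex '20062' \
    'http://timetable.doit.wisc.edu/cgi-bin/TTW3.search.cgi?20071'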

Tony Lewis