Re: wget bug with ftp/passive

2004-08-12 Thread Jeff Connelly
On Wed, 21 Jan 2004 23:07:30 -0800, you wrote:
>Hello,
>I think I've come across a little bug in wget when using it to get a file
>via ftp.
>
>I did not specify the "passive" option, yet it appears to have been used
>anyway. Here's a short transcript:
Passive FTP can be specified in /etc/wgetrc or /usr/local/etc/wgetrc, and then
it's impossible to turn it off; there is no --active-mode flag as far
as I can tell.
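
For example, a system-wide wgetrc containing a line like the following
forces passive mode on every run, with nothing on the command line to
switch it back to active:

  passive_ftp = on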

I submitted a patch to wget-patches under the title "Patch to add
--active-ftp and make --passive-ftp default", which does what it says.
Your configuration is setting passive mode as the default, but stock wget
defaults to active mode (active mode doesn't work too well behind some
firewalls). --active-ftp is a very useful option in these cases.
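
With the patch applied, forcing active mode for a given transfer would
look something like this (ftp.example.com is just a placeholder host):

  wget --active-ftp ftp://ftp.example.com/pub/somefile.zip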

Last I checked, the patch hasn't been committed. I can't find the wget-patches
mail archives anywhere, either. So I'll paste it here, in hopes that it helps.
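
It should apply from the top of an unpacked wget source tree with something
like the following (assuming you save the paste below as active-ftp.diff):

  patch -p0 < active-ftp.diff
  ./configure && make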

-Jeff Connelly

=cut here=
Common subdirectories: doc.orig/ChangeLog-branches and doc/ChangeLog-branches
diff -u doc.orig/wget.pod doc/wget.pod
--- doc.orig/wget.pod   Wed Jul 21 20:17:29 2004
+++ doc/wget.pod    Wed Jul 21 20:18:56 2004
@@ -888,12 +888,17 @@
 system-specific.  This is why it currently works only with Unix FTP
 servers (and the ones emulating Unix C output).

+=item B<--active-ftp>
+
+Use the I<active> FTP retrieval scheme, in which the server
+initiates the data connection. This is sometimes required to connect
+to FTP servers that are behind firewalls.

 =item B<--passive-ftp>

 Use the I<passive> FTP retrieval scheme, in which the client
 initiates the data connection.  This is sometimes required for FTP
-to work behind firewalls.
+to work behind firewalls, and as such is enabled by default.


 =item B<--retr-symlinks>
Common subdirectories: src.orig/.libs and src/.libs
Common subdirectories: src.orig/ChangeLog-branches and src/ChangeLog-branches
diff -u src.orig/init.c src/init.c
--- src.orig/init.c Wed Jul 21 20:17:33 2004
+++ src/init.c  Wed Jul 21 20:17:59 2004
@@ -255,6 +255,7 @@
   opt.ftp_glob = 1;
   opt.htmlify = 1;
   opt.http_keep_alive = 1;
+  opt.ftp_pasv = 1;
   opt.use_proxy = 1;
   tmp = getenv ("no_proxy");
   if (tmp)
diff -u src.orig/main.c src/main.c
--- src.orig/main.c Wed Jul 21 20:17:33 2004
+++ src/main.c  Wed Jul 21 20:17:59 2004
@@ -217,7 +217,8 @@
 FTP options:\n\
   -nr, --dont-remove-listing   don\'t remove `.listing\' files.\n\
   -g,  --glob=on/off   turn file name globbing on or off.\n\
-   --passive-ftp   use the \"passive\" transfer mode.\n\
+   --passive-ftp   use the \"passive\" transfer mode (default).\n\
+   --active-ftp    use the \"active\" transfer mode.\n\
    --retr-symlinks when recursing, get linked-to files (not dirs).\n\
 \n"), stdout);
   fputs (_("\
@@ -285,6 +286,7 @@
 { "no-parent", no_argument, NULL, 133 },
 { "non-verbose", no_argument, NULL, 146 },
 { "passive-ftp", no_argument, NULL, 139 },
+{ "active-ftp", no_argument, NULL, 167 },
 { "page-requisites", no_argument, NULL, 'p' },
 { "quiet", no_argument, NULL, 'q' },
 { "random-wait", no_argument, NULL, 165 },
@@ -397,6 +399,9 @@
case 139:
  setval ("passiveftp", "on");
  break;
+case 167:
+  setval ("passiveftp", "off");
+  break;
case 141:
  setval ("noclobber", "on");
  break;


trying to wget all of a remote page nearly works

2004-08-12 Thread Mark Pilon

Hi,
I'm trying to do something which seems really simple, but I can't
get it to work (I've tried 2 approaches), and I'd appreciate
a word via email reply on whether what I'm doing is sane and/or possible
w/ wget:
approach #1:
ideally, I'd like to pull all of a web site,
say http://www.cnn.com, save its root doc to some filename
like cnn_com.html with all the child images in a dir: cnn_com_images.
i.e. I'd like to save a copy of cnn.com, w/ images, so that it can
be browsed later w/o a network connection.  I prefer a flat directory
structure but understand the need for wget's directory-per-host
organization.
But I haven't been able to see how to extract the root doc.
approach #2:
instead I'm running wget inside one subdir for the whole doc:
# mkdir cnn_com
# cd cnn_com
# wget -p -E -H -k -nd -nH -d -o wget.run http://www.cnn.com
the images are pulled and placed in the current dir, and _nearly_
all the links are fixed up in index.html to point to the current
subdir except for:
http://i.cnn.net/cnn/images/1.gif"; ...>
[ 1.gif _is_ pulled ]
the debug output is:
.
.
.
index.html: merge("http://www.cnn.com/";, "#ContentArea") -> 
http://www.cnn.com/#ContentArea
appending "http://www.cnn.com/"; to urlpos.
index.html: merge("http://www.cnn.com/";, "http://i.cnn.net/cnn/images/1.gif";) 
-> http://i.cnn.net/cnn/images/1.gif
appending "http://i.cnn.net/cnn/images/1.gif"; to urlpos.
index.html: merge("http://www.cnn.com/";, "http://www.cnn.com/";) -> 
http://www.cnn.com/
appending "http://www.cnn.com/"; to urlpos.
.
.
.

If I omit the -nd -nH options, then all references in the root doc are
fixed up to point to the local copy ... but I lose my flat dir
structure.
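
That is, dropping -nd -nH from the command above (shown here just as an
illustration):

# wget -p -E -H -k -d -o wget.run http://www.cnn.com
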
I would appreciate a sanity check on what I'm trying to do --
essentially creating static test cases for browser development ...
Can I do what I want to do w/ wget?
All replies greatly appreciated - please reply by email to
[EMAIL PROTECTED] as I'm not subscribed to the wget mailing list.
thanks,
Mark


Re: get link Internal Server Error

2004-08-12 Thread jens . roesner
For me this link does NOT work in
IE 6.0
latest Mozilla
latest Opera

So I tested a bit further.
If you go to the site and reach 
http://www.interwetten.com/webclient/start.html
and then use the URL you provide, it works.
A quick check for stored cookies revealed that 
two cookies are stored.
So you have to use wget with cookies.
For info on how to do that, use the manual. 
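
Roughly, something along these lines (untested; check the manual for the
exact option names in your wget version):

wget --save-cookies cookies.txt http://www.interwetten.com/webclient/start.html
wget --load-cookies cookies.txt "http://www.interwetten.com/webclient/betting/offer.aspx?type=1&kindofsportid=10&L=EN"

If the cookies are session-only, the first call may also need
--keep-session-cookies (where supported).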

CU
Jens


> hi all:
> Some links that open fine in IE go wrong when downloaded with wget.
> I can't find a way around it, so I think it may be a bug. Example link:
>
> http://www.interwetten.com/webclient/betting/offer.aspx?type=1&kindofsportid=10&L=EN
>
> This link opens fine in IE, but with wget I get this error:
> Connecting to www.interwetten.com[213.185.178.21]:80... connected.
> HTTP request sent, awaiting response... 500 Internal Server Error
> 01:02:27 ERROR 500: Internal Server Error.
>  henryluo
> 




ftp download "ignoring" length

2004-08-12 Thread Alessandro Tinivelli
Hi all,

I have a shell script that every day uses wget (the installed version is
1.5.3) to fetch a rather big (around 100 MB) .zip file via FTP.
The command line is

#wget -o log.txt ftp://.

The problem is that sometimes in the log file I read

"Length: 106,670,952 (unauthoritative)"

that's correct, but at the end of log it says

" 22650K .. .. .. .. .. 21%
429.91 KB/s

02:51:22 (384.46 KB/s) - `pdf20040812.zip' saved [23240704]"

Why might this happen? Is it a bug in the (old) installed version? Or a
misbehavior of the FTP server? Is it possible to force wget to behave as
in HTTP downloads, where the content length is checked to validate the
downloaded file? The man pages seem to say no.

If I use the "-c" option ('wget -c -o log.txt ftp://...' repeated many
times it would finally download the correct file) it's not as good to
solve this problem, because I would have to ask the author to modify the
shell script (now it waits for wget's end to start processing the file,
so it would work on the corrupted file if he don't add a check that
reads the logfile searching for "100%"). 
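
Something like this is the check I have in mind -- just a sketch, with the
file names as examples and "process_the_file" standing for whatever the
script does today:

#!/bin/sh
wget -o log.txt ftp://...
if grep -q '100%' log.txt; then
    process_the_file pdf20040812.zip
else
    echo "download incomplete, not processing" >&2
    exit 1
fi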



--
Alessandro Tinivelli