RE: one bug?

2005-03-04 Thread Tony Lewis



Jesus Legido wrote:
 
> I'm getting a file from https://mfi-assets.ecb.int/dla/EA/ea_all_050303.txt:
 
The problem is not with wget. The file on the server starts with 0xFF 0xFE
(a UTF-16 byte-order mark). Put the following into an HTML file (say
temp.html) on
your hard drive, open it in your web browser, right click on the link and do a 
"Save As..." to your hard drive. You will get the same thing as wget 
downloaded.
 
<a href="https://mfi-assets.ecb.int/dla/EA/ea_all_050303.txt">ea.txt</a>
 


RE: Curb maximum size of headers

2005-03-16 Thread Tony Lewis
Hrvoje Niksic wrote:

> This patch imposes IMHO reasonable, yet safe, limits for reading server
> responses into memory.

Your choice of default limits looks reasonable to me, but shouldn't wget
provide the user a way to override these limits?

Tony




RE: Curb maximum size of headers

2005-03-17 Thread Tony Lewis
Hrvoje Niksic wrote:

> Overriding these limits would require *two* new cryptic command-line
> options that would clutter the code and documentation and in all
> likeliness would never be used, thus bringing no value to the user.

As I said, I think the proposed limits are reasonable, but what if they are
not for a given user mirroring some website? If I understood the patch, wget
will refuse to download some files and the user will have no way to coax it
to do so.

If I missed something, please explain.

Tony




RE: Curb maximum size of headers

2005-03-17 Thread Tony Lewis
Hrvoje Niksic wrote:

> I don't see how and why a web site would generate headers (not bodies, to
> be sure) larger than 64k.

To be honest, I'm less concerned about the 64K header limit than I am about
limiting a header line to 4096 bytes. I don't know any sites that send back
header lines that long, but they could. Who's to say some site doesn't have
a 4K cookie?

Since you are already proposing to limit the entire header to 64K, what is
gained by adding this second limit?

Tony




RE: help!!!

2005-03-21 Thread Tony Lewis
Richard Emanilov wrote:

> Below is what I have tried with no success
>
> wget --http-user=login --http-passwd=passwd
>      --http-post="login=login&password=passwd"

That should be:

wget --http-user=login --http-passwd=passwd --post-data="login=login&password=passwd"

Tony




RE: help!!!

2005-03-21 Thread Tony Lewis
The --post-data option was added in version 1.9. You need to upgrade your
version of wget. 

Tony
-Original Message-
From: Richard Emanilov [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 21, 2005 8:49 AM
To: Tony Lewis; [EMAIL PROTECTED]
Cc: wget@sunsite.dk
Subject: RE: help!!!

wget --http-user=login --http-passwd=passwd
--post-data="login=login&password=passwd" https://site

wget: unrecognized option `--post-data=login=login&password=password'
Usage: wget [OPTION]... [URL]... 


wget --http-user=login --http-passwd=passwd
--http-post="login=login&password=password" https:site
wget: unrecognized option `--http-post=login=login&password=passwd'
Usage: wget [OPTION]... [URL]...

Try `wget --help' for more options.

wget -V
GNU Wget 1.8.2


Richard Emanilov
 
[EMAIL PROTECTED]


-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Monday, March 21, 2005 10:26 AM
To: wget@sunsite.dk
Cc: Richard Emanilov
Subject: RE: help!!!

Richard Emanilov wrote:

> Below is what I have tried with no success
>
> wget --http-user=login --http-passwd=passwd
>      --http-post="login=login&password=passwd"

That should be:

wget --http-user=login --http-passwd=passwd --post-data="login=login&password=passwd"

Tony






RE: help!!!

2005-03-21 Thread Tony Lewis
This looks like a bug (or at least an implementation oversight) to me.

Send POST request to https://URL
302 Response with location https://URL/
Send GET request to https://URL/


Richard, try changing your wget command line to include the trailing slash.
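
Based on the command in the quoted message below, that would be something like
(note the trailing slash after "ft"):

/usr/local/bin/wget -dv --post-data="login=login&password=password" "https://login:[EMAIL PROTECTED]:8443/ft/"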

Tony
-Original Message-
From: Richard Emanilov [mailto:[EMAIL PROTECTED] 
Sent: Monday, March 21, 2005 2:17 PM
To: Mauro Tortonesi
Cc: Tony Lewis; wget@sunsite.dk; [EMAIL PROTECTED]
Subject: RE: help!!!

/usr/local/bin/wget  -dv --post-data="login=login&password=password"
https://login:[EMAIL PROTECTED]:8443/ft
DEBUG output created by Wget 1.9.1 on linux-gnu.

--17:11:35--  https://login:[EMAIL PROTECTED]:8443/ft
   => `ft'
Connecting to ip... connected.
Created socket 3.
Releasing 0x8123110 (new refcount 0).
Deleting unused 0x8123110.
---request begin---
POST /ft HTTP/1.0
User-Agent: Wget/1.9.1
Host: ip:8443
Accept: */*
Connection: Keep-Alive
Authorization: Basic cm9zZW46Y3VybHlx
Content-Type: application/x-www-form-urlencoded
Content-Length: 30

[POST data: login=login&password=passwd]
---request end---
HTTP request sent, awaiting response... HTTP/1.1 302 Moved Temporarily
Location: https://login:[EMAIL PROTECTED]:8443/ft
Content-Length: 0
Date: Mon, 21 Mar 2005 22:11:35 GMT
Server: Apache-Coyote/1.1
Connection: Keep-Alive


Registered fd 3 for persistent reuse.
Location: https://ip:8443/ft/ [following]
Closing fd 3
Releasing 0x81359a0 (new refcount 0).
Deleting unused 0x81359a0.
Invalidating fd 3 from further reuse.
--17:11:35--  https://ip:8443/ft/
   => `index.html'
Connecting to ip:8443... connected.
Created socket 3.
Releasing 0x8118718 (new refcount 0).
Deleting unused 0x8118718.
---request begin---
GET /ft/ HTTP/1.0
User-Agent: Wget/1.9.1
Host: ip:8443
Accept: */*
Connection: Keep-Alive

---request end---
HTTP request sent, awaiting response... HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="Portfolio Viewer"
Content-Type: text/html;charset=ISO-8859-1
Content-Language: en-US
Date: Mon, 21 Mar 2005 22:11:35 GMT
Server: Apache-Coyote/1.1
Connection: close


Closing fd 3
Authorization failed. 


Again, I'd like to thank you guys so much. I've made some progress; are any of
you guys familiar with this issue?

-Original Message-
From: Mauro Tortonesi [mailto:[EMAIL PROTECTED]
Sent: Monday, March 21, 2005 4:11 PM
To: Richard Emanilov
Cc: Tony Lewis; wget@sunsite.dk; [EMAIL PROTECTED]
Subject: Re: help!!!

On Monday 21 March 2005 02:22 pm, Richard Emanilov wrote:
> Guys,
>
>
> Thanks so much for your help, when running
>
> wget --http-user=login --http-passwd=passwd 
> --post-data="login=login&password=passwd" https://site
>
> With version 1.9.1, I get the error message
>
> "Site:  Unsupported scheme."

have you compiled wget with SSL support?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi

University of Ferrara - Dept. of Eng.     http://www.ing.unife.it
Institute of Human & Machine Cognition    http://www.ihmc.us
Deep Space 6 - IPv6 for Linux             http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it




RE: File rejection is not working

2005-04-06 Thread Tony Lewis
Jens Rösner wrote: 

> AFAIK, RegExp for (HTML?) file rejection was requested a few times, but is
> not implemented at the moment.

It seems all the examples people are sending are just attempting to get a
match that is not case sensitive. A switch to ignore case in the file name
match would be a lot easier to implement than regular expressions and solve
the most pressing need.

Just a thought.

Tony




RE: newbie question

2005-04-14 Thread Tony Lewis
Alan Thomas wrote:

> I am having trouble getting the files I want using a wildcard specifier...

There are no options on the command line for what you're attempting to do.

Neither wget nor the server you're contacting understands "*.pdf" in a URI.
In the case of wget, it is designed to read web pages (HTML files) and then
collect a list of resources that are referenced in those pages, which it
then retrieves. In the case of the web server, it is designed to return
individual objects on request (X.pdf or Y.pdf, but not *.pdf). Some web
servers will return a list of files if you specify a directory, but you
already tried that in your first use case.

Try coming at this from a different direction. If you were going to manually
download every PDF from that directory, how would YOU figure out the names
of each one? Is there a web page that contains a list somewhere? If so,
point wget there.
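
For example, if such a page exists (hypothetical URL), something like this
would fetch the PDFs it links to:

wget -r -l 1 -np -A pdf "http://www.example.com/docs/index.html"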

Hope that helps.

Tony

PS) Jens was mistaken when he said that https requires you to log into the
server. Some servers may require authentication before returning information
over a secure (https) channel, but that is not a given.




RE: SSL options

2005-04-21 Thread Tony Lewis
Hrvoje Niksic wrote:

> The question is what should we do for 1.10?  Document the
> unreadable names and cryptic values, and have to support
> them until eternity?

My vote is to change them to more reasonable syntax (as you suggested
earlier in the note) for 1.10 and include the new syntax in the
documentation. However, I think wget should continue to support the old
options and syntax as alternatives in case people have included them in
scripts.

Tony




RE: links conversion; non-existent index.html

2005-05-01 Thread Tony Lewis
Andrzej wrote:

> Two problems:
>
> There is no index.html under this link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
[snip]
> it creates a non existing link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html

When you specify a directory, it is up to the web server to determine what
resource gets returned. Some web servers will return a directory listing,
some will return some file (such as index.html), and others will return an
error.

For example, Apache might return (in this order): index.html, index.htm, a
directory listing (or a 403 Forbidden response if the configuration
disallows directory listings). The actual list of files that Apache will
search for and the order in which they are selected is determined by the
configuration.
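
For what it's worth, in Apache that search list is controlled by the
DirectoryIndex directive, e.g.:

    DirectoryIndex index.html index.htm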

If the web server returns any information, wget has to save the information
that is returned in *some* local file. It chooses to name that local file
"index.html" since it has no way of knowing where the information might have
actually been stored on the server.

Hope that helps,

Tony





RE: Is it just that the -m (mirror) option an impossible task [Was: wget 1.91 skips most files]

2005-05-28 Thread Tony Lewis
Maurice Volaski wrote:

> wget's -m option seems to be able to ignore most of the files it should
> download from a site. Is this simply because wget can download only the
> files it can see? That is, if the web server's directory indexing option
> is off and a page on the site is present on the server, but it isn't
> referenced by any publicly viewable page, wget simply can't see it. 

I've been thinking about coding a --extra-sensory-perception option that
would cause wget to read the mind of the server so that it can download
files that it cannot see. As soon as I get the algorithm worked out, I'll be
submitting the patch. So far I've figured out how to download index.html
without being able to see it, but I'm sure that if I keep working at it,
wget will be able to detect the rest of the files it cannot see. Of course,
I could just be taking the wrong approach; it may work better if I try to
implement the --psychic option instead.

Tony




RE: Removing thousand separators from file size output

2005-06-24 Thread Tony Lewis
Hrvoje Niksic wrote: 

> In fact, I know of no application that accepts numbers as Wget prints
> them.

Microsoft Calculator does.

Tony




Name or service not known error

2005-06-27 Thread Tony Lewis



I got a "Name or 
service not known" error from wget 1.10 running on Linux. When I installed an 
earlier version of wget, it worked just fine. It also works just fine on 
version 1.10 running on Windows. Any ideas?
 
Here's the output on 
Linux:
 
wget --versionGNU Wget 1.9-beta1
 
wget http://www.calottery.com/Games/MegaMillions/--17:29:59--  
http://www.calottery.com/Games/MegaMillions/   
=> `index.html.8'Resolving www.calottery.com... 64.164.108.164, 
64.164.108.202Connecting to www.calottery.com[64.164.108.164]:80... 
connected.HTTP request sent, awaiting response... 200 OKLength: 45,166 
[text/html]
 
100%[==>] 
45,166   166.21K/s
 
17:30:01 (166.17 KB/s) - `index.html.8' saved 
[45166/45166]
 

 
wget --versionGNU Wget 1.10
 
wget http://www.calottery.com/Games/MegaMillions/--17:30:17--  
http://www.calottery.com/Games/MegaMillions/   
=> `index.html.9'Resolving www.calottery.com... failed: Name or service 
not known.


RE: Name or service not known error

2005-06-28 Thread Tony Lewis
Hrvoje Niksic wrote:

> 1. Does wget -4 http://... work?

Yes

> 2. Does Wget work when you specify --disable-ipv6 to configure?

Yes

> What OS are you running this on?

Red Hat Linux release 6.2 (Zoot)

Tony




RE: Invalid directory names created by wget

2005-07-08 Thread Tony Lewis
Larry Jones wrote: 

> Of course it's directly accessible -- you just have to quote it to keep
> the shell from processing the parentheses:
>
>   cd 'title.Die-Struck+(Gold+on+Gold)+Lapel+Pins'

You can also make the individual characters into literals:

cd title.Die-Struck+\(Gold+on+Gold\)+Lapel+Pins

Tony




RE: robots.txt takes precedence over -p

2005-07-10 Thread Tony Lewis
Thomas Boerner wrote: 

> Is this behaviour:  "robots.txt takes precedence over -p" a bug or
> a feature?

It is a feature. If you want to ignore robots.txt, use this command line:

wget -p -k www.heise.de/index.html -e robots=off

Tony




RE: wget a file with long path on Windows XP

2005-07-20 Thread Tony Lewis
You need to quote the URL because it contains characters that are
interpreted by the shell (and therefore are never passed to wget).
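
For example (abbreviated query string; use the full one from your command
below), wrapping the URL in double quotes keeps the shell from treating each
"&" as a command separator:

"C:\Program Files\wget\wget" --save-cookies cookies.txt "http://safari.informit.com/?x=1&mode=Logout&sortKey=title&sortOrder=asc"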

Tony
-Original Message-
From: PoWah Wong [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 20, 2005 8:35 PM
To: wget@sunsite.dk
Subject: Re: wget a file with long path on Windows XP

This does not work, please help.


"C:\Program Files\wget\wget" --save-cookies cookies.txt
http://safari.informit.com/?x=1&mode=Logout&sortKey=title&sortOrder=asc&view
=&xmlid=&g=&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=1&n=1&d=1&p=
1&a=0&[EMAIL PROTECTED]&password=12345678
--23:24:23--  http://safari.informit.com/?x=1
   => [EMAIL PROTECTED]'
Resolving safari.informit.com... 193.194.158.208 Connecting to
safari.informit.com|193.194.158.208|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [text/html]

[ <=> 
   ] 0 --.--K/s

23:24:23 (0.00 B/s) - [EMAIL PROTECTED]' saved [0/0]

Invalid parameter - =Logout
'sortKey' is not recognized as an internal or external command, operable
program or batch file.
'sortOrder' is not recognized as an internal or external command, operable
program or batch file.
'view' is not recognized as an internal or external command, operable
program or batch file.
'xmlid' is not recognized as an internal or external command, operable
program or batch file.
'g' is not recognized as an internal or external command, operable program
or batch file.
'catid' is not recognized as an internal or external command, operable
program or batch file.
's' is not recognized as an internal or external command, operable program
or batch file.
'b' is not recognized as an internal or external command, operable program
or batch file.
'f' is not recognized as an internal or external command, operable program
or batch file.
't' is not recognized as an internal or external command, operable program
or batch file.
'c' is not recognized as an internal or external command, operable program
or batch file.
'u' is not recognized as an internal or external command, operable program
or batch file.
'r' is not recognized as an internal or external command, operable program
or batch file.
'o' is not recognized as an internal or external command, operable program
or batch file.
'n' is not recognized as an internal or external command, operable program
or batch file.
'd' is not recognized as an internal or external command, operable program
or batch file.
'p' is not recognized as an internal or external command, operable program
or batch file.
'a' is not recognized as an internal or external command, operable program
or batch file.
'username' is not recognized as an internal or external command, operable
program or batch file.
'password' is not recognized as an internal or external command, operable
program or batch file.


I am not subscribed, please cc'd in replies to my post. Thanks. 

--- Frank McCown <[EMAIL PROTECTED]> wrote:

> This sounds like a difficult page to download because they may be 
> using cookies or session variables.  I'm not sure the best way to 
> proceed, but I would look at the wget documentation about cookies.  I 
> think you may have to save the cookies that are generated by the login 
> page and use --load-cookie to get the page you are after.
> 
> By the way, if you are only after a single page, why not just save it 
> using the browser?
> 
> Frank
> 
> 
> PoWah Wong wrote:
> > The website is actually www.informit.com.
> > It require logging in at
> >
>
https://secure.safaribooksonline.com/promo.asp?code=ITT03&portal=informit&a=
0
> > After logging in, then the website becomes similar
> to
> > booksonline.com which I edit slightly.
> > My public library's electronic access which also require logging in.
> > 
> > 
> > --- Frank McCown <[EMAIL PROTECTED]> wrote:
> > 
> > 
> >>Putting quotes around the url got rid of your "Invalid parameter" 
> >>errors.
> >>
> >>I just tried accessing the url you are trying to wget and received 
> >>an http 500 response.  I also tried accessing 
> >>http://proquest.booksonline.com/ and never got a response.
> >>
> >>According to your output, wget got back a 0 length response.  I 
> >>would check your web server and make sure it is working properly.
> >>
> >>Frank
> >>
> >>
> >>PoWah Wong wrote:
> >>
> >>>I put "quotes" around the url, but it still does
> >>
> >>not
> >>
> >>>work.
> >>>
> >>>C:\book>"C:\Program Files\wget\wget.exe"
> >>>
> >>
> >
>
"http://proquest.booksonline.com/?x=1&mode=section&so
> > 
> >
>
rtKey=title&sortOrder=asc&view=&xmlid=0-321-16076-2/ch03lev1sec1&g=&catid=&s
=1&b=1&f=1&t=1&c=1&u=1&r
> > 
> >>>=&o=1&n=1&d=1&p=1&a=0&page=0"
> >>>--22:45:26--
> >>>
> >>
> >
>
http://proquest.booksonline.com/?x=1&mode=section&sortKey=title&sortOrder=as
c&vi
> > 
> >
>
ew=&xmlid=0-321-16076-2/ch03lev1sec1&g=&catid=&s=1&b=1&f=1&t=1&c=1&u=1&r=&o=
1&n=1&d=1&p=1&a=0&page=0
> > 
> >>>   =>
> >>>
> >
>
[EMAIL PROTECTED]&mode=section&sortKey

RE: wget a file with long path on Windows XP

2005-07-21 Thread Tony Lewis
PoWah Wong wrote: 

> The login page is:
> http://safari.informit.com/?FPI=&uicode=
>
> How to figure out the login command?
>
> These two commands do not work:
>
> wget --save-cookies cookies.txt "http://safari.informit.com/?FPI= [snip]"
> wget --save-cookies cookies.txt
"http://safari.informit.com/?FPI=&uicode=/login.php? [snip]"

When trying to recreate a form in wget, you have to send the data the server
is expecting to receive to the location the server is expecting to receive
it. You have to look at the login page for the login form and recreate it.
In your browser, view the source to http://safari.informit.com/?FPI=&uicode=
and you will find the form that appears below. Note that I stripped out
formatting information for the table that contains the form and reformatted
what was left to make it readable.

<form method="post" action="JVXSL.asp">
  <!-- several hidden inputs (s, o, b, t, f, c, u, r, l, g, n, d, a) -->
  <input type="text" name="usr">
  <input type="password" name="pwd">
  <input type="checkbox" name="savepwd" value="1">
</form>

Note that the server expects the data to be posted to JVXSL.asp and that
there are a bunch of fields that must be supplied in order for the server to
process the login request. In addition, the two fields you supply are called
"usr" and "pwd". So your first wget command line will look something like
this:

wget --save-cookies cookies.txt "http://safari.informit.com/JVXSL.asp"
--post-data="s=1&o=1&b=1&t=1&f=1&c=1&u=1&r=&l=1&g=&n=1&d=1&a=0&usr=wong_powa
[EMAIL PROTECTED]&pwd=123&savepwd=1"

Hope that helps!

Tony




RE: connect to server/request multiple pages

2005-07-21 Thread Tony Lewis



Pat Malatack wrote:

> is there a way to stay connected, because it seems to me that this takes a
> decent amount of time that could be minimized

The following command will do what you want:

wget "google.com/news" "google.com/froogle"

Tony


RE: Wget patches for ".files"

2005-08-19 Thread Tony Lewis
Mauro Tortonesi wrote: 

> this is a very interesting point, but the patch you mentioned above uses
> the LIST -a FTP command, which AFAIK is not supported by all FTP servers.

As I recall, that's why the patch was not accepted. However, it would be
useful if there were some command line option to affect the LIST parameters.
Perhaps something like:

wget ftp://ftp.somesite.com --ftp-list="-a"

Tony




RE: with recursive wget status code does not reflect success/failure of operation

2005-09-19 Thread Tony Lewis
Steven M. Schweda wrote:

> Having the exit codes defined in a central location would make it easy
> to adapt them as needed.  Having to search the code for every instance
> of "return 1" or "exit(2)" would make it too complicated.

It seems to me that the easiest way to deal with exit codes is to have a
single function to set the exit code. For example:

  setexitcode(WGET_EXIT_SUCCESS);
or
  setexitcode(WGET_EXIT_QUOTA_EXCEEDED);

This function should be called any time there is an event that might
influence the exit code and the function can then decide what exit code
should be used based on all calls made prior to the end of program
execution. Not only will such an approach restrict the logic for setting the
error code to one place in the code, it will make OS-specific versions of
the error code (such as what Steven desires for VMS) much easier to
implement.

The biggest challenge will be determining the list of WGET_EXIT_* constants
and the interactions between them that influence the final value of the exit
code.
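
As a minimal sketch of the idea (illustrative names and constants, not actual
wget code):

#include <stdlib.h>

/* Illustrative constants; the real list would need to be worked out. */
enum wget_exit_code {
  WGET_EXIT_SUCCESS = 0,
  WGET_EXIT_QUOTA_EXCEEDED,
  WGET_EXIT_IO_ERROR
};

static enum wget_exit_code final_code = WGET_EXIT_SUCCESS;

/* Called whenever an event might influence the exit code; the policy for
   combining events lives here, in one place. */
void setexitcode(enum wget_exit_code code)
{
  if (code > final_code)
    final_code = code;
}

/* The only place that calls exit(); OS-specific mappings would go here. */
void wget_exit(void)
{
  exit(final_code);
}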

(Was that worth two cents?)

Tony




RE: with recursive wget status code does not reflect success/failure of operation

2005-09-19 Thread Tony Lewis
Hrvoje Niksic wrote:
 
> But I wonder if that's overengineering at work.

I don't think so. The overarching concern is to do what's "expected". As you
noted elsewhere, on a Unix system, that means exit(0) in the case of success
-- preferably with exit(meaningful_value) otherwise. As I recall this chain
started because of the absence of a meaningful value.

I think the use of a setexitcode function could easily satisfy people in the
Unix world and will greatly simplify adapting wget for other operating
systems.

Reflecting on the exchange that you and Steven just had, I think we also
need a wget_exit function that calls exit with an appropriate value. (That
will allow Steven to further adapt for the VMS environment.) In that case,
exit should only be called by wget_exit.

By the way, when do we start on 2.0? I don't know how much time I will be
able to devote to serious coding, but I'd love to participate as fully as I
can in both the architecture and development.

Tony




RE: Recursive accept/reject and html files?

2005-10-03 Thread Tony Lewis
> Is this currently possible? Am I just overlooking something obvious? 

Have you looked at --level?
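
For the "Index of..." case described in the quoted message below, something
like this (hypothetical URL) limits recursion to the links on that one page
while keeping only the mp3 files:

wget -r -l 1 -np -A mp3 "http://www.example.com/music/"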

PS) Questions about how wget works should be sent to wget@sunsite.dk
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Sunday, October 02, 2005 6:11 PM
To: [EMAIL PROTECTED]
Subject: Recursive accept/reject and html files?

In the manual for Wget, I see the following about --accept and --reject:

"Note that these two options do not affect the downloading of html files;
Wget must load all the htmls to know where to go at all-recursive retrieval
would make no sense otherwise."

This makes sense to an extent, because I want to download an html file to
get links from, but it doesn't seem to allow me to prevent it from following
further links to html files while still downloading other files from that
original page.

For example, I want to grab all files of a specific type from an "Index
of..." style file list. The list itself is an html page, and I want to grab
all mp3 files, but I don't want to grab the "Parent Directory" and different
sorting modes for the current directory, which are all html files.

I only want to download one html file, but still follow links from it to all
mp3 files.

Is this currently possible? Am I just overlooking something obvious?

Thanks.




RE: possible bug/addition to wget

2005-10-03 Thread Tony Lewis
wget (currently) compares the following things between the server and local
file system: file name, file size, and modification time. If all three match
and the --timestamping option is used, the file will not be downloaded from
the server. None of those things will match in the case you mention.

It seems that you're suggesting that wget assume X.tar is a match for X.tar.gz
regardless of file size and modification time. If that's the case, when
would wget ever know that it should download X.tar.gz because it has
changed?

I can see two ways to approach the feature you're suggesting: 1) if a local
file exists without the .gz extension, gzip the local file and compare just
the size of the two files; 2) download the file from the server, unzip it,
compare the result, if the same, delete the downloaded file. It seems to me
that number 2 is the only safe way to do it, but it's probably not what
you're looking for.
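
As a rough sketch of the second approach (hypothetical file names), done with
shell tools outside of wget:

wget ftp://ftp.somesite.com/pub/X.tar.gz
gunzip -c X.tar.gz | cmp -s - X.tar && rm X.tar.gz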

If you want to have a go at this, look for opt.timestamping in ftp.c.

Hope that helps!

-Original Message-
From: bob stephens [contr] [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 30, 2005 10:06 AM
To: [EMAIL PROTECTED]
Subject: possible bug/addition to wget

Hi WGet folks,

This isn't really a bug I found in the operation of wget, but I think it is a
functionality problem.

I wonder if you can help me. I would like to use wget to mirror an ftp site
- this step seems easy.
BUT, I would like to set it up so that the files on my end are un- gzipped.
When I did this step, of course it goes and looks for the gzipped file,
doesnt find it and then goes ahead and redownloads the file. Is there a way
to make it smart enough to see that the file exists in an alternate form.
I can fix it if you can give me a clue as to where to look in the code - in
the compare files section of ftp.c it looks like it does the comparison
using strcmp - but maybe it could be changed to say "if the file has the
extension gz (or Z or zip), then look for the file without the extension
too.

Does this sound doable - it seems to me like it would be a useful
modification. I was hoping I wouldn't have to get the .listing file and make
the modifications in there.
Thanks a lot,

Bob





RE: wget can't handle large files

2005-10-18 Thread Tony Lewis
Eberhard Wolff wrote: 

> Apparently wget can't handle large files.
[snip]
> wget --version GNU Wget 1.8.2

This "bug" was fixed in version 1.10 of wget. You should obtain a copy of
the latest version, 1.10.2.

Tony




RE: wget and international characters (ascii > 127)

2005-10-19 Thread Tony Lewis
Hrvoje Niksic wrote: 

> How should Wget know what is the encoding of the file system?

Isn't it possible for wget to make some determination about the file system
when configure runs? Doesn't it know whether the build is for Unix, Windows,
etc.? If so, can that information be used to handle characters greater than
127 (on an OS-by-OS basis)?

Just wondering...

Tony




RE: wget and international characters (ascii > 127)

2005-10-19 Thread Tony Lewis
Hrvoje Niksic wrote: 

> A program can be built on one machine and run on one or more others.

On one machine, yes, but can it be built on one architecture and run on
another? Will the Unix version run on Windows?

> I'm not aware of a way to reliably determine the coding of a file system.

Neither am I, but my thinking was that a version of wget built on "Unix"
(whatever that means) could use "Unix" rules for file names.

> (Other than the simple Windows vs. Unix, although even that fails
> in presence of SMB shares, Cygwin, and the like.)

Isn't it the responsibility of the software that provides the multi-platform
mapping (Samba, for instance) to handle the differences in file names?

Hmmm... is there anything we can learn from the Samba project about
file-name encoding?

Tony




RE: bug retrieving embedded images with --page-requisites

2005-11-09 Thread Tony Lewis
Jean-Marc MOLINA wrote:

> For example if a PNG image is generated using a "gen_png_image.php" PHP
> script, I think wget should be able to download it if the option
> "--page-requisites" is used, because it's part of the page and it's not
> an external resource, get its MIME type, "image/png", and using the
> option "--convert-links" should also rename the script-image to
> "gen_png_image.png".

The --convert-links option changes the website path to a local file system
path. That is, it changes the directory, not the file name. IMO, your
suggestion has merit, but it would require wget to maintain a list of MIME
types and corresponding renaming rules.

Tony




RE: Error connecting to target server

2005-11-11 Thread Tony Lewis
[EMAIL PROTECTED] wrote:

> Thanks for your reply. Only ping works for bbc.com and not wget.

When I issue the command "wget www.bbc.com", it successfully downloads the
following file:




<html>
<head>
<meta http-equiv="refresh" content="0; url=http://www.bbc.co.uk/?ok">
<title>British Broadcasting Corporation</title>
</head>
<body></body>
</html>


You might want to try "wget http://www.bbc.co.uk".

I think http://www.gnu.org/software/wget/faq.html should have another
question: "Why did my download fail and how can I get it to work?" In the
answer to that question we should mention all the common failure modes:
disallowed by robots.txt, need to set user agent to look like a browser,
META refresh (as above), etc. along with the command line options to resolve
the failure.

Also, perhaps the next version of wget can handle META refresh.

Tony




RE: spaces in pathnames using --directory-prefix=prefix

2005-11-30 Thread Tony Lewis
Jonathan DeGumbia wrote:
 
> I'm trying to use the --directory-prefix=prefix option for wget on a
> Windows system.  My prefix has spaces in the path directories.  Wget
> appears to terminate the path at the first space encountered.   In other
> words if my prefix is: c:/my prefix/   then wget copies files to c:/my/ .
>
> Is there a work-around for this?

wget is not terminating the path at the command line delimiter, Windows is.
In the same way that you have to enter:
dir "c:\my prefix"
to list the contents of the directory, you have to enter:
wget --directory-prefix="c:/my prefix"
or the command processor will split the directory path at the space before
passing it to wget.

Tony




RE: wget with a log database?

2005-12-01 Thread Tony Lewis
Mauro Tortonesi wrote: 

> Juhana Sadeharju wrote:
> > Hello. I would like to have a database within wget. 
>
> this is an interesting feature. would anyone else like it to be included
> in wget?

Yes; I think wget could be made into a much more robust and useful
application if it could keep track of session information in a database.

Tony




RE: wget 1.10.x fixed recursive ftp download over proxy

2006-01-09 Thread Tony Lewis



I believe the following simplified code would have the same effect:

if ((opt.recursive || opt.page_requisites || opt.use_proxy)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

Tony



From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of CHEN Peng
Sent: Monday, January 09, 2006 12:38 AM
To: [EMAIL PROTECTED]
Subject: wget 1.10.x fixed recursive ftp download over proxy
Hi,
We once encountered an annoying problem recursively downloading FTP data
using wget through an ftp-over-http proxy. At first it was the proxy firmware
that did not support recursive downloads, but even after upgrading it we
realized there is a problem with wget itself as well.

We found that with the new proxy firmware, the older wget 1.7.x can download
an FTP database recursively, but the newer versions (1.9.x and 1.10.x) can
not. That means there must be something wrong with the code.

I also confirmed this has been a known bug in wget since 2003 and it is
strange it has not been fixed for such a long time.

To fix this problem, I took some time to analyze the code. wget uses a
different method to get the list of files for a destination folder when
trying to do a recursive download. For normal FTP, it uses the FTP command
"LIST" to get the file listing. For normal HTTP, it uses its internal method
retrieve_tree() to generate the list.

In main.c, it does not use the retrieve_tree() function to generate the list
if the traffic is FTP. However, when we use an ftp-over-http proxy, the
actual request to the server is an HTTP request, where the "LIST" FTP command
won't work, so we only get one "index.html" file.

if ((opt.recursive || opt.page_requisites)
    && url_scheme (*t) != SCHEME_FTP)
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

In this scenario, we need to modify the code to force wget to call the
retrieve_tree function for FTP traffic if a proxy is involved:

if ((opt.recursive || opt.page_requisites)
//  && url_scheme (*t) != SCHEME_FTP)
    && ((url_scheme (*t) != SCHEME_FTP) ||
        (opt.use_proxy && url_scheme (*t) == SCHEME_FTP)))
  status = retrieve_tree (*t);
else
  status = retrieve_url (*t, &filename, &redirected_URL, NULL, &dt);

After patching main.c, the new wget works perfectly for FTP recursive
downloading, both with proxy and without proxy. This patch works for 1.9.x
and 1.10.x up to the latest version so far (1.10.2).

--
CHEN Peng <[EMAIL PROTECTED]>



RE: wget 1.10.x fixed recursive ftp download over proxy

2006-01-10 Thread Tony Lewis
Here's what your version of the code said:

if ((opt.recursive || opt.page_requisites)
&& ((url_scheme (*t) != SCHEME_FTP) || 
(opt.use_proxy && url_scheme (*t) == SCHEME_FTP)))

which means (for the bit after the &&):

          FTP   ^FTP
proxy      T      T
^proxy     F      T

Regardless of whether it is FTP, the condition will always succeed if
use_proxy is true. Therefore, a simpler way of writing the expression is:
(url_scheme (*t) != SCHEME_FTP) || opt.use_proxy

You're right that I shouldn't have moved opt.use_proxy with the other
command line options. My revised suggestion is:

if ((opt.recursive || opt.page_requisites)
&& ((url_scheme (*t) != SCHEME_FTP) || opt.use_proxy))
  status = retrieve_tree (*t);
else
  status = retrieve_url
  (*t, &filename, &redirected_URL, NULL, &dt);

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
CHEN Peng
Sent: Tuesday, January 10, 2006 5:06 PM
To: Tony Lewis
Cc: [EMAIL PROTECTED]
Subject: Re: wget 1.10.x fixed recursive ftp download over proxy

Your simplified code may not work. The intention of patching is to make wget
invoke "retrieve_tree" funtion when it IS "FTP" and uses proxy, while your
code works when it is NOT FTP and uses proxy.


On 1/10/06, Tony Lewis <[EMAIL PROTECTED]> wrote:
>
> I believe the following simplified code would have the same effect:
>
> if ((opt.recursive || opt.page_requisites || opt.use_proxy) && 
> url_scheme (*t) != SCHEME_FTP)
>   status = retrieve_tree (*t);
> else
>   status = retrieve_url
>   (*t, &filename, &redirected_URL, NULL, &dt);
>
>
> Tony
>

--
CHEN Peng <[EMAIL PROTECTED]>



RE: wget option (idea for recursive ftp/globbing)

2006-03-02 Thread Tony Lewis
Mauro Tortonesi wrote: 

> i would like to read other users' opinion before deciding which
> course of action to take, though.

Other users have suggested adding a command line option for "-a" two or
three times in the past:

- 2002-11-24: Steve Friedl <[EMAIL PROTECTED]> submitted a patch
- 2002-12-24: Maaged Mazyek <[EMAIL PROTECTED]> submitted a patch
- 2005-05-09: B Wooster <[EMAIL PROTECTED]> asked if the fix was ever
going to be implemented
- 2005-08-19: Carl G. Ponder <[EMAIL PROTECTED]> asked if the patches
were going to be applied
- 2005-08-20: Hrvoje responded by posting his own patch for --list-options

(and that's just what I can find in my local archive searching for "list
-a")

There is clearly a need among the user community for a feature like this and
lots of ideas about how to implement it. I'd say you should pick one and
implement it.

If you need copies of any of the patches mentioned in the list above, let me
know.

Tony



RE: Bug in ETA code on x64

2006-03-28 Thread Tony Lewis
Hrvoje Niksic wrote:

> The cast to int looks like someone was trying to remove a warning and
> botched operator precedence in the process.

I can't see any good reason to use "," here. Why not write the line as:
  eta_hrs = eta / 3600; eta %= 3600;

This makes it much less likely that someone will make a coding error while
editing that section of code.

Tony



RE: regex support RFC

2006-03-30 Thread Tony Lewis
How many keywords do we need to provide maximum flexibility on the
components of the URI? (I'm thinking we need five.)

Consider http://www.example.com/path/to/script.cgi?foo=bar

--filter=uri:regex could match against any part of the URI
--filter=domain:regex could match against www.example.com
--filter=path:regex could match against /path/to/script.cgi
--filter=file:regex could match against script.cgi
--filter=query:regex could match against foo=bar

I think there are good arguments for and against matching against the file
name in "path:"

Tony



RE: regex support RFC

2006-03-30 Thread Tony Lewis
Curtis Hatter wrote:

> Also any way to add modifiers to the regexs? 

Perhaps --filter=path,i:/path/to/krs would work.

Tony



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Mauro Tortonesi wrote: 

> no. i was talking about regexps. they are more expressive
> and powerful than simple globs. i don't see what's the
> point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases their
expressions will simply work, which will result in significant confusion
when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

--filter:-domain:www-*.yoyodyne.com
--filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat it
as a regular expression. For the vast majority of cases, glob will work just
fine.
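
As a sketch of how small that conversion is (illustrative code, not a proposed
patch for wget):

#include <stdio.h>
#include <string.h>

/* Turn a shell-style glob into an anchored regular expression: '*' becomes
   ".*", '?' becomes ".", and regex metacharacters are escaped. */
static void glob_to_regex (const char *glob, char *out, size_t outsz)
{
  size_t o = 0;
  out[o++] = '^';
  for (const char *p = glob; *p != '\0' && o + 3 < outsz; p++)
    {
      if (*p == '*')
        { out[o++] = '.'; out[o++] = '*'; }
      else if (*p == '?')
        out[o++] = '.';
      else if (strchr (".\\+()[]{}^$|", *p))
        { out[o++] = '\\'; out[o++] = *p; }
      else
        out[o++] = *p;
    }
  out[o++] = '$';
  out[o] = '\0';
}

int main (void)
{
  char re[256];
  glob_to_regex ("www-*.yoyodyne.com", re, sizeof re);
  printf ("%s\n", re);   /* prints ^www-.*\.yoyodyne\.com$ */
  return 0;
}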

One might argue that it's a lot of work to implement regular expressions if
the default input format is a glob, but I think we should aim for both lack
of confusion and robust functionality. Using ",r" means people get regular
expressions when they want them and know what they're doing. The universe of
wget users who "know what they're doing" are mostly subscribed to this
mailing list; the rest of them send us mail saying "please CC me as I'm not
on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from glob
to regular expression should take directory separators into account, such
as:

--filter:-path:path/to/*

becoming the same as:

--filter:-path,r:path/to/[^/]*

or even:

--filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote: 

> But that misses the point, which is that we *want* to make the
> more expressive language, already used elsewhere on Unix, the
> default.

I didn't miss the point at all. I'm trying to make a completely different
one, which is that regular expressions will confuse most users (even if you
tell them that the argument to --filter is a regular expression). This
mailing list will get a huge number of bug reports when users try to use
globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere. The
shell is the most obvious comparison for user input dealing with expressions
that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I just
don't think the same thing can be said for the typical wget user. We've
already had disagreements in this chain about what would match a particular
regular expression; I suspect everyone involved in the conversation could
have correctly predicted what the equivalent glob would do.

I don't think ",r" complicates the command that much. Internally, the only
additional work for supporting both globs and regular expressions is a
function that converts a glob into a regexp when ",r" is not requested.
That's a straightforward transformation.

Tony



RE: regex support RFC

2006-03-31 Thread Tony Lewis
Hrvoje Niksic wrote:

> I don't see a clear line that connects --filter to glob patterns as used
> by the shell.

I want to list all PDFs in the shell, ls -l *.pdf

I want a filter to keep all PDFs, --filter=+file:*.pdf

Note that "*.pdf" is not a valid regular expression even though it's what
most people will try naturally. Perl complains:
/*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests will be
for users who are trying a glob rather than a regular expression.

Tony



RE: download of images linkes in css does not work

2006-04-13 Thread Tony Lewis
It's not a bug; it's a (missing) feature. 
-Original Message-
From: Detlef Girke [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 13, 2006 3:17 AM
To: [EMAIL PROTECTED]
Subject: download of images linkes in css does not work

Hello,
I tried everything, but images built in via CSS are neither downloaded nor
have their links converted by wget.

Example (inline style): CSS terms like

  style="background-image: url(/files/inc/image/pjpeg/hintergrund_startseite.jpg)"

do not have any effect on the downloaded web-page.

The same thing happens, when you write

{background-image : url(/files/inc/image/pjpeg/hintergrund_startseite.jpg);}

into a css-file.

Maybe other references in CSS do not work either. Perhaps you can prove
this.

If you could fix this problem, wget would be the best tool for me.
Thank you and best regards
Detlef


--
Detlef Girke, BIK Hamburg, Beratung, Tests und Workshops c/o DIAS GmbH,
Neuer Pferdemarkt 1, 20359 Hamburg [EMAIL PROTECTED],
www.bik-online.info, 040 43187513,Fax 040 43187519



RE: dose wget auto-convert the downloaded text file?

2006-04-16 Thread Tony Lewis
"18 mao" <[EMAIL PROTECTED]> wrote:

> then  save the page as 2.html with the FireFox browser

You should not assume that the file saved by any browser is the same as the
file delivered to the browser by the server. The browser is probably
manipulating line endings to match the conventions on your operating system
when it saves files so that CR, CR-LF, or LF, all become CR-LF (or whatever
your OS uses for line endings).

Tony



RE: Windows Title Bar

2006-04-18 Thread Tony Lewis
Hrvoje Niksic wrote:

> Anyway, adding further customizations to an already questionnable feature
> is IMHO not a very good idea. 

Perhaps Derek would be happy if there were a way to turn off this
"questionable feature".

Tony



RE: Defining "url" in .wgetrc

2006-04-20 Thread Tony Lewis
ks wrote: 

> Just one more question.
> Something like this inside "somefile.txt"
>
> http://fly.srk.fer.hr/
> -r http://www.gnu.org/ -o gnulog
> -S http://www.lycos.com/

Why not use a batch file or command script (depending on what OS you're
using) containing something like:

wget http://fly.srk.fer.hr
wget -r http://www.gnu.org -o gnulog
wget -S http://www.lycos.com

Tony



RE: wget www.openbc.com post-data/cookie problem

2006-05-04 Thread Tony Lewis
Erich Steinboeck wrote:

> Is there a way to trace the browser traffic and compare
> that to the wget traffic, to see where they differ.

You can use a web proxy. I like Achilles:
http://www.mavensecurity.com/achilles 

Tony



RE: I cannot get the images

2006-05-15 Thread Tony Lewis
The problem is your accept list; -A*.* says to accept any file that contains
at least one dot in the file name and
GetFile?id=DBJOHNUNZIOCSBMOMKRU&convert=image%2Fgif&scale=3 doesn't contain
any dots.

I think you want to accept all files so just delete -A*.* from your argument
list because the default behavior is to accept everything.
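
For example, the command from the quoted message below with -A*.* removed:

wget --cache=off -p -m -erobots=off -t10 -v "http://alo.uibk.ac.at:80/filestore/servlet/GetFile?id=DBJOHNUNZIOCSBMOMKRU&convert=image/gif&scale=3"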

Tony
-Original Message-
From: matis [mailto:[EMAIL PROTECTED] 
Sent: Monday, May 15, 2006 6:09 AM
To: wget@sunsite.dk
Subject: I cannot get the images

Hi,
I'm trying to get a whole directory, but images from the database are ignored.
If you paste the address below this post into the browser (or even flashget)
it will download the image and open it with a default extension of .gif. But
wget says "file should be removed" and then removes it :/ . As a result, when
there's a picture on every page (with an address like the one below), only
empty htmls are downloaded. Does anybody know what to do?

The address (with the wget command used by me):
wget --cache=off -p -m -erobots=off -t10 -v -A*.*
"http://alo.uibk.ac.at:80/filestore/servlet/GetFile?id=DBJOHNUNZIOCSBMOMKRU
&convert=image/gif&scale=3"
whole html address (broken):
http://www.literature.at/webinterface/library/ALO-
BOOK_V01?objid=13017&page=3&zoom=3

regards
matis



RE: Batch files in DOS

2006-06-05 Thread Tony Lewis
I think there is a limit to the number of characters that DOS will accept on
the command line (perhaps around 256). Try putting echo in front of the
command in your batch file and see how much of it gets echoed back to you.
As Tobias suggested, you can try moving some of your command line options
into the .wgetrc file.

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Saturday, June 03, 2006 2:46 PM
To: wget@sunsite.dk
Subject: Batch files in DOS

I'm trying to mirror about 100 servers (small fanfic sites) using wget
--recursive --level=inf -Dblah.com,blah.com,blah.com some_address. However,
when I run the batch file, it stops reading after a while; apparently my
command has too many characters.  Is there some other way I should be doing
this, or a workaround?

GNU Wget 1.10.1 running on Windows 98

-- 

http://www.aericanempire.com/



RE: wget - tracking urls/web crawling

2006-06-22 Thread Tony Lewis
Bruce wrote: 

> if there was a way that i could insert/use some form of a regex to exclude
> urls+querystring that match, then i'd be ok... the pages i need to exclude
> are based on information that's in the query portion of the url...

Work on such a feature has been promised for an upcoming release of wget.

Tony Lewis



RE: wget - tracking urls/web crawling

2006-06-23 Thread Tony Lewis
Bruce wrote: 

> any idea as to who's working on this feature?

Mauro Tortonesi sent out a request for comments to the mailing list on March
29. I don't know whether he has started "working" on the feature or not.

Tony



RE: BUG

2006-07-03 Thread Tony Lewis
Run the command with -d and post the output here.


Tony

_ 

From:   Junior + Suporte [mailto:[EMAIL PROTECTED]] 

Sent:   Monday, July 03, 2006 2:00 PM

To: [EMAIL PROTECTED]

Subject:    BUG


Dear,


I am using wget to send a login request to a site. When wget is saving the cookies, the following error message appears:


Error in Set-Cookie, field `Path'
Syntax error in Set-Cookie: tu=661541|802400391@TERRA.COM.BR;
Expires=Thu, 14-Oct-2055 20:52:46 GMT; Path= at position 78.

Location: http://www.tramauniversitario.com.br/servlet/login.jsp?username=802400

391%40terra.com.br&pass=123qwe&rd=http%3A%2F%2Fwww.tramauniversitario.com.br%2Ft

uv2%2Fenquete%2Fcb%2Fsul%2Farte.jsp [following]


I am trying to access the URL http://www.tramauniversitario.com.br/tuv2/participe/login.jsp?rd=http://www.tramauniversitario.com.br/tuv2/enquete/cb/sul/arte.jsp&[EMAIL PROTECTED]&pass=123qwe&Submit.x=6&Submit.y=1

In Internet Explorer, this URL works correctly and the cookie is saved on the local machine, but in WGET, this cookie returns an error.

Thanks,


Luiz Carlos Zancanella Junior





RE: wget 403 forbidden error when no index.html.

2006-07-07 Thread Tony Lewis
You seriously expected the server to provide wget with a file when it
returned 403 to the browser?

wget must be provided with a valid URL before it can do anything. If you
want to download something from the server, figure out how to retrieve it in
your browser and then provide that URL to wget.

Tony
-Original Message-
From: news [mailto:[EMAIL PROTECTED] On Behalf Of Aditya Joshi
Sent: Friday, July 07, 2006 9:15 AM
To: wget@sunsite.dk
Subject: wget 403 forbidden error when no index.html.


I am trying to download the contents of a specific directory of a site and I
keep getting the 403 forbidden error when I run wget. The directory does not
have an index.html and of course any references to that path result in a 403
page displayed in my browser. Is this why wget is not working? If so, how do I
download the contents of such sites?



RE: wget - Returning URL/Links

2006-07-10 Thread Tony Lewis
Mauro Tortonesi wrote:

> perhaps we should modify wget in order to print the list of "touched"
> URLs as well? maybe only in case -v is given? what do you think?

On June 28, 2005, I submitted a patch to write unfollowed links to a file.
It would be pretty simple to have a similar --followed-links option.

Tony



RE: I got one bug on Mac OS X

2006-07-15 Thread Tony Lewis



I don't think that's valid HTML. According to RFC 1866:

   An HTML user agent should treat end of line in any of its variations
   as a word space in all contexts except preformatted text.

I don't see any provision for end of line within the HREF attribute of an A
tag.
 
Tony


From: HUAZHANG GUO [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 11, 2006 7:48 AM
To: [EMAIL PROTECTED]
Subject: I got one bug on Mac OS X

Dear Sir/Madam,

while I was trying to download using the command:

wget -k -np -r -l inf -E http://dasher.wustl.edu/bio5476/

I got most of the files, but lost some of them.

I think I know where the problem is:

if the link is broken into two lines in the index.html:

Lecture 1 (Jan 17): Exploring Conformational Space for Biomolecules
<a href="http://dasher.wustl.edu/bio5476/lectures
/lecture-01.pdf">[PDF]</a>

I will get the following error message:

--09:13:16--  http://dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf
           => `/Users/hguo/mywww//dasher.wustl.edu/bio5476/lectures%0A/lecture-01.pdf'
Connecting to dasher.wustl.edu[128.252.208.48]:80... connected.
HTTP request sent, awaiting response... 404 Not Found
09:13:16 ERROR 404: Not Found.

Please note that wget adds a special character '%0A' in the URL. Maybe the
Windows newline has one more character which is not recognized by Mac wget.
I am using Mac OS X, Tiger Darwin. Thanks!

RE: I got one bug on Mac OS X

2006-07-16 Thread Tony Lewis
Hrvoje Niksic wrote:

> HTML has been maintained by W3C for many years 

I knew that (but forgot) -- just went to ietf.org out of habit looking for
Internet specifications.

Tony



RE: "referer"

2006-09-19 Thread Tony Lewis
[EMAIL PROTECTED] wrote: 

> Maybe "referer" has become established, but it really is a misspelling

You need to take that up with the Internet Engineering Task Force. For what
it's worth, that misspelling has been a part of HTTP since the early 1990s.

Tony



RE: Annyoing behaviour with --input-file

2006-09-29 Thread Tony Lewis
Speaking of annoying behavior... Did you send this from the future? Perhaps
we can have someone look into this in eight years. :-)

That's how long it's going to be sitting at the top of my wget folder. :-(

Tony
-Original Message-
From: Adam Klobukowski [mailto:[EMAIL PROTECTED] 
Sent: Sunday, July 13, 2014 4:36 AM
To: wget@sunsite.dk
Subject: Annyoing behaviour with --input-file

If wget is used with the --input-file option, it gets a directory listing for
each file specified in the input file (if ftp protocol) before downloading
each file, which is quite annoying if there are a few thousand small files in
the file list and every directory listing is way longer than any file; in
other words, the overhead is too big to be reasonable.

--
Semper Fidelis

Adam Klobukowski
[EMAIL PROTECTED]



RE: wget question (connect multiple times)

2006-10-17 Thread Tony Lewis
A) This is the list for reporting bugs. Questions should go to
wget@sunsite.dk

B) wget does not support "multiple times simultaneously"

C) The decreased per-file download time you're seeing is (probably) because
wget is reusing its connection to the server to download the second file. It
takes some time to set up a connection to the server regardless of whether
you're downloading one byte or one gigabyte of data. For small files, the
set up time can be a significant part of the overall download time.

Hope that helps!

Tony
-Original Message-
From: t u [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 17, 2006 3:50 PM
To: [EMAIL PROTECTED]
Subject: wget question (connect multiple times)


hi,
I hope it is okay to drop a question here.

I recently found that if wget downloads one file, my download speed will be
Y, but if wget downloads two separate files (from the same server, doesn't
matter), the download speed for each of the files will be Y (so my network
speed will go up to 2 x Y).

So my question is, can I make wget download the same file "multiple times
simultaneously"? In a way, it would run as multiple processes and download
parts of the file at the same time, speeding up the download.

Hope I could explain my question, sorry about the bad english.

Thanks

PS. Please consider this as an enhancement request if wget cannot get a file
by downloading parts of it simultaneously.



RE: SI units

2007-01-15 Thread Tony Lewis
Lars Hamren wrote: 

> Download speeds are reported as "K/s", where, I assume, "K" is short for
"kilobytes".
>
> The correct SI prefix for thousand is "k", not "K":
>
>http://physics.nist.gov/cuu/Units/prefixes.html


SI units are for decimal-based numbers (that is powers of 10) whereas
computer programs typically use binary-based numbers (powers of 2). It's
convenient for humans to equate 10^3 (1,000) with 2^10 (1,024) but with
large numbers, these values quickly diverge: 999k or 999 * 10^3 = 999,000,
but 999K or 999 * 2^10 = 1,022,976.

For what it's worth, according to Wikipedia either k or K is acceptable for
1024:
  http://en.wikipedia.org/wiki/Binary_prefix



RE: SI units

2007-01-15 Thread Tony Lewis
Christoph Anton Mitterer wrote: 

> I don't agree with that,.. SI units like K/M/G etc. are specified by
> international standards and those specify them as 10^x.
>
> The IEC defined in IEC 60027 symbols for the use with base 2 (e.g. Ki, Mi,
> Gi)

All of this is described in the Wikipedia article I referenced.

It's true that International Electrotechnical Commission prefers the term
kibibytes and the prefix Ki for 1,024, but it's still not a term commonly
used in computer standards.

Searching ietf.org there are 1,880 matches for kilobytes and only 2 for
kibibytes and those are both feedback from one individual arguing for the
use of kibibytes instead of kilobytes.

Searching gnu.org there are 452 matches for kilobytes and only 5 for
kibibytes, and even then, the following appears:

   `KiB' kibibyte: 2^10 = 1024. `K' is special: the SI prefix is `k' and
   the IEC 60027-2 prefix is `Ki', but tradition and POSIX use `k' to
   mean `KiB'.

It seems odd to me that one would suggest that wget is the place to start
changing the long-established trend of using 'k' for 1,024.



RE: php form

2007-02-21 Thread Tony Lewis
Look for the <form action=... method=...> tag in the page source.
action tells you where the form fields are sent.
 
method tells you if the server is expecting the data to be sent using a GET
or POST command; GET is the default. In the case of GET, the arguments go
into the URL. If method is POST, follow the instructions in the manual.
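
For example (hypothetical script and field names), the same form data sent
each way:

wget "http://www.example.com/search.php?country=US&state=CA"
wget --post-data="country=US&state=CA" "http://www.example.com/search.php"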
 
Hope that helps.
 
Tony

  _  

From: Alan Thomas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 21, 2007 4:39 PM
To: wget@sunsite.dk
Subject: php form


There is a database on a web server (to which I have access) that is
accessible via username/password.  The only way for users to access the
database is to use a form with search criteria and then press a button that
starts a php script that produces a web page with the results of the search.
 
I have a couple of questions:
 
1.  Is there any easy way to know exactly what commands are behind the
button, to duplicate them?
 
2.  If so, then do I just use the POST command as described in the manual,
after logging in (per the manual), to get the data it provides.  
 
I have used wget just a little, but I am completely new to php.  
 
Thanks, Alan
 
 
 


RE: php form

2007-02-22 Thread Tony Lewis
The table stuff just affects what's shown on the user's screen. It's the
input field that affects what goes to the server; in this case, that's
 so you want to post country=US. If there were
multiple fields, you would separate them with ampersands such as
country=US&state=CA.
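
Since you have to log in first, a rough sketch of the whole sequence (the URLs
and field names are guesses, so substitute the real ones from the login and
search forms):

  wget --save-cookies=cookies.txt --keep-session-cookies \
       --post-data="username=alan&password=secret" \
       "http://yourserver.example/login.php" -O login.html
  wget --load-cookies=cookies.txt --post-data="country=US" \
       "http://yourserver.example/search.php" -O results.html

--keep-session-cookies matters because many login systems hand out session
cookies, which wget would otherwise discard between the two runs.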
 
Tony

  _  

From: Alan Thomas [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 22, 2007 5:27 PM
To: Tony Lewis; wget@sunsite.dk
Subject: Re: php form


Tony,
Thanks.  I have to log in with username/password, and I think I
know how to do that with wget using POST.  For the actual search page, the
HTML source says it's: 
 

 
However, I'm not clear on how to convey the data for the search.  
 
The search form has defined a table.  One of the entries, for example, is:
 

  Search by Country:
  

 
If I want to use wget to search for entries in the U.S. ("US"), then how do
I convey this when I post to the php?
 
Thanks, Alan 




RE: how to get images into a new directory/filename hierarchy? [GishPuppy]

2007-02-23 Thread Tony Lewis
If it were me, I'd grab all the files to my local drive and then write
scripts to do the moving and renaming.
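
One way to do it in a single pass, if you have a Unix-style shell available
(Cygwin would do on Windows): keep a plain text mapping file where each line
is a source URL followed by the local path you want, then loop over it. A
rough sketch (file and path names invented):

  while read url path; do
    mkdir -p "$(dirname "$path")"
    wget -O "$path" "$url"
  done < mapping.txt

That sidesteps the problem that -O is not honored inside an --input-file list.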

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 23, 2007 1:33 AM
To: wget@sunsite.dk
Subject: how to get images into a new directory/filename hierarchy?
[GishPuppy]

Hi,

I'm trying to use wget to download 100s of JPGs into a cache server with a
different directory/filename hierarchy. What I tried to do was to create a
text or html file with 1 line for each download (e.g. URL -nd -P [new-path]
-O [new-filename]) and use the --input-file= switch. However, I discovered
that I cannot rename the path/filename of the file inside the input file.

Also, the JPGs will not all come from the same domain but they need to be
placed in a flattened directory tree with different filenames.

Can anyone offer me advice on how to best accomplish this? I'm using the
windows platform.

m.




RE: wget help on file download

2007-03-01 Thread Tony Lewis
The server told wget that it was going to return only 6,720 bytes, and as
text/html rather than a PDF:
 
Content-Length: 6720
Content-Type: text/html; charset=UTF-8
 
In other words, the server handed back a small HTML page (most likely a
login or error page) instead of the file you requested.
  _  

From: Smith, Dewayne R. [mailto:[EMAIL PROTECTED] 
Sent: Thursday, March 01, 2007 8:05 AM
To: [EMAIL PROTECTED]
Subject: wget help on file download


Trying to download a 4 MB file; it only retrieves 6 KB of it.
I've tried without the added --options and it doesn't work.
 
Can you see any issues below?
Thx!
 
 
C:\Backup_CD\WGET>wget -dv -S --no-http-keep-alive  --ignore-length
--secure-protocol=auto --no-check-certificate  https://
server2.csci-va.com/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed71
3ae030f85256f31006eb413/$FILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf
Setting --verbose (verbose) to 1
Setting --server-response (serverresponse) to 1
Setting --http-keep-alive (httpkeepalive) to 0
Setting --ignore-length (ignorelength) to 1
Setting --secure-protocol (secureprotocol) to auto
Setting --check-certificate (checkcertificate) to 0
DEBUG output created by Wget 1.10.2 on Windows.
 
--11:01:08--

https://server2.csci-va.com/siap/siap.nsf/297c783b5c8fa51985256cd700546846/6
5dc9ed713ae030f85256f31006eb413/$F
ILE/TR%202004.018%20AEGIS%20TEST%20PLAN..pdf
   => `TR 2004.018 AEGIS TEST PLAN..pdf.4'
Resolving server2.csci-va.com... seconds 0.00, 65.207.33.26
Caching server2.csci-va.com => 65.207.33.26
Connecting to server2.csci-va.com|65.207.33.26|:443... seconds 0.00,
connected.
Created socket 1932.
Releasing 0x00395228 (new refcount 1).
Initiating SSL handshake.
Handshake successful; connected socket 1932 to SSL handle 0x009318c8
certificate:
  subject: /C=US/O=U.S.
Government/OU=ECA/OU=ORC/OU=CSCI/CN=server2.csci-va.com
  issuer:  /C=US/O=U.S. Government/OU=ECA/OU=Certification
Authorities/CN=ORC ECA
WARNING: Certificate verification error for server2.csci-va.com: self signed
certificate in certificate chain
 
---request begin---
GET
/siap/siap.nsf/297c783b5c8fa51985256cd700546846/65dc9ed713ae030f85256f31006e
b413/$FILE/TR%202004.018%20AEGIS%20TEST%20PL
AN..pdf HTTP/1.0
User-Agent: Wget/1.10.2
Accept: */*
Host: server2.csci-va.com
 
---request end---
HTTP request sent, awaiting response...
---response begin---
HTTP/1.1 200 OK
Server: Lotus-Domino
Date: Thu, 01 Mar 2007 15:57:55 GMT
Connection: close
Expires: Tue, 01 Jan 1980 06:00:00 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 6720
Pragma: no-cache
 
---response end---
 
  HTTP/1.1 200 OK
  Server: Lotus-Domino
  Date: Thu, 01 Mar 2007 15:57:55 GMT
  Connection: close
  Expires: Tue, 01 Jan 1980 06:00:00 GMT
  Content-Type: text/html; charset=UTF-8
  Content-Length: 6720
  Pragma: no-cache
Length: ignored [text/html]
 
[ <=>
] 6,720 --.--K/s
 
Closed 1932/SSL 0x9318c8
11:01:08 (309.48 KB/s) - `TR 2004.018 AEGIS TEST PLAN..pdf.4' saved [6720]
 

C:\Backup_CD\WGET>
 
Dewayne R. Smith 

SPAWAR Systems Center Charleston 

Code 613, Special Projects Branch 

Office (843) 218-4393

Mobile (843) 696-9472

 


RE: Huh?...NXDOMAINS

2007-03-23 Thread Tony Lewis
Bruce <[EMAIL PROTECTED]> wrote:

> the hostname 'ga13.gamesarena.com.au' resolves back to an NX domain 

"NXDOMAIN" is short hand for non-existent domain. It means the domain name
system doesn't know the IP address of the domain. (It would be like me
having a non-published telephone number; if you know my number, you can call
me, but it won't do you any good to call directory assistance because they
can't tell you my number.)
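
(If you want to see exactly what public DNS says, a quick check from a
Unix-like box is something like

  host ga13.gamesarena.com.au

or the equivalent nslookup query; an NXDOMAIN answer there confirms the name
really has no public address record.)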

If your web browser is able to find the site then it should be possible for
wget to find it too. But, since it's not a straightforward DNS lookup,
you'll have to figure out how your browser is pulling off the magic.

One way to do that is to run with a local proxy (such as Achilles) and study
what happens between your browser and the server. If you compare that with
the debug output of wget, you'll have an idea of where the flow is different
and what wget might do to make it work.

I'm sure someone can point out open-source options for the proxy. :-)

Have fun exploring.

Tony



FW: think you have a bug in CSS processing

2007-03-31 Thread Tony Lewis
From: Neil Smithline [mailto:[EMAIL PROTECTED] 
Sent: Saturday, March 31, 2007 9:44 PM
To: Tony Lewis
Subject: Re: think you have a bug in CSS processing

 

Oh - well if you don't support CSS processing then I guess I am making a new
request. I'm also suggesting a clarification to your documentation so that
this is clear.

-r --recursive

Recursive web-suck. According to the protocol of the URL, this can mean two
things. Recursive retrieval of a HTTP URL means that Wget will download the
URL you want, parse it as an HTML document (if an HTML document it is), and
retrieve the files this document is referring to, down to a certain depth
(default 5; change it with -l). Wget will create a hierarchy of directories
locally, corresponding to the one found on the HTTP server. 


At least at first glance, this seems to mean that the URL in the CSS
portion should be translated and downloaded. When giving it some thought I
think a valid argument could be made that the string in the CSS document is
not exactly an URL but it is certainly URL-like. I think there should be
some explicit documentation stating what is not covered. 

 

- Neil
 

On 3/31/07, Tony Lewis <[EMAIL PROTECTED]> wrote:

[EMAIL PROTECTED] wrote:

 

> I think I found a bug in CSS processing.

I think you're making a new feature request (and one that we've seen before)
to ADD processing for CSS.

Tony

 

 



RE: Cannot write to auto-generated file name

2007-04-03 Thread Tony Lewis
Vitaly Lomov wrote:

> It's a file system issue on windows: file path length is limited to
> 259 chars.

In which case, wget should do something reasonable (generate an error
message, truncate the file name, etc.). It shouldn't be left as an exercise
for the user to figure out that the automatically generated name cannot be
used by the OS. (My vote is to truncate the name, but it's a lot easier to
generate an error message.)
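
In the meantime, the workaround is to choose the output name yourself so the
auto-generated one never comes into play; for example (URL shortened here):

  wget -O saved-page.html "http://example.com/some/very/long/generated/url?with=many&query=args"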

Tony



RE: Suggesting Feature: Download anything newer than...

2007-04-07 Thread Tony Lewis
I don't think there is such a feature, but if you're going to add
--not-before, you might as well add --not-after too.

Tony
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Saturday, April 07, 2007 6:27 PM
To: wget@sunsite.dk
Subject: Suggesting Feature: Download anything newer than...

I'm a very frequent user of wget, but must admit I haven't
dived too deep into various options - but as far as I can
tell, what I'm about to suggest is not a current feature.
If it is, can somebody tell me how to access it?  0:-)

What I'm suggesting is something similar to -N (check
timestamp and download newer) and may perhaps be used more
as a modifier to -N than a separate option.

I occasionally make a mirror of a certain site with wget, and
then throw it into an archive.  Unfortunately, a few months
(or a year) later when I want to catch up with any updates, I either
have to mirror the whole thing again or locate the old archive
and unpack it (and I haven't necessarily preserved the whole
directory structure).

What I would love was the ability to specify (through an option)
an arbitrary timestamp (a date... and perhaps time), and for
only files created/modified after this time to be downloaded (e.g.
the approximate time for the creation of my latest archive).

I envision it as based on the -N option, except that rather
than looking at the timestamp - or the size, or even the
existence - of a local file, it would only compare the remote file's
timestamp to the supplied timestamp - and download if the remote
file was newer.  Of course, it would probably be a h*** of a lot worse
to program than just rewriting the -N option.  :-)

It would have to parse links in HTML-files (HTML) or traverse
directories (FTP).

Usually it would be used when no local mirror existed, and then
creating a mirror of just files made after a certain time (it would
of course have to create a dir-structure containing directories
also older than the specified time, but no older files).  However
being able to use it (a specified time) together with the -N or
--mirror option, may also be useful when updating a local mirror
(though I can't actually see when); so perhaps it should be an option
to be used as a *companion* to -N (rather than instead of -N)... or
at least let it be *possible* to use it together with -N and --mirror
as well as by itself.

-Koppe



RE: FW: think you have a bug in CSS processing

2007-04-13 Thread Tony Lewis
J.F.Groff wrote:

> Amazingly I found this feature request in a 2003 message to this very
> mailing list. Are there only a few lunatics like me who think this should
> be included?

Wget is written and maintained by volunteers. What you need to find is a
lunatic willing to volunteer to write the code to support this feature
request.

Tony



RE: timestamping not working when -O option is in path/filename format

2007-04-22 Thread Tony Lewis
n g wrote:

> wget url -O dir/name -N
> would download the same file every run.
> 
> while
> wget url -O name -N
> works as expected.

Timestamping compares the timestamp on the local file with the timestamp on
the server. When you use -O, the timestamp on the local file is the time the
file was downloaded (not the time from the Last-Modified header).
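
A workaround that keeps -N useful is to let wget keep the server's file name
and simply run it from the target directory, e.g. (made-up URL):

  (cd dir && wget -N "http://example.com/path/name")

Renaming the file afterwards would defeat the comparison, because -N needs a
local file with the same name (and the server's timestamp) to check against.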

Tony



RE: sending Post Data and files

2007-05-09 Thread Tony Lewis
Lara Röpnack wrote:

 

> 1.) How can I send Post Data with Line Breaks? I can not press enter

> and \n or \r or \r\n dont work...

 

You don't need a line break because parameters are separated by ampersands:
a=1&b=2


> 2.) I dont understand the post File. I can Send one file - but I cant give

> the name. Normaly I have a Form with a Formelement Input type=file

> name=xy Is it possible to send a File with a name? Is it possible to send

> two files? 

 

On the command line you can use --post-data="a=1&b=2" or you can put the
data into a file. For example, if the file "foo" contains the following
string:

a=1&b=2

you would use --post-file=foo.

 

Currently, it is not possible to send files with wget. It does not support
multipart/form-data.

 

Tony



RE: wget bug

2007-05-24 Thread Tony Lewis
Highlord Ares wrote:

 

> it tries to download web pages named similar to

> http://site.com?variable=yes&mode=awesome

 

Since "&" is a reserved character in many command shells, you need to quote
the URL on the command line:

 

wget " 
http://site.com?variable=yes&mode=awesome";

 

Tony

 



RE: Suppressing DNS lookups when using wget, forcing specific IP address

2007-06-18 Thread Tony Lewis
Try: wget http://ip.of.new.sitename --header="Host: sitename.com" --mirror

For example: wget http://66.233.187.99 --header="Host: google.com" --mirror

Tony
-Original Message-
From: Kelly Jones [mailto:[EMAIL PROTECTED] 
Sent: Sunday, June 17, 2007 6:10 PM
To: wget@sunsite.dk
Subject: Suppressing DNS lookups when using wget, forcing specific IP
address

I'm moving a site from one server to another, and want to use "wget
-m" combined w/ "diff -auwr" to help make sure the site looks the same
on both servers.

My problem: "wget -m sitename.com" always downloads the site at its
*current* IP address. Can I tell wget: "download sitename.com, but
pretend the IP address of sitename.com is ip.address.of.new.server
instead of ip.address.of.old.server. In other words, suppress the DNS
lookup for sitename.com and force it to use a given IP address.

I've considered kludges like using "old.sitename.com" vs
"new.sitename.com", editing "/etc/hosts", using a proxy server, etc,
but I'm wondering if there's a clean solution here?

-- 
We're just a Bunch Of Regular Guys, a collective group that's trying
to understand and assimilate technology. We feel that resistance to
new ideas and technology is unwise and ultimately futile.



RE: Question on wget upload/dload usage

2007-06-18 Thread Tony Lewis
Joe Kopra wrote:

 

> The wget statement looks like:
>
> wget --post-file=serverdata.mup -o postlog -O survey.html
>   http://www14.software.ibm.com/webapp/set2/mds/mds

--post-file does not work the way you want it to; it expects a text file
that contains something like this:

a=1&b=2

and it sends that raw text to the server in a POST request using a
Content-Type of application/x-www-form-urlencoded. If you run it with -d,
you will see something like this:

POST /someurl HTTP/1.0
User-Agent: Wget/1.10
Accept: */*
Host: www.exelana.com
Connection: Keep-Alive
Content-Type: application/x-www-form-urlencoded
Content-Length: 7

---request end---
[writing POST file data ... done]

To post a file as an argument, you need a Content-Type of
multipart/form-data, which wget does not currently support.

 

Tony



RE: bug and "patch": blank spaces in filenames causes looping

2007-07-05 Thread Tony Lewis
There is a buffer overflow in the following line of the proposed code:

 sprintf(filecopy, "\"%.2047s\"", file);

It should be:

 sprintf(filecopy, "\"%.2045s\"", file);

in order to leave room for the two quotes.

Tony
-Original Message-
From: Rich Cook [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 04, 2007 10:18 AM
To: [EMAIL PROTECTED]
Subject: bug and "patch": blank spaces in filenames causes looping

On OS X, if a filename on the FTP server contains spaces, and the  
remote copy of the file is newer than the local, then wget gets  
thrown into a loop of "No such file or directory" endlessly.   I have  
changed the following in ftp-simple.c, and this fixes the error.
Sorry, I don't know how to use the proper patch formatting, but it  
should be clear.

==
the beginning of ftp_retr:
=
/* Sends RETR command to the FTP server.  */
uerr_t
ftp_retr (int csock, const char *file)
{
   char *request, *respline;
   int nwritten;
   uerr_t err;

   /* Send RETR request.  */
   request = ftp_request ("RETR", file);

==
becomes:
==
/* Sends RETR command to the FTP server.  */
uerr_t
ftp_retr (int csock, const char *file)
{
   char *request, *respline;
   int nwritten;
   uerr_t err;
   char filecopy[2048];
   if (file[0] != '"') {
 sprintf(filecopy, "\"%.2047s\"", file);
   } else {
 strncpy(filecopy, file, 2047);
   }

   /* Send RETR request.  */
   request = ftp_request ("RETR", filecopy);






--
Rich "wealthychef" Cook
925-784-3077
--
  it takes many small steps to climb a mountain, but the view gets  
better all the time.



wget on gnu.org: Report a Bug

2007-07-07 Thread Tony Lewis
The "Report a Bug" section of http://www.gnu.org/software/wget/ should
encourage submitters to send as much relevant information as possible
including wget version, operating system, and command line. The submitter
should also either send or at least save a copy of the --debug output.

 

Perhaps we need a --bug option for the command line that runs the command
and saves important information in a file that can be submitted along with
the bug report. The saved information would have to be sanitized to remove
things like user IDs and passwords but could include things like the wget
version, command line options, and what the command tried to do.

 

Tony

 



wget on gnu.org: error on Development page

2007-07-07 Thread Tony Lewis
On http://www.gnu.org/software/wget/wgetdev.html, step 1 of the summary is:

1.  Change to the topmost GNU Wget directory:
%  cd wget 

But you need to cd to either wget/trunk or the appropriate version
subdirectory of wget/branches.



RE: wget on gnu.org: error on Development page

2007-07-07 Thread Tony Lewis
Micah Cowen wrote:

> Actually, the wget directory is the trunk in that example, since it was
> checked out with
> 
> $ svn co svn://addictivecode.org/wget/trunk wget

Checking out the code using "trunk" is only one of three examples. I used
the third example, checking out the entire source code repository. (Don't
ask me why I thought I might need it.)

If it had read:

Change to the topmost GNU Wget directory in the branch you want to build;
for example:
%  cd wget

then I wouldn't have submitted the comment. I suspect most people who get to
the point of checking out code can figure it out, but...

Tony



RE: wget on gnu.org: Report a Bug

2007-07-07 Thread Tony Lewis
Micah Cowan wrote:

> This information is currently in the bug submitting form at Savannah:

That looks good.

> I think perhaps such things as the wget version and operating system
> ought to be emitted by default anyway (except when -q is given).

I'm not convinced that wget should ordinarily emit the operating system. It's 
really only useful to someone other than the person running the command.

> Other than that, what kinds of things would --bug provide above and
> beyond --debug?

It should echo the command line and the contents of .wgetrc to the bug output, 
which even the --debug option does not do. Perhaps we will think of other 
things to include in the output if this option gets added.

However, the big difference would be where the output was directed. When 
invoked as:
wget ... --bug bug_report

all interesting (but sanitized) information would be written to the file 
bug_report whether or not the command included --debug, which would also direct 
the debugging output to STDOUT.

The main reason I had for suggesting this option is that it would be easy to 
tell newbies with problems to run the exact same command with "--bug 
bug_report" and send the file bug_report to the list (or to whomever is working 
on the problem). The user wouldn't see the command behave any differently, but 
we'd have the information we need to investigate the report.

It might even be that most of us would choose to run with --bug most of the 
time relying on the normal wget output except when something appears to have 
gone wrong and then checking the file when it does.

Tony



RE: wget on gnu.org: error on Development page

2007-07-07 Thread Tony Lewis
Micah Cowan wrote:

> Done. Lemme know if that works for you.

Looks good




RE: 1.11 Release Date: 15 Sept

2007-07-12 Thread Tony Lewis
Noèl Köthe wrote:

> A switch to the new GPL v3 is a not so small change and like samba
> (3.0.x -> 3.2) would imho be a good reason for wget 1.2 so everybody
> sees something bigger changed.

There already was a version 1.2 (although the program was called geturl at that 
time).

The numbering scheme could probably use a facelift. Perhaps when we transition to 
2.0, we can add a third digit.

Tony



RE: Maximum 20 Redirections HELP!!!

2007-07-16 Thread Tony Lewis
Josh Williams wrote:

> Let me know how it turns out. The only "testing" I did on it was
> checking to make sure my code compiled; I haven't actually tried the
> option.

That's the only testing a developer is *supposed* to do. Everything else is
QA's job!

;-)

Just forward the patch to [EMAIL PROTECTED] and let them test it. :-)

Tony



RE: Maximum 20 Redirections HELP!!!

2007-07-16 Thread Tony Lewis
Josh Williams wrote:

> Hmm. .org, maybe?

LOL. Do you know how many kewl domain names I had to go through before I
found one that didn't actually exist? Close to a dozen.

Tony



RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote:

> The manpage doesn't need to give as detailed explanations as the info
> manual (though, as it's auto-generated from the info manual, this could
> be hard to avoid); but it should fully describe essential features.

I can't see any good reason for one set of documentation to be different than 
another. Let the user choose whatever is comfortable. Some users may not even 
know they have a choice between man and info.

> While we're on the subject: should we explicitly warn about using such
> features as robots=off, and --user-agent? And what should those warnings
> be? Something like, "Use of this feature may help you download files
> from which wget would otherwise be blocked, but it's kind of sneaky, and
> web site administrators may get upset and block your IP address if they
> discover you using it"?

No, I don't think we should nor do I think use of those features is "sneaky".

With regard to robots.txt, people use it when they don't want *automated* 
spiders crawling through their sites. A well-crafted wget command that 
downloads selected information from a site without regard to the robots.txt 
restrictions is a very different situation. It's true that someone could 
--mirror the site while ignoring robots.txt, but even that is legitimate in 
many cases.

With regard to user agent, many websites customize their output based on the 
browser that is displaying the page. If one does not set user agent to match 
their browser, the retrieved content may be very different than what was 
displayed in the browser.

All that being said, it wouldn't hurt to have a section in the documentation on 
wget etiquette: think carefully about ignoring robots.txt, use --wait to 
throttle the download if it will be lengthy, etc.

Perhaps we can even add a --be-nice option similar to --mirror that adjusts 
options to match the etiquette suggestions.
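
As a rough illustration of what those suggestions amount to in practice (the
numbers are only examples):

  wget --wait=2 --random-wait --limit-rate=50k -m http://example.com/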

Tony



RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote:

> Don't we already follow typical etiquette by default? Or do you mean
> that to override non-default settings in the rcfile or whatnot?

We don't automatically use a --wait time between requests. I'm not sure what 
other "nice" options we'd want to make easily available, but there are probably 
more.

Tony



RE: Problem with combinations of the -O , -p, and -k parameters in wget

2007-07-23 Thread Tony Lewis
Michiel de Boer wrote:

> Is there another way though to achieve the same thing?

You can always run wget and then rename the file afterward. If this happens
often, you might want to write a shell script to handle it. Of course, if you
want all the references to the file to be converted, the script will be a
little more complicated. :-)
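
For instance, something along these lines (names invented for the example,
and GNU sed assumed for the in-place edit):

  wget -p -k "http://example.com/page.html"
  mv example.com/page.html example.com/mypage.html
  # optionally fix up any references in the other downloaded files:
  grep -rl "page.html" example.com | xargs sed -i "s/page\.html/mypage.html/g"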

Tony



RE: Overview of the wget source code (command line options)

2007-07-24 Thread Tony Lewis
Himanshu Gupta wrote:

 

> Thanks Josh and Micah for your inputs.

 

In addition to whatever Josh and Micah told you, let me add the information
that follows. More than once I have had to relearn how wget deals with
command line options. The last time I did so, I created the HOWTO that
appears below (comments about this information from those in the know on
this list are welcome). I'm happy to collect any other topics that people
want to submit and add them to the file. Perhaps Micah will even be willing
to add it to the repository. :-)

 

By the way, if your mail reader throws away line breaks, you will want to
restore them. --Tony

 

To find out what a command line option does:

  Look in src/main.c in the option_data array for the string that corresponds
  to the command line option; the entries are of the form:

    { "option", 'O', TYPE, "data", argtype },

  where you're searching for "option".

  If TYPE is OPT_BOOLEAN or OPT_VALUE:

    Note the value of "data". Then look at init.c at the commands array for
    an entry that starts with the same data. These lines are of the form:

      { "data", &opt.variable, cmd_TYPE },

    The corresponding line will tell you what variable gets set when that
    option is selected. Now use grep or some other search tool to find out
    where the variable is referenced.

    For example, the --accept option sets the value of opt.accepts, which is
    referenced in ftp.c and utils.c.

  If the TYPE is anything else:

    Look to see how main.c handles that TYPE.

    For example, OPT__APPEND_OUTPUT sets the option named "logfile" and then
    sets the variable append_to_log to true. Searching for append_to_log
    shows that it is only used in main.c. Checking init.c (as described
    above) for the option "logfile" shows that it sets the value of
    opt.lfilename, which is referenced in mswindows.c, progress.c, and
    utils.c.


To add a new command line option:

  The simplest approach is to find an existing option that is close to what
  you want to accomplish and mirror it. You will need to edit the following
  files as described.

  src/main.c
    Add a line to the option_data array in the following format:

      { "option", 'O', TYPE, "data", argtype },

    where:
      option   is the long name to be accepted from the command line
      O        is the short name (one character) to be accepted from the
               command line, or '' if there is no short name; the short name
               must only be assigned to one option. Also, there are very
               few short names available and the maintainers are not
               inclined to give them out unless the option is likely to
               be used frequently.
      TYPE     is one of the following standard options:
                 OPT_VALUE    on the command line, the option must be
                              followed by a value that will be stored
                              ?somewhere?
                 OPT_BOOLEAN  the option is a boolean value that may appear
                              on the command line as --option for true
                              or --no-option for false
                 OPT_FUNCALL  an internal function will be invoked if the
                              option is selected on the command line
               Note: If one of these choices won't work for your option,
               you can add a new OPT__XXX value to the enum list
               and add special code to handle it in src/main.c.
      data     For OPT_VALUE and OPT_BOOLEAN, the "name" assigned to the
               option in the commands array defined in src/init.c (see
               below). For OPT_FUNCALL, a pointer to the function to be
               invoked.
      argtype  For OPT_VALUE and OPT_BOOLEAN, use -1. For OPT_FUNCALL use
               no_argument.

    NOTE: The options *must* appear in alphabetical order because a binary
    search is used for the list.

  src/main.c
    Add the help string to function print_help as follows:

      N_("\
        -O,  --option              does something nifty.\n"),

    If the short name is '', put spaces in place of "-O,".

    Select a reasonable place to add the text into the help output in one
    of the existing groups of options: Startup, Logging and input file,
    Download, Directories, HTTP options, HTTPS (SSL/TLS) options,
    FTP options, Recursive download, or Recursive accept/reject.

  src/options.h
    Define the variable to receive the value of the option in the options
    structure.

  src/init.c
    Add a line to the commands array in the following format:

      { "data", &opt.variable, cmd_TYPE },

    where:
      data      matches the "data" string you entered above in the
                option_data array in src/main.c
      variable  is the

RE: VMS support/getpass [Re: Gnulib getpass, and wget password prompting]

2007-08-10 Thread Tony Lewis
Steven M. Schweda wrote:

> I suppose that that wouldn't be _so_ terrible.  I would probably have
> made the general/OS-specific split somewhere else

I like the approach Micah is taking to password prompting. No matter what he
did, you'd likely have to submit a patch to customize password prompting for
VMS. If you submit a patch for a GNU library (and that project accepts it)
then that fix will become available for every future GNU application that is
released on VMS.

> I'll relax and wait for things to deteriorate.

:-)

I think you should give Micah the benefit of the doubt. He seems to be
bending over backwards to "do the right thing" for wget and for wget's users
(including those of us who dabble at changing it).

Tony




RE: wget url with hash # issue

2007-09-06 Thread Tony Lewis
Micah Cowan wrote:

> If you mean that you want Wget to find any file that matches that
> wildcard, well no: Wget can do that for FTP, which supports directory
> listings; it can't do that for HTTP, which has no means for listing
> files in a "directory" (unless it has been extended, for example with
> WebDAV, to do so).

Seems to me that is a big "unless" because we've all seen lots of websites
that have http directory listings. Apache will do it out of the box (and by
default) if there is no index.htm[l] file in the directory.

Perhaps we could have a feature to grab all or some of the files in a HTTP
directory listing. Maybe something like this could be made to work:

wget http://www.exelana.com/images/mc*.gif

Perhaps we would need an option such as --http-directory (the first thing
that came to mind, but not necessarily the most intuitive name for the
option) to explicitly tell wget how it is expected to behave. Or perhaps it
can just try stripping the filename when doing an http request and wildcards
are specified.

At any rate (with or without the command line option), wget would retrieve
http://www.exelana.com/images/ and then retrieve any links where the target
matches mc*.gif.
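
Until then, you can often get close with the existing accept list, assuming
the server really does generate an index page for the directory:

  wget -r -l1 --no-parent -A "mc*.gif" http://www.exelana.com/images/

wget fetches the listing, follows its links, and only keeps the files that
match the pattern.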

If wget is going to explicitly support HTTP directory listings, it probably
needs to be intelligent enough to ignore the sorting options. In the case of
Apache, that would be the column-sorting links such as Name, Last modified,
and Size.

Anyone have any idea how many different http directory listing formats are
out there?

Tony



RE: Wget on Mercurial!

2007-09-20 Thread Tony Lewis
Micah Cowan wrote:

> As I see it, the biggest concern over using git would be multiplatform
> support. AFAICT, git has a great developer community, but a rather
> Linux-focused one. And while, yes, there is Win32 support, the
> impression I have is that it significantly lags Unix/Linux support.
> Mozilla rejected it early on due to this conclusion

The Mozilla community (with a large base of Win32 programmers) rejected an
open-source package that met their needs better than other packages because
it didn't have good enough Win32 support? Why didn't they just add in the
Win32 support so that the rest of the world that cares about Win32 support
could benefit from it?

> git vs hg.

It's a good thing I remember a little high-school chemistry or I'd have no
idea what that meant. I'm assuming "hg" == "Mercurial", but shouldn't it be
hgial? :-)

Tony



RE: wget + dowbloading AV signature files

2007-09-22 Thread Tony Lewis
Gerard Seibert wrote:

> Is it possible for wget to compare the file named "AV.hdb"
> located in one directory, and if it is older than the AV.hdb.gz file
> located on the remote server, to download the "AV.hdb.gz" file to the
> temporary directory?

No, you can only get wget to compare a file of the same name between your
local system and the remote server.

> The only option I have come up with is to keep a copy of the "gz" file
> in the temporary directory and run wget from there.

You will need to keep the original "gz" file with a timestamp matching the
server in order for wget to know that the file you have is the same as the
one on the server.

> Unfortunately, at least as far as I can tell, wget does not issue an
> exit code if it has downloaded a newer file.

Better exit codes is on the wish list.

> It would really be nice though if wget simply issued an exit code if
> an updated file were downloaded.

Yes, it would.

> Therefore, I am unable to craft a script that will unpack the file,
> test and install it if a newer version has been downloaded.

Keep one directory that matches the server and another one (or perhaps two)
where you process new files. Before and after wget runs, you can check the
dates on the directory that matches the server. You only need to process
files that changed.
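
A rough sketch of that idea in shell (the directory names and URL are
placeholders):

  cd /var/mirror/av                     # directory kept in sync with the server
  wget -N http://example.com/AV.hdb.gz
  if [ AV.hdb.gz -nt .last-processed ]; then
    gunzip -c AV.hdb.gz > /tmp/AV.hdb   # unpack the fresh copy for testing
    touch .last-processed
  fi

The -nt test just asks whether the mirrored file is newer than the marker
file left behind by the previous run.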

Hope that helps.

Tony




RE: working on patch to limit to "percent of bandwidth"

2007-10-10 Thread Tony Lewis
Hrvoje Niksic wrote:

> Measuring initial bandwidth is simply insufficient to decide what
> bandwidth is really appropriate for Wget; only the user can know
> that, and that's what --limit-rate does.

The user might be able to make a reasonable guess as to the download rate if
wget reported its average rate at the end of a session. That way the user
can collect rates over time and try to give --limit-rate a reasonable value.

Tony



RE: Recursive downloading and post

2007-10-22 Thread Tony Lewis
Micah Cowan wrote

> Stuart Moore wrote:
> > Is there any way to get wget to only use the post data for the first
> > file downloaded?
>
> Unfortunately, I'm not sure I can offer much help. AFAICT, --post-file
> and --post-data weren't really designed for use with recursive
> downloading.

Perhaps not, but I can't imagine that there is any scenario where the POST
data should legitimately be sent for anything other than the URL(s) on the
command line.

I'd vote for this being flagged as a bug.

Tony



RE: Thoughts on Wget 1.x, 2.0 (*LONG!*)

2007-11-02 Thread Tony Lewis
Micah Cowan wrote:

> Keeping a single Wget and using runtime libraries (which we were terming
> "plugins") was actually the original concept (there's mention of this in
> the first post of this thread, actually); the issue is that there are
> core bits of functionality (such as the multi-stream support) that are
> too intrinsic to separate into loadable modules, and that, to be done
> properly (and with a minimum of maintenance commitment) would also
> depend on other libraries (that is, doing asynchronous I/O wouldn't
> technically require the use of other libraries, but it can be a lot of
> work to do efficiently and portably across OSses, and there are already
> Free libraries to do that for us).

Perhaps both versions can include multi-threaded support in their core version, 
but the lite version would never invoke multi-threading.

Tony



RE: .1, .2 before suffix rather than after

2007-11-16 Thread Tony Lewis
Hrvoje Niksic wrote:

> > And how is .tar.gz renamed?  .tar-1.gz?
>
> Ouch.

OK. I'm responding to the chain and not Hrvoje's expression of pain. :-)

What if we changed the semantics of --no-clobber so the user could specify
the behavior? I'm thinking it could accept the following strings:
- after: append a number after the file name (current behavior)
- before: insert a number before the suffix
- new: change name of new file (current behavior)
- old: change name of old file

With this scheme --no-clobber becomes equivalent to --no-clobber=after,new.
If I want to change where the number appears in the file name or have the
old file renamed then I can specify the behavior I want on the command line
(or in .wgetrc). I think I would change my default to
--no-clobber=before,old.

I think it would be useful to have semantics in .wgetrc where I specify what
I want my --no-clobber default to be without that meaning I want
--no-clobber processing on each invocation. It would be nice if I could say
that I want my default to be "before,old", but to only have that apply when
I specify --no-clobber on the command line.

Back to the painful point at the start of this note, I think we treat
".tar.gz" as a suffix and if --no-clobber=before is specified, the file name
becomes ".1.tar.gz".

Tony



RE: Skip certain includes

2008-01-24 Thread Tony Lewis
Wayne Connolly wrote:

 

> Thanks mate- i know we chatted on IRC but just thought someone

> else may be able to provide some insight.

 

OK. Here's some insight: wget is essentially a web browser. If the URL
starts with "http", then wget sees the exact same content as Internet
Explorer, Firefox, and Opera (except in cases where the server customizes
its content to the user agent - in those cases you may have to tweak the
user agent to see the same content).

 

If the files are visible to FTP, then try using wget with a URL starting
with "ftp" instead.  Otherwise, if you want to mirror the files as they
appear on the server, you will have to use something like scp to transfer
the files directly from Server A to Server B.
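
For example, something along these lines (host names and paths invented):

  wget -m "ftp://serverA.example/path/to/files/"
  scp -r user@serverA.example:/path/to/files/ /destination/on/serverB/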

 

Tony

 



RE: Not all files downloaded for a web site

2008-01-27 Thread Tony Lewis
Matthias Vill wrote:

> Alexandru Tudor Constantinescu wrote:
> > I have the feeling wget is not really able to figure out which files
> > to download from some web sites, when css files are used.

> That's right. Up until wget 1.11 (released yesterday) there is no
> support for CSS files, in the matter of parsing links out of them.
> Therefore wget will download the CSS file, but not any file referenced
> only there.

According to Micah's "Future of Wget" email, CSS support is planned for
1.12. He wrote:

> 1.12
> - 
>  Support for parsing links from CSS.
[snip]
> The really big deal here, to me, is CSS. I want to have CSS support for
> Wget ASAP. It's an essential part of the Web, and users definitely
> suffer for the lack of support for it.

Tony



RE: retrieval of data from a database

2008-06-10 Thread Tony Lewis
Saint Xavier wrote:

> Well, you'd better escape the '&' in your shell (\&)

It's probably easier to just put quotes around the entire URL than to try to
find all the special characters and put backslashes in front of them.

Tony




RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
Micah Cowan wrote:

> Unfortunately, nothing really comes to mind. If you'd like, you could
> file a feature request at
> https://savannah.gnu.org/bugs/?func=additem&group=wget, for an option
> asking Wget to treat URLs case-insensitively.

To have the effect that Allan seeks, I think the option would have to convert 
all URIs to lower case at an appropriate point in the process. I think you 
probably want to send the original case to the server (just in case it really 
does matter to the server). If you're going to treat different case URIs as 
matching then the lower-case version will have to be stored in the hash. The 
most important part (from the perspective that Allan voices) is that the 
versions written to disk use lower case characters.

Tony



RE: Wget 1.11.3 - case sensetivity and URLs

2008-06-13 Thread Tony Lewis
mm w wrote:

> standard: URLs are case-insensitive
>
> you can adapt your software because some people don't respect the standard,
> we are not in the '90s anymore; let people doing crappy things deal with
> their crappy world

You obviously missed the point of the original posting: how can one 
conveniently mirror a site whose server uses case insensitive names onto a 
server that uses case sensitive names.

If the original site has the URI strings "/dir/file", "/dir/File", "/Dir/file", 
and "/Dir/File", the same local file will be returned. However, wget will treat 
those as unique directories and files and you wind up with four copies.

Allan asked if there is a way to have wget just create one copy and proposed 
one way that might accomplish that goal.

Tony


