Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"
> The problem was that that link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> instead of being properly converted to:
> http://mineraly.feedle.com/Ftp/UpLoad/
Or, in fact, wget's default:
http://mineraly.feedle.com/Ftp/UpLoad/index.html

> was left like this on the main mirror page:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> and hence while clicking on it:
> "Not Found
> The requested URL /Mineraly/Ftp/UpLoad/index.html was not found on this 
> server."

Yup. So I assume that the problem you see is not that of wget mirroring, but
a combination of saving to a custom dir (with --cut-dirs and the like) and
conversion of the links. Obviously, the link to
http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html which would be
correct for a standard "wget -m URL" was carried over while the custom link
to http://mineraly.feedle.com/Ftp/UpLoad/index.html was not created.
My test with wget 1.5 was just a simple "wget15 -m -np URL" and it worked.
So maybe the convert/rename problem/bug was solved in 1.9.1.
This would also explain the "missing" gif file, I think.

Jens





Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"
> > Wget saves a mirror to your harddisk. Therefore, it cannot rely on an
> apache
> > server generating a directory listing. Thus, it created an index.html as
> 
> Apparently you have not tried to open that link, 
Which link? The non-working one on your incorrect mirror or the working one
on my correct mirror on my HDD?

> got it now?
No need to get snappy, Andrzej.

From your other mail:
> No, you did not understand. I run wget on remote machines. 
Ah! Sorry, missed that.

> The problem is 
> solved though by running the 1.9.1 wget version.
I am still wondering, because even wget 1.5 correctly generates the
index.html from the server output when called on my local box.
I really do not know what is happening on your remote machine, but my wget
1.5 is able to mirror the site. It creates the
Mineraly/Ftp/UpLoad/index.html file and the correct link to it. 
I understand that it is not what you want (having an index.html), but wget
1.5 creates a working mirror - as it is supposed to do.

CU
Jens







Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"
Do I understand correctly that the mirror at feedle is created by you and
wget?

> > Yes, because this is in th HTML file itself:
> > "http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html";
> > It does not work in a browser, so why should it work in wget?
> It works in the browser:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> The is no index.html and the content of the directory is displayed.
I assume I was confused by the different sites you wrote about. I was sure
that both included the same link to ...index.html and the same gif-address.

> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> The link was not converted properly, it should be:
> http://mineraly.feedle.com/Ftp/UpLoad/
> and it should be without any index.html, because there is none in the 
> original.
Wget saves a mirror to your hard disk. Therefore, it cannot rely on an Apache
server generating a directory listing. Thus, it created an index.html, as
Tony Lewis explained. Now, _you_ uploaded (if I understood correctly) the
copy from your HDD but did not save the index.html. Otherwise it would be
there and it would work.

Jens



Re: links conversion; non-existent index.html

2005-05-01 Thread "Jens Rösner"

> I know! But that is intentionally left without index.html. It should 
> display content of the directory, and I want that wget mirror it 
> correctly.
> Similar situation is here:
> http://chemfan.pl.feedle.com/arch/chemfanftp/
> it is left intentionally without index.html so that people could download 
> these archives. 
Is something wrong with my browser?
This does not look like a simple directory listing; this file has formatting and
even a background image. http://chemfan.pl.feedle.com/arch/chemfanftp/ looks
the same as http://chemfan.pl.feedle.com/arch/chemfanftp/index.html in my
Mozilla, and wget downloads it correctly.

> If wget put here index.html in the mirror of such site 
> then there will be no access to these files.
IMO, this is not correct. index.html will include the info the directory
listing contains at the point of download.
This works for me with znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ as well -
which seemed to be the problem according to your other post.

> Well, if wget "has to" put index.html is such situations then wget is not 
> suitable for mirroring such sites, 
What exactly do you mean? It seems to work for me, e.g. index.html looks
like the apache-generated directory listing. When mirroring, index.html will
be re-written if/when it has changed on the server since the last mirroring.

> and I expect that problem to be 
> corrected in future wget versions.
You "expect"??

Jens



Re: newbie question

2005-04-14 Thread Jens Rösner
Hi Alan!

As the URL starts with https, it is a secure server. 
You will need to log in to this server in order to download stuff.
See the manual for info on how to do that (I have no experience with it).
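If the server uses HTTP basic authentication (I don't know whether this one
does), the relevant options would be --http-user and --http-passwd, roughly
like this, keeping the URL in quotes because of the ( ) and $ characters in it:

wget --http-user=USER --http-passwd=PASSWORD "https://164.224.25.30/FY06.nsf/..."

Replace USER and PASSWORD with your real account data; the "..." stands for
the rest of the long URL from your mail.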

Good luck
Jens (just another user)


>  I am having trouble getting the files I want using a wildcard
> specifier (-A option = accept list).  The following command works fine to
get an
> individual file:
> 
> wget
>
https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/160RDTEN_FY06PB.pdf
> 
> However, I cannot get all PDF files this command: 
> 
> wget -A "*.pdf"
>
https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/
> 
> Instead, I get:
> 
> Connecting to 164.224.25.30:443 . . . connected.
> HTTP request sent, awaiting response . . . 400 Bad Request
> 15:57:52  ERROR 400: Bad Request.
> 
>I also tried this command without success:
> 
> wget
>
https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/*.pdf
> 
> Instead, I get:
> 
> HTTP request sent, awaiting response . . . 404 Bad Request
> 15:57:52  ERROR 404: Bad Request.
> 
>  I read through the manual but am still having trouble.  What am I
> doing wrong?
> 
> Thanks, Alan
> 
> 
> 



Re: wget 1.9.1 with large DVD.iso files

2005-04-11 Thread Jens Rösner
Hi Sanjay!

This is a known issue with wget up to and including 1.9.x.
wget 1.10, which is currently in alpha status, fixes this problem.
I do not know how much experience you have with this kind of stuff, but you
could download the alpha source code and compile and test it.
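The usual GNU build steps should apply; something like this on a Unix-like
box (the archive name is only an example):

tar xzf wget-1.10-alpha.tar.gz
cd wget-1.10-alpha
./configure
make

./configure --help lists the available build options (e.g. SSL support).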

CU
Jens
(just another user)

> wget 1.9.1 fails when trying to download a very large file.
>  
> The download stopped in between and attempting to resume shows a negative
> sized balance to be downloaded.
>  
>
e.g.ftp://ftp.solnet.ch/mirror/SuSE/i386/9.2/iso/SUSE-Linux-9.2-FTP-DVD.iso
> 3284710 KB 
>  



Re: File rejection is not working

2005-04-06 Thread "Jens Rösner"
Hi Jerry!

AFAIK, RegExp for (HTML?) file rejection was requested a few times, but is
not implemented at the moment.

CU
Jens (just another user)

> The "-R" option is not working in wget 1.9.1 for anything but
> specifically-hardcoded filenames..
>  
> file[Nn]ames such as [Tt]hese are simply ignored...
>  
> Please respond... Do not delete my email address as I am not a
> subscriber... Yet
>  
> Thanks
>  
> Jerry
> 



Re: -X regex syntax? (repost)

2005-02-17 Thread Jens Rösner
Hi Vince!

> So, so far these don't work for me:
> 
> --exclude-directories='*.backup*'
> --exclude-directories="*.backup*"
> --exclude-directories="*\.backup*"

Would -X"*backup" be OK for you? 
If yes, give it a try.
If not, I think you'd need the correct escaping for the ".";
I have no idea how to do that, but
http://mrpip.orcon.net.nz/href/asciichar.html
lists
%2E
as the code. Does this work?
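For completeness, the whole command line would then look something like this
(URL is a placeholder):

wget -r -X "*backup*" URL

The quotes keep the shell from expanding the * itself.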

CU
Jens


> 
> I've also tried this on my linux box running v1.9.1 as well. Same results.
> Any other ideas?
> 
> Thanks a lot for your tips, and quick reply!
> 
> /vjl/



Re: perhaps a Bug? "No such file or directory"

2005-01-23 Thread Jens Rösner
Moin Michael!

http://www.heise.de/ttarif/druck.shtml?function=preis&Standort=0251&NEB=354&DAAuswahl=1234&CbCoA=alle&CbCmA=keine&Pre=keine&DA=folgende&Anbietername=False&Tarifnamen=True&Zonennamen=False&Netzzahl=True&Rahmen=True&Berechnung=real&Laenge=3&Abrechnung=False&Tag=1&Easy=True&0190=No&;
> This link works in a browser!

Ok, I can reproduce this error in Win2000, even with a quoted URL.
The reason is, I guess, the filename length limit of 255 characters.
The heise link has around 280 characters!
Workaround: 
wget -O tarife.html [Other Options] URL
saves the file to tarife.html

Good luck
Jens (just another user)



--cache=off: misleading manual?

2005-01-16 Thread Jens Rösner
Hi Wgeteers!
I understand that -C off as short for --cache=off
was dropped, right?
However, the wget.html that comes with Herold's
Windows binary mentions only
  --no-cache
and the wgetrc option
  cache = on/off
I just tried
1.9+cvs-dev-200404081407 unstable development version
and
--cache=off still works.
I think this is not the latest cvs version and
possibly the manual will be updated accordingly.
But I think it would be nice to mention
that --cache=off still works for backwards compatibility.
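In other words, if I am not mistaken, all of these currently do the same thing:

wget --no-cache URL
wget --cache=off URL
wget -e cache=off URL

plus "cache = off" in the wgetrc.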
I am aware that there are bigger tasks (LFS)
currently, I just stumbled over this "issue" and
thought I'd mention it.
I hope I am not missing something!
Jens



Re: wget 1.9.1

2004-10-18 Thread Jens Rösner
Hi Gerriet!

> Only three images, which were referenced in styles.css, were missing.
Yes, wget does not parse css or javascript.

> I thought that the -p option "causes Wget to download all the files 
> that are necessary to properly display a given HTML page. This includes 
> such things as inlined images, sounds, and referenced stylesheets."
The referenced stylesheet whatever.css should have been saved.

> Yes, it does not explicitly mention images referenced in style sheets, 
> but it does claim to download everything "necessary to properly display 
> a given HTML page".
I think this paragraph is misleading. As soon as JavaScript or CSS are
involved in certain ways (like displaying images), -p will not be able to
fully display the site.

> So - is this a bug, 
no - it is a missing feature.

> did I misunderstand the documentation, 
somehow

> did I use the wrong options?
kind of, but as the right options don't exist, you are not to blame ;)

> Should I get a newer version of wget? 
1.9.1 is the latest stable version according to http://wget.sunsite.dk/

CU
Jens (just another user)





Re: Recursion limit on 'foreign' host

2004-10-17 Thread Jens Rösner
Hi Manlio!

If I remember correctly, this request was
brought up quite some time ago, and quite
a few people thought it was a good idea.
But I don't think it has been implemented
as a patch so far.

CU
Jens



> Hi.
> wget is very powerfull and well designed, but there is a problem.
> There is no way to limit the recursion depth on foreign host.
> I want to use -m -H options because some sites have resources (binary 
> files) hosted on other host,
> it would be nice to say:
> 
> wget -m -H --external-level=1 ...
> 
> 
> 
> Thanks and regards   Manlio Perillo
> 




Re: img dynsrc not downloaded?

2004-10-17 Thread Jens Rösner
dynsrc is Microsoft DHTML for IE, if I am not mistaken.
As wget is -thankfully- not MS IE, it fails.
I just did a quick Google search and it seems that the use of
dynsrc is not recommended anyway.

What you can do is download
http://www.wideopenwest.com/~nkuzmenko7225/Collision.mpg
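For example, simply

wget http://www.wideopenwest.com/~nkuzmenko7225/Collision.mpg

fetches the file directly, without relying on wget parsing the dynsrc attribute.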

Jens

(and before you ask, no I am not a developer of wget, just a user)



> Hello.
> Wget could not follow dynsrc tags; the mpeg file was not downloaded:
>   
> at
>   http://www.wideopenwest.com/~nkuzmenko7225/Collision.htm
> 
> Regards,
> Juhana
> 







Re: feature request: treating different URLs as equivalent

2004-07-09 Thread Jens Rösner
Hi Seb!

I am not sure if I understand your problem completely, 
but if you don't mind, I'll try to help anyway.

Have you tried 
wget --cut-dirs=1 --directory-prefix=www.foo.com --span-hosts 
--domains=www.foo.com,foo.com,www.foo.org www.foo.com 

I think that could work. Maybe you'll need to add --no-clobber?

CU
Jens


> The other day I wanted to use wget to create an archive of the entire
> www.cpa-iraq.org website.  It turns out that http://www.cpa-iraq.org,
> http://cpa-iraq.org, http://www.iraqcoalition.org and
> http://iraqcoalition.org all contain identical content. Nastily, absolute
> links to sub-URLs of all of those hostnames are sprinkled throughout the
> site(s).
> 
> To be sure of capturing the whole site, then, I need to tell wget to
> follow links between the four domains I gave above. But because the site
> does sometimes use relative links, that ends up with the site content
> spread across 4 directories with much (but not complete) duplication
> between them.  This is wasteful and messy.



Re: Cannot WGet Google Search Page?

2004-06-12 Thread Jens Rösner
Hi Phil!

Without more info (wget's verbose or even debug output, full command
line,...) I find it hard to tell what is happening.
However, I have had very good success with wget and google.
So, some hints:
1. protect the Google URL by enclosing it in quotes (")
2. remember to span (and allow only certain) hosts, otherwise wget will
only download Google pages
And lastly - but you obviously did so - think about restricting the recursion
depth.
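Putting it all together, something along these lines might work (the -D list
is only an example, adjust it to the hosts you actually want to allow):

wget -r -l2 -H -Dexample.com,example.org "http://www.google.com/search?q=deepwater+oil"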

Hope that helps a bit
Jens

 > I have been trying to wget several levels deep from a Google search page
> (e.g., http://www.google.com/search?=deepwater+oil). But on the very first
> page, wget returns a 403 Forbidden error and stops. Anyone know how I can
> get around this?
> 
> Regards, Phil 
> Philip E. Lewis, P.E.
> [EMAIL PROTECTED]
> 
> 




Re: Startup delay on Windows

2004-02-08 Thread Jens Rösner
[...]
>Cygwin considers `c:\Documents and Settings\USERNAME' to be the
>home directory.  I wonder if that is reachable through registry...
> 
> Does anyone have an idea what we should consider the home dir under
> Windows, and how to find it?

Doesn't this depend on each user's personal preference?
I think most could live with
c:\Documents and Settings\all users (or whatever it is called in each
language) 
or the cygwin approach 
c:\Documents and Settings\USERNAME
which I think is less likely to conflict with security limits on multi-user
PCs.
I personally would like to keep everything wget-ish in the directory its exe
is in 
and treat that as its home dir.

BTW: 
Is this bug connected to the bug under Windows, that saving into another
directory 
than wget's starting dir by using the -P (--directory-prefix) option 
does not work when switching drives?

wget -r -P C:\temp URL
will save to
.\C3A\temp\*.*

wget -r -P 'C:\temp\' URL
will save to
.\'C3A\temp\'\*.*

wget -r -P "C:\temp\" URL
does not work at all ('Missing URL') error

however
wget -r -P ..\temp2\ URL
works like a charm.

CU
Jens








Re: skip robots

2004-02-08 Thread Jens Rösner
> You're close.  You forgot the `-u' option to diff (very important),
> and you snipped the beginning of the `patch' output (also important).

Ok, I forgot the -u switch, which was stupid as I actually read
the command line in the patches file :(
But concerning the snipping, I just did
diff  > file.txt
so I cannot have snipped anything. Is my shell (win2000)
doing something wrong, or is the missing bit there now (when using the -u
switch)?

Jens

Once more:

Patch sum up:
a) Tell users how to --execute more than one wgetrc command
b) Tell about and link to --execute when listing wgetrc commands.
Reason: Better understanding and navigating the manual

ChangeLog entry:
Changed wget.texi concerning --execute switch to facilitate 
use and user navigation.

Start patch:

--- wget.texi   Sun Nov 09 00:46:32 2003
+++ wget_mod.texi   Sun Feb 08 20:46:07 2004
@@ -406,8 +406,10 @@
 @itemx --execute @var{command}
 Execute @var{command} as if it were a part of @file{.wgetrc}
 (@pxref{Startup File}).  A command thus invoked will be executed
-@emph{after} the commands in @file{.wgetrc}, thus taking precedence over
-them.
+@emph{after} the commands in @file{.wgetrc}, thus taking precedence over 
+them. If you need to use more than one wgetrc command in your
+command-line, use -e preceeding each.
+
 @end table
 
 @node Logging and Input File Options, Download Options, Basic Startup
Options, Invoking
@@ -2147,8 +2149,9 @@
 integer, or @samp{inf} for infinity, where appropriate.  @var{string}
 values can be any non-empty string.
 
-Most of these commands have command-line equivalents (@pxref{Invoking}),
-though some of the more obscure or rarely used ones do not.
+Most of these commands have command-line equivalents (@pxref{Invoking}).
Any
+wgetrc command can be used in the command-line by using the -e (--execute)
(@pxref{Basic Startup Options}) switch.
+
 
 @table @asis
 @item accept/reject = @var{string} 









Re: skip robots

2004-02-08 Thread Jens Rösner
Hi Hrvoje!

> In other words, save a copy of wget.texi, make the change, and send the
> output of `diff -u wget.texi.orig wget.texi'.  That's it.

Uhm, ok. 
I found diff for windows among other GNU utilities at
http://unxutils.sourceforge.net/
if someone is interested.

> distribution.  See
> http://cvs.sunsite.dk/viewcvs.cgi/*checkout*/wget/PATCHES?rev=1.5

Thanks, I tried to understand that. Let's see if I understood it.
Sorry if I am not sending this to the patches list, the document above 
says that it is ok to evaluate the patch with the general list.

CU
Jens


Patch sum up:
a) Tell users how to --execute more than one wgetrc command
b) Tell about and link to --execute when explaining wgetrc commands.
Reason: Better understanding and navigating the manual.

ChangeLog entry:
Changed wget.texi concerning --execute switch to facilitate 
use and user navigation.

Start patch:


409,410c409,412
< @emph{after} the commands in @file{.wgetrc}, thus taking precedence over
< them.
---
> @emph{after} the commands in @file{.wgetrc}, thus taking precedence over 
> them. If you need to use more than one wgetrc command in your
> command-line, use -e preceeding each.
> 
2150,2151c2152,2154
< Most of these commands have command-line equivalents (@pxref{Invoking}),
< though some of the more obscure or rarely used ones do not.
---
> Most of these commands have command-line equivalents (@pxref{Invoking}).
Any
> wgetrc command can be used in the command-line by using the -e (--execute)
(@pxref{Basic Startup Options}) switch.
> 





Re: Why no -nc with -N?

2004-02-05 Thread Jens Rösner
Hi Dan,

I must admit that I don't fully understand your question.

-nc
means no clobber, i.e. files that already exist
locally are not downloaded again, independent of their age or size or
whatever.

-N
means that only newer files are downloaded (or if the size differs).

So these two options are mutually exclusive.
I could imagine that you want something like
wget --no-clobber --keep-server-time URL
right?
If I understand the manual correctly, this date should normally be kept 
for http,
at least if you specify
wget URL
I just tested this and it works for me.
(With -S and/or -s you can print the http headers, if you need to.)

However, I noticed that quite a few servers do not provide a
Last-Modified header.
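You can check what a particular server sends with

wget -S URL

and look for a "Last-Modified:" line in the printed headers; if it is
missing, -N cannot do proper time-stamping for that file.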

Did this answer your question?
Jens






> I'd love to have an option so that, when mirroring, it
> will backup only files that are replaced because they
> are newer on the source system (time-stamping).
>
> Is there a reason these can't be enabled together?
>
> __
> Do you Yahoo!?
> Yahoo! SiteBuilder - Free web site building tool. Try it!
> http://webhosting.yahoo.com/ps/sb/
>







Re: skip robots

2004-02-04 Thread Jens Rösner
use
robots = on/off in your wgetrc
or
wget -e robots=on/off URL on your command line (no spaces around the =)

Jens

PS: One note to the manual editor(s?): 
The -e switch could be (briefly?) mentioned 
also at the "wgetrc commands" paragraph. 
I think it would make sense to mention it there again 
without cluttering the manual too much. 
Currently it is only mentioned in "Basic Startup Options"
(and in an example dealing with robots).
Opinions?



> I onced used the "skip robots" directive in the wgetrc file.
> But I can't find it anymore in wget 1.9.1 documentation.
> Did it disapeared from the doc or from the program ?
> 
> Please answer me, as I'm not subscribed to this list
> 




Re: downloading multiple files question...

2004-02-03 Thread Jens Rösner
Hi Ron!

If I understand you correctly, you could probably use the 
-A acclist
--accept acclist
accept = acclist
option.

So, probably (depending on your site), the syntax should be something like:
wget -r -A *.pdf URL
wget -r -A *.pdf -np URL
or, if you have to recurse through multiple html files, 
it could be necessary/beneficial to
wget -r -l0 -A *.pdf,*.htm* -np URL
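One small note: on a Unix shell it is safer to quote the accept list so the
shell does not expand the * itself, e.g.

wget -r -l0 -np -A "*.pdf,*.htm*" URL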

Hope that helps (and is correct ;) )
Jens


> In the docs I've seen on wget, I see that I can use wildcards to 
> download multiple files on ftp sites.  So using *.pdf would get me all 
> the pdfs in a directory.  It seems that this isn't possible with http 
> sites though.  For work I often have to download lots of pdfs when 
> there's new info I need, so is there any way to download multiple files 
> of the same type from an http web page?
> 
> I'd like to be cc'd in replies to my post please as I'm not subscribed 
> to the mailing list.
> 




RE: apt-get via Windows with wget

2004-02-03 Thread Jens Rösner
Hi Heiko!

> > Until now, I linked to your main page. 
> > Would you mind if people short-cut this? 
> Linking to the directory is bad since people would download 

Sorry, I meant linking directly to the "latest" zip.
However, I personally prefer to read what the provider 
(in this case you) has to say about a download anyway.


> Do link to the complete url if you prefer to, although I like to keep 
> some stats.

Understood.


> for example since start of the year
> there have been 7 referrals from www.jensroesner.de/wgetgui 

Wow, that's massive... 
...not!
;-)


> Since that is about 0.05% stats shouldn't be 
> altered too much if you link directly to the archive ;-)

Thanks for pointing that out ;-}


> > What do you think about adding a "latest-ssl-libraries.zip"?
> I don't think so.
> If you get the "latest complete wget" archive those are included anyway 
> and you are sure it will work. 

Oh, I'm very sorry, I must have overlooked/misunderstood that.
I thought the "latest" zip would not contain the SSLs.
That's great.


> I'd prefer to not force a unneeded (admittedly, small) download by
bundling 
> the ssl libraries in every package.

Very true.
Especially as wget seems to be used by quite some people on slow
connections.


Kind regards
Jens







Re: problem with LF/CR etc.

2003-11-20 Thread "Jens Rösner"
Hi Hrvoje and all others,

> > It would do away with multiple (sometimes obscure) options few 
> > users use and combine them in one.
> You don't need bitfields for that; you can have an option like
> `--strict-html=foo,bar,baz' where one or more of "foo", "bar" and
> "baz" are recognized and internally converted to the appropriate
> bitmasks.  That way the user doesn't need to remember the numbers and
> provide the sum.

#slapsforeheadwithhand#
Of course, thanks! 


> > I meant that in this case the user can still change wget's behaviour
> > by using the "strict comment parsing" option.  I think that
> > contradicts what you said about wget dealing with a situation of bad
> > HTML all by itself.
> 
> I don't see the contradiction: handling bad HTML is *on* by default,
> no user assistance is required.  If the user requests
> standard-compliant behavior, then HTML with broken comments will no
> longer work, but the person who chose the option is probably well
> aware of that.

I thought you disliked the idea of a --lax-LFCR switch because 
of the additionally necessary user interaction _and_ 
the fact that it would create another option.
To argue against the first point, I mentioned the --strict-comments switch 
(which requires similar user interaction) 
and for the latter problem I suggested combining several 
--lax-foo and --strict-foo switches into one.


> I think you missed the point that I consider the --strict-foo (or
> --lax-foo) switches undesirable, and that the comment parsing switch
> is an intentional exception because the issue has been well-known for
> years.

I am aware that you are hesitant to implement even more options.
And I think you are right to do so!
In my first mail I just wanted to say that _if_ it is really, really
necessary to have
another --lax-foo or --strict-foo switch, it could be combined with 
the already existing --strict-comments switch to yield a tidy command-line.
I can't comment on the necessity of a --lax-LFCR switch 
in whatever appearance.
I also don't know how difficult coding it would be.
I just wanted to provide some ideas to implement it, 
if it turns out to be indeed necessary.
Sorry for any turbulence I created :(

CU
Jens






Re: problem with LF/CR etc.

2003-11-20 Thread "Jens Rösner"
Hi Hrvoje and all others,

> It forces the user to remember what each number means, and then have
> to add those numbers in his head.  Unix utilities already have the
> reputation of being user-unfriendly; why make things worse?

It would do away with multiple (sometimes obscure) options few 
users use and combine them in one. 
Those users who use these options are (I would think) 
those who can deal with a bitcode option.
All others would (have to) live with the default setting.
I don't want to push anyone towards a bitcode by any means!!
Just my $0.02 on options.

> > But I think wget is already breaking this rule with the 
> > implementation of comment-parsing, or am I mistaken?
> You are mistaken.  Lax comment parsing is on by default, 
Sorry, I think I did not make myself clear:
I know that lax comment parsing is "on" by default.
I meant that in this case the user can still change wget's behaviour 
by using the "strict comment parsing" option.
I think that contradicts what you said about wget 
dealing with a situation of bad HTML all by itself.
Also, to my mind, the difference between
--stricthtmlrules
and
--laxhtmlrules
is just a negation of the syntax.

> > We could make "full relaxation" the default and use the inverted
> > option --stricthtmlrules, to exclude certain relaxations. This is
> > probably more "automatic downloading"ish.
> I don't like this idea because it means additional code to handle
> something that noone really cares about anyway ("strict HTML").  
Again I don't understand:
What I suggested is more or less to generalize the 
"strict html comments" option towards other cases of wrong HTML.
I really don't see a fundamental difference here. 
The better documentation and wider occurrence of bad comments
are reasons why "strict html comments" is a more needed option
than one to deal with LF/CR.
If you say that an option for LF/CR is not needed, I trust you.
But I still do not see a fundamental difference in implementation 
for the user.

CU
Jens





Re: problem with LF/CR etc.

2003-11-20 Thread "Jens Rösner"
Hi, 

just an additional remark to my own posting.

> > There is no easy way to punish the culprit.  The only thing you can do
> > in the long run is refuse to interoperate with something that openly
> > breaks applicable standards.  Otherwise you're not only rewarding the
> > culprit, but destroying all the other tools because they will sooner
> > or later collapse under the weight of kludges needed to support the
> > broken HTML.
> 
> I can't argue with that. 
> However, from the _user's_ point of view, _wget_ would seem to be broken, 
> as the user's webbrowser probably shows everything correctly.
> If it is decided that wget does not consider links with LF/CR 
> in them, then IMHO, the user should get informed what happened. 

I just realized that this would mean that wget has to be able to detect the
breaking of rules and then give a message to the user.
That creates a situation where:
a) wget has to be smart enough to notice that _something_ is broken (and not
just a 404)
b) wget would ideally be smart enough to know _what_ is broken
c) the user thinks: Well, if wget knows what is wrong, why doesn't wget
correct it?
On the other hand, not giving even a brief message like
"Invalid HTML code found, downloaded files may be unwanted."
leaves the user in the dark.
I don't know how to balance that :(

CU
Jens




Re: problem with LF/CR etc.

2003-11-20 Thread "Jens Rösner"
Hi all!

> There is no easy way to punish the culprit.  The only thing you can do
> in the long run is refuse to interoperate with something that openly
> breaks applicable standards.  Otherwise you're not only rewarding the
> culprit, but destroying all the other tools because they will sooner
> or later collapse under the weight of kludges needed to support the
> broken HTML.

I can't argue with that. 
However, from the _user's_ point of view, _wget_ would seem to be broken,
as the user's web browser probably shows everything correctly.
If it is decided that wget does not consider links with LF/CR
in them, then IMHO the user should be informed about what happened.

> > But if (for whatever reasons) an option is unavoidable, I would 
> > suggest something like
> > --relax_html_rules #integer
> > where integer is a bit-code (I hope that's the right term).
> 
> This is not what GNU options usually look like and how they work
> (underscores in option name, bitfields).  
underscores: Sorry, I just gave an example, I'm not a GNUer ;)
bitfields: Ok. Any (short) reason for that? Is it considered
not transparent, or ugly?

> But more importantly, I
> really don't think this kind of option is appropriate.  Wget should
> either detect the brokenness and handle it automatically, or refuse to
> acknowledge it altogether.  The worst thing to do is require the user
> to investigate why the HTML didn't parse, only to discover that Wget
> in fact had the ability to process it, but didn't bother to do so by
> default.

Hm, well, I can see your point there. 
But I think wget is already breaking this rule with the 
implementation of comment-parsing, or am I mistaken?
We could make "full relaxation" the default and 
use the inverted option --stricthtmlrules, to exclude certain 
relaxations. This is probably more "automatic downloading"ish.

CU
Jens









Re: problem with LF/CR etc.

2003-11-20 Thread "Jens Rösner"
Hi!

> Do you propose that squashing newlines would break legitimate uses of
> unescaped newlines in links?  
I personally think that this is the main question.
If it doesn't break other things, implement "squashing newlines" 
as the default behaviour.

> Or are you arguing on principle that
> such practices are too heinous to cater to by default?  
Well, if I may speak openly, 
I don't think wget should be a moralist here.
If the fix is easy to implement and doesn't break things, let's do it. 
After all, ignoring these links does not punish the culprit (the HTML coder)
but the innocent user, who expects that wget will download the site.

> IMHO we should either cater to this by default or not at all.
Agreed.
But if (for whatever reasons) an option is unavoidable, I would 
suggest something like
--relax_html_rules #integer
where integer is a bit-code (I hope that's the right term). 
For example
0 = off
1 (2^0)= smart comment checking
2 (2^1)= smart line-break checking
4 (2^2)= option to come
8 (2^3)= another option to come
So specifying
wget -m --relax_html_rules 0 URL
would ensure strict HTML obedience, while
wget -m --relax_html_rules 15 URL
would relax all of the above-mentioned rules.
By using this bit-code, one integer is able 
to represent all combinations of relaxations 
by summing up the individual options.
One could even think about 
wget -m --relax_html_rules inf URL
to ensure that _all_ rules are relaxed, 
to be upward compatible with future wget versions.
Whether 
--relax_html_rules inf
or
--relax_html_rules 0
or 
--relax_html_rules another-combination-that-makes-most-sense
should be default, is up to negotiation.
However, I would vote for complete relaxation.

I hope that made a bit of sense
Jens








Re: How to send line breaks for textarea tag with wget

2003-11-16 Thread "Jens Rösner"
Hi Jing-Shin!

> Thanks for the pointers. Where can I get a version that support
> the --post-data option? My newest version is 1.8.2, but it doesn't
> have this option. -JS

Current version is 1.9.1.
The wget site lists download options on 
http://wget.sunsite.dk/#downloading

Good luck
Jens





Re: Web page "source" using wget?

2003-10-13 Thread Jens Rösner
Hi Hrvoje!

> > retrieval, eventhough the cookie is there.  I think that is a
> > correct behaviour for a secure server, isn't it?
> Why would it be correct?  
Sorry, I seem to have been misled by my own (limited) experience:
From the few secure sites I use, most will not let you
log in again after you closed and restarted your browser or redialed
your connection. That's what reminded me of Suhas' problem.

> Even if it were the case, you could tell Wget to use the same
> connection, like this:
> wget http://URL1... http://URL2...
Right, I always forget that, thanks!

Cya
Jens






Re: Web page "source" using wget?

2003-10-13 Thread Jens Rösner
Hi Suhas!

Well, I am by no means an expert, but I think that wget 
closes the connection after the first retrieval. 
The SSL server realizes this and decides that wget has no right to log in 
for the second retrieval, even though the cookie is there.
I think that is a correct behaviour for a secure server, isn't it?

Does this make sense? 
Jens


> A slight correction the first wget should read:
> 
> wget --save-cookies=cookies.txt 
> http://customer.website.com/supplyweb/general/default.asp?UserAccount=U
> SER&AccessCode=PASSWORD&Locale=en-us&TimeZone=EST:-300&action-Submi
> t=Login
> 
> I tried this link in IE, but it it comes back to the same login screen. 
> No errors messages are displayed at this point. Am I missing something? 
> I have attached the "source" for the login page.
> 
> Thanks,
> Suhas
> 
> 
> - Original Message - 
> From: "Suhas Tembe" <[EMAIL PROTECTED]>
> To: "Hrvoje Niksic" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Monday, October 13, 2003 11:53 AM
> Subject: Re: Web page "source" using wget?
> 
> 
> I tried, but it doesn't seem to have worked. This what I did:
> 
> wget --save-cookies=cookies.txt 
> http://customer.website.com?UserAccount=USER&AccessCode=PASSWORD&Loca
> le=English (United States)&TimeZone=(GMT-5:00) Eastern Standard Time 
> (USA & Canada)&action-Submit=Login
> 
> wget --load-cookies=cookies.txt 
> http://customer.website.com/supplyweb/smi/inventorystatus.asp?cboSupplier
> =4541-134289&status=all&action-select=Query 
> --http-user=4542-134289
> 
> After executing the above two lines, it creates two files: 
> 1). "[EMAIL PROTECTED]" :  I can see that 
> this file contains a message (among other things): "Your session has 
> expired due to a period of inactivity"
> 2). "[EMAIL PROTECTED]"
> 
> Thanks,
> Suhas
> 
> 
> - Original Message - 
> From: "Hrvoje Niksic" <[EMAIL PROTECTED]>
> To: "Suhas Tembe" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>
> Sent: Monday, October 13, 2003 11:37 AM
> Subject: Re: Web page "source" using wget?
> 
> 
> > "Suhas Tembe" <[EMAIL PROTECTED]> writes:
> > 
> > > There are two steps involved:
> > > 1). Log in to the customer's web site. I was able to create the 
> following link after I looked at the  section in the "source" as 
> explained to me earlier by Hrvoje.
> > > wget 
> http://customer.website.com?UserAccount=USER&AccessCode=PASSWORD&Loca
> le=English (United States)&TimeZone=(GMT-5:00) Eastern Standard Time 
> (USA & Canada)&action-Submit=Login
> > 
> > Did you add --save-cookies=FILE?  By default Wget will use cookies,
> > but will not save them to an external file and they will therefore be
> > lost.
> > 
> > > 2). Execute: wget
> > > 
> http://customer.website.com/InventoryStatus.asp?cboSupplier=4541-134289
> &status=all&action-select=Query
> > 
> > For this step, add --load-cookies=FILE, where FILE is the same file
> > you specified to --save-cookies above.
> 
> 




Re: Error: wget for Windows.

2003-10-08 Thread Jens Rösner
Hi Suhas!

> I am trying to use wget for Windows & get this message: "The ordinal 508 
> could not be located in the dynamic link library LIBEAY32.dll".

You are very probably using the wrong version of the SSL files.
Take a look at 
http://xoomer.virgilio.it/hherold/
Herold has nicely rearranged the links to 
wget binaries and the SSL binaries.
As you can see, different wget versions need 
different SSL versions.
Just download the matching SSL, 
everything else should then be easy :)

Jens



> 
> This is the command I am using:
> wget http://www.website.com --http-user=username 
> --http-passwd=password
> 
> I have the LIBEAY32.dll file in the same folder as the wget. What could 
> be wrong?
> 
> Thanks in advance.
> Suhas
> 




Re: no-clobber add more suffix

2003-10-06 Thread Jens Rösner
Hi Sergey!

-nc does not only apply to .htm(l) files.
All files are considered.
At least in all wget versions I know of.

I cannot comment on your suggestion, to restrict -nc to a 
user-specified list of file types.
I personally don't need it, but I could imagine certain situations
where this could indeed be helpful.
Hopefully someone with more knowledge than me 
can elaborate a bit more on this :)

CU
Jens



> `--no-clobber' is very usfull option, but i retrive document not only with
> .html/.htm suffix.
> 
> Make addition option that like -A/-R define all allowed/rejected rules
> for -nc option.
> 




Bug in Windows binary?

2003-10-05 Thread Jens Rösner
Hi!

I downloaded 
wget 1.9 beta 2003/09/29 from Heiko
http://xoomer.virgilio.it/hherold/
along with the SSL binaries.
wget --help 
and 
wget --version 
will work, but 
any downloading like 
wget http://www.google.com
will immediately fail.
The debug output is very brief as well:

wget -d http://www.google.com
DEBUG output created by Wget 1.9-beta on Windows.

set_sleep_mode(): mode 0x8001, rc 0x8000

I disabled my wgetrc as well and the output was exactly the same.

I then tested 
wget 1.9 beta 2003/09/18 (earlier build!)
from the same place and it works smoothly.

Can anyone reproduce this bug?
System is Win2000, latest Service Pack installed.

Thanks for your assistance and sorry if I missed an 
earlier report of this bug, I know a lot has been done over the last weeks 
and I may have missed something.
Jens






Re: The Dynamic link Library LIBEAY32.dll

2003-01-14 Thread Jens Rösner
Hi Stacee, 

a quick cut'n'paste into google revealed the following page:
http://curl.haxx.se/mail/archive-2001-06/0017.html

Hope that helps
Jens


> Stacee Kinney wrote:
> 
> Hello,
> 
> I installed Wget.exe on a Windows 2000 system and has setup Wget.exe
> to run a maintenance file on an hourly bases. However, I am getting
> the following error.
> 
> wget.exe - Unable to Locate DLL
> 
> The dynamic link library LIBEAY32.dll could not be found in the
> specified path
> 
>C:\WINNT;,;C:\WINNT\System32;C:\WINNT\system;c:\WINNT;C:\Perl\bin;C\WINNT\system32;C;WINNT;C:\WINNT\system32\WBEM.
> 
> I am not at all knowledgeable about Wget and just tried to follow
> instructions for its installation to run the maintenance program.
> Could you please help me with this problem and the DLL file Wget is
> looking for?
> 
> Regards
> Stacee



Re: wget -m imply -np?

2002-12-30 Thread Jens Rösner
Hi Karl!

From my POV, the current set-up is the best solution.
Of course, I am also no developer, but an avid user.
Sometimes you just don't know the structure of the website 
in advance, so using -m as a trouble-free no-brainer 
will get you the complete site neatly done with timestamps.
BTW, -m is an abbreviation:
-m = -r -l0 -N IIRC
If you _know_ that you don't want to grab upwards, 
just add -np and you're done. Otherwise someone 
would have to come up with a switch to disable the default -np 
that you suggested or the user would have to rely on 
the single options that -m is made of - hassle without benefit.
You furthermore said:
"generally, that leads to the whole Internet"
That is wrong, if I understand you correctly. 
Wget will always stay at the start-host, except when you 
allow different hosts via a smart combination of 
-D -H -I 
switches.
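For example (host names are placeholders), something like

wget -m -H -Dfoo.com,images.foo.com http://www.foo.com/

keeps the crawl limited to the listed domains even though -H allows leaving
the start host.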

H2H
Jens


Karl Berry wrote:
> 
> I wonder if it would make sense for wget -m (--mirror) to imply -np
> (--no-parent).  I know that I, at least, have no interest in ever
> mirroring anything "above" the target url(s) -- generally, that leads to
> the whole Internet.  An option to explicitly include the parents could
> be added.
> 
> Just a suggestion.  Thanks for the great software.
> 
> [EMAIL PROTECTED]



Re: Improvement: Input file Option

2002-10-08 Thread Jens Rösner

Hi Pi!

Copied straight from the wget.hlp:

#

-i file
--input-file=file

Read URLs from file, in which case no URLs need to be on the command
line.  If there are URLs both on the command line and in an input file,
those on the command lines will be the first ones to be retrieved.  The
file need not be an HTML document (but no harm if it is)--it is enough
if the URLs are just listed sequentially.

However, if you specify --force-html, the document will be regarded as
html.  In that case you may have problems with relative links, which you
can solve either by adding <base href="url"> to the documents or by
specifying --base=url on the command line.

-F
--force-html

When input is read from a file, force it to be treated as an HTML file. 
This enables you to retrieve relative links from existing HTML files on
your local disk, by adding <base href="url"> to HTML, or using the
--base command-line option.

-B URL
--base=URL

When used in conjunction with -F, prepends URL to relative links in the
file specified by -i.

#

I think that should help, or I am missing your point.
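In your case that would probably boil down to something like (file name and
URL are only examples):

wget -r -F -B http://www.example.com/ -i saved-page.html

so that the relative links in your locally saved/edited page are resolved
against the original site.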

CU
Jens


Thomas Otto wrote:
> 
> Hi!
> 
> I miss an option to use wget with a local html file that I have
> downloaded and maybe already edited. Wget should take this file plus the
> option where this file originally came from and take this file instead
> of the first document it gets after connecting.
> 
>-Thomas



Re: -p is not respected when used with --no-parent

2002-09-20 Thread Jens Rösner

Hi Dominic!

Since wget 1.8, the following should be the case:
*
*** When in page-requisites (-p) mode, no-parent (-np) is ignored when
retrieving for inline images, stylesheets, and other documents needed
to display the page.
**
(Taken from the included news file of wget 1.8.1)

I however remember that I once had the same problem, 
that -p -np will only get page requisites under or at the current
directory.
I currently run wget 1.9-beta and haven't seen the problem yet.

CU
Jens


> Dominic Chambers wrote:
> 
> Hi again,
> 
> I just noticed that one of the inline images on one of the jobs I did
> was not included. I looked into this, and it was because it was it
> outside the scope that I had asked to remain within using --no-parent.
> So I ran the job again, but using the -p option to ensure that I kept
> to the right pages, but got all the page requesites regardless.
> 
> However, this had no effect, and I therefore assume that this option
> is not compatible with the --no-parent option:
> 
> wget -p -r -l0 -A htm,html,png,gif,jpg,jpeg
> --no-parent http://java.sun.com/products/jlf/ed2/book/index.html
> 
> Hope that helps, Dominic.



Re: user-agent string

2002-09-02 Thread Jens Rösner

Hi Jakub!

"But I get the same files as running this coomand without using
user-agent
string."

What is wrong with the files you get?
Do you not get all the files?
Many servers (sites) do not care which
user-agent accesses them, so the files will not differ.
If you know that you don't get all the files (or get the wrong ones),
it may be that you should ignore robots via the
wgetrc command
robots = on/off
or you need a special referrer if you want
to start in the "middle" of the site.
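If that is the case, something like this might be worth a try (URLs are
placeholders):

wget -m -e robots=off --referer="http://site/start.html" http://site/middle/page.html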

CU
Jens


" (Jakub Grosman)" wrote:
> 
> Hi all,
> I am using wget a long time ago and it is realy great utility.
> I run wget 1.8.22 on redhat 7.3 and my problem is concerning the user agent
> string.
> I run this command:
> wget -m --user-agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)" -l0 -H
> -Dsite -nd -np -Pdirectory http://site
> 
> But I get the same files as running this coomand without using user-agent
> string.
> Could someone explain me, what I am making not correct?
> 
> Thanks
> Jakub



Re: getting the correct links

2002-08-29 Thread Jens Rösner

Hi!

Max's hint is incorrect, I think, as -m includes
-N (timestamps) and -r (recursive).
Furthermore, I remember that wget http://www.host.com
automatically defaults to recursive, but I am not sure at the moment, sorry.

I think Christopher's problem is -nd.
This means "no directories" and results in all
files being written to the directory wget is started from
(or told to save to via -P).
So, if I am right, all files, even those from the server subdirectories,
are there, Chris, just not neatly saved to local subdirs.
Could you confirm this?
If so, just leave out -nd and it should work.
A single file is by default saved into the wget dir; with -x (force dirs)
you can save it to the full path locally.
wget offers numerous ways to cut the path,
please look it up in the manual if interested.
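So the first thing to try would be simply

wget -km http://www.mywebsite.com

(your command without -nd), and for a single file with its full local path
something like

wget -x -k http://www.mywebsite.com/some/page.html

where the path is of course just an example.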

CU
Jens

> Christopher Stone wrote:
> > Thank you all.
> > 
> > Now the issue seems to be that it only gets the root
> > directory.
> > 
> > I ran 'wget -km -nd http://www.mywebsite.com
> 
> -r
> 
> Max.
> 





Re: getting the correct links

2002-08-28 Thread Jens Rösner

Hi Chris!

Using the -k switch (--convert-links, which makes the links in the downloaded
pages point to your local copies) should do what you want.

CU
Jens


Christopher Stone wrote:
> 
> Hi.
> 
> I am new to wget, and although it doesn't seem to
> difficult, I am unable to get the desired results that
> I am looking for.
> 
> I currently have a web site hosted by a web hosting
> site. I would like to take this web site as is and
> bring it to my local web server. Obviously, the ip
> address and all the links point back to this web
> server.
> 
> When I ran wget and sucked the site to my local box,
> it pulled all the pages down and the index page comes
> up fine, but when I click on a link, it goes back to
> the remote server.
> 
> What switch(s) do I use, so that when I pull the pages
> to my box, that all of the links are changed also?
> 
> Thank you.
> 
> Chris
> 
> please cc to me, as i am not a list subscriber.
> 
> __
> Do You Yahoo!?
> Yahoo! Finance - Get real-time stock quotes
> http://finance.yahoo.com



Re: I'm Sorry But...

2002-08-26 Thread Jens Rösner

I am not sure if I understand you correctly, but wget (in http mode) 
cannot view the directory listing.
This is only possible in ftp mode, but of course 
you can only use ftp on an ftp-server.
On an http server, wget can only follow from link to link 
or download a "hidden" file if you enter its URL.

CU
Jens


À̺´Èñ wrote:
> 
> I'm sorry, but I have no community to know about my question.
> 
> I want to know ...
> 
> How can I scan the Web Server's all data using the wget.
> 
> I want to get all files  under "DocumentRoot". (ex. when using apache Web Server ... 
>httpd.conf)
> 
> I want to get all files under "DocumentRoot" ALL FILES that not linked and just 
>exist under "DocumentRoot"
> 
> How can I do??
> 
> wget -r http://www.aaa.org  was not scan all files under "DocumentRoot"...
> 
> Best Regard



-r -R.exe downloads exe file

2002-08-04 Thread Jens Rösner

Hi!

With wget 1.9-beta, wget will download .exe files 
although they should be rejected (-r -R.exe).
After the download, wget removes the local file.
I understand that HTML files are downloaded even if -R.html,.htm
is specified, as the links they may contain
have to be parsed.
However, I think this makes no sense for .exe files
and wanted to ask if this behaviour of wget
could perhaps be reconsidered.

Kind regards
Jens



Syntax for "exclude_directories"?

2002-07-27 Thread Jens Rösner

Hi guys!

Could someone please explain to me how to use
-X (exclude_directories; --exclude-directories)
correctly on Windows machines?

I tried 
wget -X"/html" -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X"html" -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X html -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X/html -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X'/html' -x -k -r -l0 http://home.arcor.de/???/versuch.html

All will traverse into the http://home.arcor.de/???/html folder.
I also tried the wgetrc version with either quotes, slashes or 
combinations.

I had a look into the wget documentation html file, but could not find
my mistake.

I tried both wget 1.5 and 1.9-beta.

Kind regards
Jens



Re: speed units

2002-06-10 Thread Jens Rösner

Hi Joonas!

There was a lengthy discussion about this topic a few months ago.
I am pretty sure (= I hope) that no one wants to revamp this (again).
I personally think that if people start regarding this as
a "bug", wget is damn close to absolute perfection.
(Yes, I know, perfection is by definition "complete", so that is a
pleonasm.)
If you are really interested, do 
a) a search in Google
b) a search in the wget Mailing list archive

CU
Jens


Joonas Kortesalmi wrote:
> 
> Wget seems top repots speeds with wrong units. It uses for example "KB/s"
> rather than "kB/s" which would be correct. Any possibility to fix that? :)
> 
> K = Kelvin
> k = Kilo
> 
> Propably you want to use small k with download speeds, right?
> 
> Thanks a lot anyways for such a great tool!
> Keep up the good work, free software rules the world!
> 
> --
> Joonas Kortesalmi <[EMAIL PROTECTED]>



Re: robots.txt

2002-06-10 Thread Jens Rösner

Hi Pike and the list!


> >> > or your indexing mech might loop on it, or crash the server. who knows.
> >> I have yet to find a site which forces wGet into a "loop" as you said.
> I have a few. And I have a few java servers on linux that really hog the
> machine when requested. They're up for testing.
Ok, I am sorry, I always thought that when something like this happens,
the person causing the "loop" would suffer most and therefore be punished
directly.
I did not imagine that the server could really go down in such a
situation.

> >> If the robots.txt said that no user-agent may access the page, you would
> >> be right.
> right. or if it says some page is only ment for one specific bot.
> these things have a reason.
Yes, you are right. I concluded from my own experience that most
robots.txt files say:
"If you are a browser or Google (and so on), go ahead; if you are
anything else, stop."
Allowing a certain bot onto a bot-specific page was outside my scope.

CU
Jens



Re: robots.txt

2002-06-09 Thread Jens Rösner

Hi!

> >>> Why not just put "robots=off" in your .wgetrc?
> hey hey
> the "robots.txt" didn't just appear in the website; someone's
> put it there and thought about it. what's in there has a good reason.
Well, from my own experience, the #1 reason is that webmasters 
do not want webgrabbers of any kind to download the site in order to
force 
the visitor to interactively browse the site and thus click
advertisement banners.

> The only reason is 
> you might be indexing old, doubled or invalid data, 
That is cute, someone who believes that all people on the 
internet do what they do to make life easier for everyone.
If you said "one reason is" or even "one reason might be", 
I would not be that cynical, sorry.

> or your indexing mech might loop on it, or crash the server. who knows.
I have yet to find a site which forces wGet into a "loop" as you said.
Others on the list can probably estimate the theoretical likelihood of
such events.

> ask the webmaster or sysadmin before you 'hack' the site.
LOL!
hack! Please provide a serious definition of "to hack" that includes 
"automatically downloading pages that could be downloaded with any
interactive web-browser"
If the robots.txt said that no user-agent may access the page, you would
be right.
But then: How would anyone know of the existence of this page then?
[rant]
Then again, maybe the page has a high percentage of cgi, JavaScript,
iFrames and thus only allows 
IE 6.0.123b to access the site. Then wget could maybe slow down the
server, especially as it is 
probably a w-ows box :> But I ask: Is this a bad thing?
Whuahaha!
[/rant]

Ok, sorry for my sarcasm, but I think you overestimate the benefits of
robots.txt for mankind.

CU
Jens



Re: A couple of wget newbie questions (proxy server and multiplefiles)

2002-05-16 Thread Jens Rösner

Hi Dale!

> Do I have to do 4 separate logins passing my username/password each time?
> If not, how do I list the 4 separate directories I need to pull files from
> without performing 4 logins?
you should be able to put the four URLs into a .txt file and then 
use this text file with -i filename.txt
I use Windows, so if you run Linux your file extension may differ
(right?)
Also please note that I have neither used wget on a password-protected
site 
nor on ftp, so I may be wrong here.
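An untested sketch of what I mean (host, login and paths are made up):
put lines like
ftp://dale:secret@ftp.example.com/dir1/file1.dat
ftp://dale:secret@ftp.example.com/dir2/file2.dat
into files.txt and run
wget -i files.txt
I believe wget accepts the user:password@host form in FTP URLs, which
should save you the four separate logins, but again, I have not tried it.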

> We are behind a firewall, I can't see how to pass the proxy server IP
> address to wget. And unfortunately, our IT group will not open up a hole for
> me to pull these files.
No problem, use
proxy = on
http_proxy = IP/URL
ftp_proxy = IP/URL
proxy_user = username
proxy_password = proxypass
in your wgetrc.
This is also included in the wget manual, 
but I, too, was too dumb to find it. ;)

CU
Jens



Re: Following absolute links wite wget

2002-05-05 Thread Jens Rösner

Hi S.G.,

If I am not completely asleep after yesterday, -np means --no-parent and 
forces a recursion ONLY down the directory tree, whereas 
../dir1/file.ext means in HTML: Go UP one level and then into the
directory dir1.
So, remove the -np and you should be fine.
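For example (untested, simply your call without the -np):
./wget -r -L http://www.addr.com/filelist.htm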

CU
Jens


"S.G" wrote:
> 
> I am having trouble instructing wget to follow links
> such as ../dir1/file.ext .
> 
> Is there a certain way to go about following such
> links.
> 
> currently i am trying ./wget -r -np -L
> http://www.addr.com/filelist.htm
> 
> Thanks for any and all help on this.



Re: Using Wget in breadth-first search (BFS) mode

2002-04-30 Thread Jens Rösner

Hi Evgeniy!

> ==> My question is whether this change is indeed operational and 
> stable for deployment. Has BFS indeed replaced the DFS
> as the primary recursion mechanism in Wget ?

From my own personal experience, BFS is stable and the only recursion
mechanism for wget 1.8+.
The last version with DFS is 1.7.x, also stable, but lacking some other
features of 1.8.x.
I don't know the scope of your project, but if you want to compare DFS and
BFS, I think comparing 1.7.1 and 1.8.1 is as close to ideal as it gets (maybe
Prolog would be another, less useful but truer way of doing this ;)

CU
Jens





Re: Ftp user/passwd

2002-04-25 Thread Jens Rösner

Hi Jim!

This is no bug, should have been sent to the normal wget-List, CC changed.
If I remember correctly, you can specify all .wgetrc commands on the command
line 
with the -e option.
I hope this is valid for "sensitive" data like logins, too.
I read that using -e "wgetrc switch" works on windows, note the "", don't
know about correct *nix syntax.
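A guess at how it could look (completely untested; user, password and URL
are placeholders):
wget -e "login=jim" -e "passwd=secret" ftp://ftp.example.com/pub/file.zip
or, probably simpler, putting the login directly into the URL:
wget ftp://jim:secret@ftp.example.com/pub/file.zip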

CU
Jens


> Hello,
> 
> I was wondering if it's possible to specify ftp-login and password on
> the commandline as well, instead of putting it in .wgetrc?
> 
> Thanks
> Jim
> 
> __
> Jim De Sitter
> BMB-unix team Amdahl Belgium
> Vooruitgangsstraat 55 Woluwedal 26/b4
> 1210 Brussel  1932 St. Stevens Woluwe
> Tel: +32 (0)2/205.24.13   Tel: +32 (0)2/715.03.00
> E-mail: [EMAIL PROTECTED]
> E-mail: [EMAIL PROTECTED]
> 





Re: Feature request

2002-04-24 Thread Jens Rösner

Hi Brix!

> >> It also seems these options are incompatible:
> >> --continue with --recursive
[...]
> JR> How should wget decide if it needs to "re-get" or "continue" the file?
[...]
Brix: 
> Not wanting to repeat my post from a few days ago (but doing so nevertheless): the one way
> without checking all files online is to have wget write the downloaded
> file into a temp file (like *.wg! or something) and renaming it only
> after completing the download.

Sorry for not paying attention. 
It sounds like a good idea :)
But I am no coder...

CU
Jens



Re: Feature request

2002-04-24 Thread Jens Rösner

Hi Frederic!

> I'd like to know if there is a simple way to 'mirror' only the images
> from a gallery (i.e. without thumbnails).
[...]
I won't address the options you suggested, because I think they should 
be evaluated by a developer/coder.
However, as I often download galleries (and have some myself), I might be
able to give you a few hints:
Restricting files to be downloaded by
a) file-name
b) the directory they are in

To a):
-R*.gif,*tn*,*thumb*,*_jpg*,*small*
you get the picture I guess (pun not intended, but funny nevertheless). 
Works quite well. 

To b):
--reject-dir *thumb*

(I am not sure about the correct spelling/syntax, I currently have neither
wget nor winzip -or similar- on this machine, sorry!)

> It also seems these options are incompatible:
> --continue with --recursive
> This could be useful, imho.
IIRC, you are correct, but this is intentional. (right?)
You probably think of the case where during a recursive download, the
connection breaks and a large file is only partially downloaded.
I could imagine that this might be useful.
However, I see a problem when using timestamps, which normally require 
that a file be downloaded, if sizes local/on the server do not match, 
or the date on the server is newer. 
How should wget decide if it needs to "re-get" or "continue" the file?
You could probably do "smart guessing", but the chance of false decisions
persists.
As a matter of fact, the problem also exists when using --continue on a
single file, but then it is the user's decision and the story is therefore
quite different (I think).

CU
Jens






Re: Validating cookie domains

2002-04-19 Thread Jens Rösner

Hi Ian!

> > This is amazingly stupid.
> It seems to make more sense if you subtract one from the number of
> periods.
That was what I thought, too.

> Could you assume that all two-letter TLDs are country-code TLDs and
> require one more period than other TLDs (which are presumably at
> least three characters long)?
No, I don't think so.
Take my sites, for example
http://www.ichwillbagger-ladida.de
http://ichwillbagger-ladida.de
(remove the -ladida)
both work.

Or -as another phenomenon I found- take 
http://www.uvex-ladida.de
and 
http://uvex-ladida.de
(remove the -ladida)
They are different...

I hope I did not miss your point.

CU
Jens













Re: feature wish: switch to disable robots.txt usage

2002-04-10 Thread Jens Rösner

Hi!

Just to be complete, thanks to Hrvoje's tip, 
I was able to find 

-e command
--execute command
Execute command as if it were a part of .wgetrc (see Startup File.). 
A command thus invoked will be executed after the 
commands in .wgetrc, thus taking precedence over them.

I always wondered about that. *sigh*
I can now think about changing my wgetgui in this aspect :)

Thanks again
Jens


Hrvoje Niksic wrote:
> 
> Noel Koethe <[EMAIL PROTECTED]> writes:
> 
> > Ok got it. But it is possible to get this option as a switch for
> > using it on the command line?
> 
> Yes, like this:
> 
> wget -erobots=off ...



Re: feature wish: switch to disable robots.txt usage

2002-04-10 Thread Jens Rösner

Hi Noel!

Actually, this is possible.
I think at least since 1.7, probably even longer.
Cut from the doc:
robots = on/off
Use (or not) /robots.txt file (see Robots.). Be sure to know what you
are doing before changing the default (which is on).
Please note:
This is a (.)wgetrc-only command.
You cannot use it on the command line, if I am not mistaken.
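So in your (.)wgetrc you would simply put the line
robots = off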

CU
Jens


Noel Koethe wrote:
> 
> Hello,
> 
> is it possible to get a new option to disable the usage
> of robots.txt (--norobots)?
> 
> for example:
> I want to mirror some parts of http://ftp.de.debian.org/debian/
> but the admin have a robots.txt
> 
> http://ftp.de.debian.org/robots.txt
> User-agent: *
> Disallow: /
> 
> I think he want to protect his machine from searchengine
> spiders and not from users want to download files.:)
> 
> it would be great if I could use wget for this task but
> now its not possible.:(
> 
> Thanks alot.
> 
> --
> Noèl Köthe



Re: LAN with Proxy, no Router

2002-04-10 Thread Jens Rösner

Hi Ian!

> > wgetrc works fine under windows (always has)
> > however, .wgetrc is not possible, but 
> > maybe . does mean "in root dir" under Unix?
> 
> The code does different stuff for Windows. Instead of looking for
> '.wgetrc' in the user's home directory, it looks for a file called
> 'wget.ini' in the directory that contains the executable. This does
> not seem to be mentioned anywhere in the documentation.
>
From my own experience, you are right concerning the location wget searches
for wgetrc on Windows.
However, a file called "wgetrc" is sufficient.
In fact, wgetrc.ini will not be found and thus 
its options are ignored.
 
CU
Jens






Re: LAN with Proxy, no Router

2002-04-09 Thread Jens Rösner

Hi!

Someone please slap me with a gigantic sledgehammer?!
*whump*
Thanks!
Oh man, how could I not see it?
I mean, I used the "index" search function in the wget.hlp file.
I should have searched the whole text.
Even with index search "proxies" is just one line above "proxy".
Oh well.

Here is the report:
Result:
Works fine under windows with firewall and proxy over LAN into the www.

How:
Just put 
http_proxy = http://proxy.server.com:1234/
into the wgetrc file.

Addition:
wgetrc works fine under windows (always has)
however, .wgetrc is not possible, but 
maybe . does mean "in root dir" under Unix?

Thanks anyway, I think I'll go to bed now, oh boy...

CU
Jens


Hrvoje Niksic wrote:
> 
> Jens Rösner <[EMAIL PROTECTED]> writes:
> 
> > Could someone please tell me, what
> > "the appropriate environmental variable" is
> > and how do I change it in Windows
> > or what else I need to do?
> 
> The variables are listed in the manual under Various->Proxies.  Here
> is the relevant part:
> 
> `http_proxy'
>  This variable should contain the URL of the proxy for HTTP
>  connections.
> 
> `ftp_proxy'
>  This variable should contain the URL of the proxy for FTP
>  connections.  It is quite common that HTTP_PROXY and FTP_PROXY are
>  set to the same URL.
> 
> `no_proxy'
>  This variable should contain a comma-separated list of domain
>  extensions proxy should _not_ be used for.  For instance, if the
>  value of `no_proxy' is `.mit.edu', proxy will not be used to
>  retrieve documents from MIT.
> 
> I'm no Windows expert, so someone else will need to explain how to set
> them up.
> 
> Another way is to tell Wget where the proxies are in its own config
> file, `.wgetrc'.  I'm not entirely sure how that works under Windows,
> but you should be able to create a `.wgetrc' file in your home
> directory and insert something like this:
> 
> use_proxy = on
> http_proxy = http://proxy.server.com:1234/
> ftp_proxy = http://proxy.server.com:1234/
> proxy_user = USER
> proxy_passwd = PASSWD



LAN with Proxy, no Router

2002-04-09 Thread Jens Rösner

Hi!

I recently managed to get my "big" machine online using a two PC 
(Windows boxes) LAN. 
A PI is the server, running both Zonealaram and Jana under Win98.
The first one a firewall, the second one a proxy programme.
On my client, an Athlon 1800+ with Windows 2000 
I want to work with wget and download files over http from the www.

For Netscape, I need to specify the LAN IP of the server as Proxy
address.
Setting up LeechFTP works similarly, IE is also set up (all three work).
But wget does not work the way I "tried".
I just basically started it, it failed (of course) 
and I searched the wget help and the www with google.
However, the only thing that looks remotely like what I need is

''
-Y on/off
--proxy=on/off

Turn proxy support on or off.  The proxy is on by default 
if the appropriate environmental variable is defined.
''

Could someone please tell me, what 
"the appropriate environmental variable" is
and how do I change it in Windows
or what else I need to do?

I'd expect something like
--proxy=on/off
--proxy-address
--proxy-user
--proxy-passwd
as a collection of proxy-related commands.
All except --proxy-address=IP exist, so it is apparently not necessary.

Kind regards
Jens



Re: wget usage

2002-04-05 Thread Jens Rösner

Hi Gérard!

I think you should have a look at the -p option.
It stands for "page requisites" and should do exactly what you want.
If I am not mistaken, -p was introduced in wget 1.8 
and improved for 1.8.1 (the current version).
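For example, something like this (the URL is just a placeholder):
wget -p http://www.example.com/page.html
should fetch page.html plus the gifs and other inlined elements it needs.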

CU
Jens

> I'd like to download a html file with its embedded
> elements (e.g. .gif files).

[PS: CC changed to the normal wget list]



Re: cuj.com file retrieving fails -why?

2002-04-03 Thread Jens Rösner

Hello Markus!

This is not a bug (I reckon) and should therefore have been sent to 
the normal wget list.

Using both wget 1.7.1 and 1.8.1 on Windows the file is 
downloaded with 

wget -d -U "Mozilla/5.0 (compatible; Konqueror/2.2.1; Linux)" -r
http://www.cuj.com/images/resource/experts/alexandr.gif

as well as with

wget http://www.cuj.com/images/resource/experts/alexandr.gif

So, I do not know what your problem is, but it is neither wget's nor cuj's
fault, AFAICT.

CU
Jens


> 
> This problem is independent on whether a proxy is used or not:
> The download hangs, though I can read the content using konqueror.
> So what do cuj people do to inhibit automatic download and how can
> I circumvent it?
> 
> 
> wget --proxy=off -d -U "Mozilla/5.0 (compatible; Konqueror/2.2.1; Linux)"
> -r http://www.cuj.com/images/resource/experts/alexandr.gif
> DEBUG output created by Wget 1.7 on linux.
> 
> parseurl ("http://www.cuj.com/images/resource/experts/alexandr.gif") ->
> host www.cuj.com -> opath images/resource/experts/alexandr.gif -> dir
> images/resource/experts -> file alexandr.gif -> ndir
> images/resource/experts
> newpath: /images/resource/experts/alexandr.gif
> Checking for www.cuj.com in host_name_address_map.
> Checking for www.cuj.com in host_slave_master_map.
> First time I hear about www.cuj.com by that name; looking it up.
> Caching www.cuj.com <-> 66.35.216.85
> Checking again for www.cuj.com in host_slave_master_map.
> --14:32:35--  http://www.cuj.com/images/resource/experts/alexandr.gif
>=> `www.cuj.com/images/resource/experts/alexandr.gif'
> Connecting to www.cuj.com:80... Found www.cuj.com in
> host_name_address_map: 66.35.216.85
> Created fd 3.
> connected!
> ---request begin---
> GET /images/resource/experts/alexandr.gif HTTP/1.0
> User-Agent: Mozilla/5.0 (compatible; Konqueror/2.2.1; Linux)
> Host: www.cuj.com
> Accept: */*
> Connection: Keep-Alive
> 
> ---request end---
> HTTP request sent, awaiting response...
> 
> nothing happens 
> 
> 
> Markus
> 





Re: option changed: -nh -> -nH

2002-04-03 Thread Jens Rösner

Hi Noèl!

-nh
and
-nH
are totally different.

from wget 1.7.1 (I think the last version to offer both):
`-nh'
`--no-host-lookup'
 Disable the time-consuming DNS lookup of almost all hosts (*note
 Host Checking::).

`-nH'
`--no-host-directories'
 Disable generation of host-prefixed directories.  By default,
 invoking Wget with `-r http://fly.srk.fer.hr/' will create a
 structure of directories beginning with `fly.srk.fer.hr/'.  This
 option disables such behavior.

For wget 1.8.x -nh became the default behavior.
Switching back to host-look-up is not possible.
I already complained that many old scripts now break and suggested 
that entering -nh at the command line would 
either be completely ignored or the user would be 
informed and wget executed nevertheless.
Apparently this was not regarded as useful.

CU
Jens


> 
> The option --no-host-directories
> changed from -nh to -nH (v1.8.1).
> 
> Is there a reason for this?
> It breaks a lot of scripts when upgrading,
> I think.
> 
> Could this be changed back to -nh?
> 
> Thank you.
> 
> -- 
>   Noèl Köthe
> 





Re: spanning hosts

2002-04-02 Thread Jens Rösner

Hi!

Ian wrote:
> Well I only said the URLs specified on the command line or by the
> --include-file option are always downloaded. I didn't intend this
> to be interpreted as also applying to URLs which Wget finds while
> examining the contents of the downloaded html files. At the moment,
> the domain acceptance/rejection checks are only performed when
> downloaded html files are examined for further URLs to be
> downloaded (for the --recursive and --page-requisites options),
> which is why it behaves as it does.

Ah! Now I understand, thanks for explaining again.


[host wildcards]
> > -Dbar.com behaves strictly: www.bar.com, www2.bar.com
> > -D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com
> > -D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com,
> > sex-bar.computer-dating.com
[...]
> It sounds like it should work okay. I'd prefer to let -Dbar.com
> also match fubar.com for compatibility's sake. If you wanted to
> match www.bar.com and www2.bar.com, but not www.fubar.com you
> could use -D.bar.com, but that wouldn't work if you wanted to
> match bar.com without the www (well, a leading . could be treated
> as a special case).

Sounds a bit more complicated to programme (that's why I did not suggest
it), 
but I must admit I am a fan of 
backwards compatibility :) so your version sounds like a good idea.

> It would be easiest and more consistent (currently) to use
> "shell-globbing" wildcards (as used for the file-acceptance
> rules) rather than grep/egrep-style wildcards.

Well, you got me once again.
Google found this page:
http://www.mkssoftware.com/docs/man1/grep.1.asp
Do I understand correctly that grep/egrep enables the user/programme to
search 
files (strings/records?) for a string expression?
While it appears (to me) to be more powerful than the mentioned
wildcards, 
I do not see the compelling reason to use it, as I think that wildcard
matching 
will work as well (apart from the consistency reason you mentioned).

CU
Jens



Re: spanning hosts

2002-03-28 Thread Jens Rösner

Howdy!

> > I came across a crash caused by a cookie
> > two days ago. I disabled cookies and it worked.
> I'm hoping you had debug output on when it crashed, otherwise this
> is a different crash to the one I already know about. Can you
> confirm this, please?

Yes, I had debug output on.

> > wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d
> > -R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html
> >
> > Result with 1.8.1 and 1.7.1 with -nh:
> > audistory.com: Only index.html
> > audistory.de: Everything
> > audi100-online: only the first page
> > kolaschnik.de: only the first page
> 
> Yes, that's how I thought it would behave. Any URLs specified on
> the command line or in a --include-file file are always downloaded
> regardless of the domain acceptance rules. 

Well, one page of a rejected URL is downloaded, not more.
Whereas the only accepted domain audistory.de gets downloaded
completely.
Doesn't this differ from what you just said?


> One of the changes you
> desire is that the domain acceptance rules should apply to these
> too, which sounds like a reasonable thing to expect.

That's my impression, too (obviously ;)


> > What I would have liked and expected:
> > audistory.com: Everything
> > audistory.de: Everything
> > audi100-online: Everything
> > kolaschnik.de: nothing
> 
> That requires the first change and also different domain matching
> rules. I don't think that should be changed without adding extra
> options to do so. The --domains and --exclude-domains lists are
> meant to be actual domain names. I.e. -Dbar.com is meant to match
> bar.com and foo.bar.com, and it's just a happy accident that it
> also matches fubar.com (perhaps it shouldn't, really). I think if
> someone specified -Dbar.com and it matched
> sex-bar.computer-dating.com, they might be a bit surprised!

Agreed! How about introducing "wildcards" like 
-Dbar.com behaves strictly: www.bar.com, www2.bar.com
-D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com
-D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com,
sex-bar.computer-dating.com
That would leave current command lines operational 
and introduce many possibilities without (too much) fuss.
Or have I overlooked anything here?

> > Independent from the the question how the string "audi"
> > should be matched within the URL, I think rejected URLs
> > should not be parsed or be retrieved.
> 
> Well they are all parsed before it is decided whether to retrieve
> them or not!

Ooopsie again. /me looks up "parse"
parse=analyse
Yes, I understand now!

Kind regards
Jens



Re: spanning hosts: 2 Problems

2002-03-28 Thread Jens Rösner

Hi again, Ian and fellow wgeteers!

> A debug log will be useful if you can produce one. 
Sure I (or wget) can and did.
It is 60kB of text. Zipping? Attaching?

> Also note that if receive cookies that expire around 2038 with
> debugging on, the Windows version of Wget will crash! (This is a
> known bug with a known fix, but not yet finalised in CVS.)
Funny you mention that! 
I came across a crash caused by a cookie 
two days ago. I disabled cookies and it worked.
Should have traced this a bit more.

> > I just installed 1.7.1, which also works breadth-first.
> (I think you mean depth-first.) 
*doh* /slaps forehead
Of course, thanks.

> used depth-first retrieval. There are advantages and disadvantages
> with both types of retrieval.
I understand, I followed (but not totally understood) 
the discussion back then.

> > Of course, this is possible.
> > I just had hoped that by combining
> > -F -i url.html
> > with domain acceptance would save me a lot of time.
> 
> Oh, I think I see what your first complaint is now. I initially
> assumed that your local html file was being served by a local HTTP
> server rather than being fed to the -F -i options. Is your complaint really that URLs
> supplied on the command line or via the
> -i option are not subjected to the acceptance/rejection rules? That
> does indeed seem to be the current behavior, but there is no
> particular reason why we couldn't apply the tests to these URLs as
> well as the URLs obtained through recursion.

Well, you are confusing me a bit ;}
Assume a file like



<a href="http://www.audistory-nospam.com">1</a>
<a href="http://www.audistory-nospam.de">2</a>
<a href="http://www.audi100-online-nospam.de">3</a>
<a href="http://www.kolaschnik-nospam.de">4</a>



and a command line like

wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d
-R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html

Result with 1.8.1 and 1.7.1 with -nh: 
audistory.com: Only index.html
audistory.de: Everything
audi100-online: only the first page 
kolaschnik.de: only the first page

What I would have liked and expected:
audistory.com: Everything
audistory.de: Everything
audi100-online: Everything
kolaschnik.de: nothing

Independent from the question how the string "audi" 
should be matched within the URL, I think rejected URLs 
should not be parsed or be retrieved.

I hope I could articulate what I wanted to say :)

CU
Jens



Re: spanning hosts: 2 Problems

2002-03-26 Thread Jens Rösner

Hi Ian!

[..]
> It's probably worth noting that the comparisons between the -D
> strings and the domains being followed (or not) is anchored at
> the ends of the strings, i.e. "-Dfoo" matches "bar.foo" but not
> "foo.bar".

*doh*
Thanks for the info.
I thought it would work similarly to the acceptance of files.


> > The first page of even the rejected hosts gets saved.
> That sounds like a bug.

Should I try to get a useful debug log? 
(It is Windows, so I do not know if it is helpful.)


[depth first]
> > Now, with downloading from many (20+) different servers, this is a bit
> > frustrating,
> > as I will probably have the first completely downloaded site in a few
> > days...
> 
> Would that be less of a problem if the first problem (first page
> > from rejected domains) was fixed?

Not really, the problems are quite different for me.


> 
> > Is there any other way to work around this besides installing wget 1.6
> > (or even 1.5?)
> No, 

I just installed 1.7.1, which also works breadth-first.
I now have two wget versions, no problem for me.


> but note that if you pass several starting URLs to Wget, it
> will complete the first before moving on to the second. That also
> works for the URLs in the file specified by the --input-file
> parameter. 

I know, I have used --input-file a great deal over the last few days.
It's great, but not really applicable in this circumstance,
as I did/do not want to manually extract the links from the html page.


> The other alternative is to run wget
> several times in sequence with different starting URLs and restrictions,
> perhaps using the --timestamping or --no-clobber
> options to avoid downloading things more than once.

Of course, this is possible.
I just had hoped that by combining 
-F -i url.html
with domain acceptance would save me a lot of time.

It now works okay -with 1.7.1- but a domain acceptance/rejection like I
said 
would be helpful for me. But I reckon not for many other users (right?).

CU
Jens



spanning hosts: 2 Problems

2002-03-26 Thread Jens Rösner

Hi wgeteers!

I am using wget to parse a local html file which has numerous links into
the www.
Now, I only want hosts that include certain strings like 
-H -Daudi,vw,online.de
Two things I don't like in the way wget 1.8.1 works on windows:

The first page of even the rejected hosts gets saved.
This messes up my directory structure as I force directories 
(which is my default and normally useful)

I am aware that wget has switched to breadth first (as opposed to
depth-first) 
retrieval.
Now, with downloading from many (20+) different servers, this is a bit
frustrating, 
as I will probably have the first completely downloaded site in a few
days...
Is there any other way to work around this besides installing wget 1.6
(or even 1.5?)

Thanks
Jens



Re: (Fwd) Proposed new --unfollowed-links option for wget

2002-03-08 Thread Jens Rösner

Hi List!

As a non-wget-programmer I also think that this 
option may be very useful.
I'd be happy to see it in wget soon :)
Just thought to drop in some positive feedback :)

CU
Jens

> >   -u,  --unfollowed-links=FILE  log unfollowed links to FILE.
> Nice. It sounds useful.



Re: maybe code from pavuk would help

2002-03-02 Thread Jens Rösner

Hi Noèl!

(message CC changed to normal wget list)

Rate-limiting is possible since wget 1.7.1 or so, please correct me if
it was 1.8!
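(From memory, and untested here: I think the switch is --limit-rate, e.g.
wget --limit-rate=20k http://www.example.com/bigfile.zip
with the URL being a placeholder.)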
requests for "http post" pop up occasionally, 
but as far as I am concerned, I don't need it and 
I think it is not in the scope of wget currently.
Filling out forms could probably be very useful for some users, I guess.
If it would be possible without too much fuss, 
I would encourage this, even though I would not need it.
BTW: Could you elaborate a bit more on the "..." part of your mail?
BTW2: Why did you send this to the bug list? (insert multiple question
marks here)

CU
Jens

Noel Koethe schrieb:
> 
> Hello,
> 
> I tested pavuk (http://www.pavuk.org/, GPL) and there are some features
> I miss in wget:
> 
> -supports HTTP POST requests
> -can automaticaly fill forms from HTML documents and make POST or GET
>  requestes based on user input and form content
> -you can limit transfer rate over network (speed throttling)
> ...
> 
> Maybe there is some code which could be used in wget.:)
> So the wheel wouldn't invented twice.
> 
> --
> Noèl Köthe



RE: Accept list

2002-02-28 Thread Jens Rösner

Hi Peter!
 
> I was using 1.5.3
> I am getting 1.8.1 now...
Good idea, but...

> > --accept="patchdiag.xref,103566*,103603*,103612*,103618*"
> > 112502.readme
> > 112504-01.zip
> > 112504.readme
> > 112518-01.zip
> > 112518.readme
[snip]
...look at the file names you want, none of them includes 103*, they all
start with 112*
So, wget works absolutely ok, I think.
Or am I missing something here?

CU
Jens





Re: -H suggestion

2002-01-16 Thread Jens Rösner

Hi Hrvoje!

> > requisites could be wherever they want. I mean, -p is already
> > ignoring -np (since 1.8?), what I think is also very useful.
> Since 1.8.1.  I considered it a bit more "dangerous" to allow
> downloading from just any host if the user has not allowed it
> explicitly.  

You are of course right.

> In either way, I was presented with a user interface problem.  I
> couldn't quite figure out how to arrange the options to allow for
> three cases:

Please don't hit me, if my suggestions do not follow wget naming
conventions

>  * -p gets stuff from this host only, including requisites.
I would make this the default -p behaviour, as it is the status quo.
(Great band btw)

>  * -p gets stuff from this host only, but requisites may span hosts.
How about --page-requisites-relaxed
Too long?
Do I understand correctly: --page-requisites-relaxed would 
let wget traverse only the base host, while the page requisites would
travel 
to hosts specified after -H -Dhost1.com,host2.com right?

>  * everything may span hosts.
Let -H ignore -p?
Ah, no *doh* doesn't work.
--page-requisites-open
?
But how would 
wget --page-requisites-open -H -Dhost1.com,host2.com URL
then be different from
wget -H -Dhost1.com,host2.com URL
?
And what if wget should travel host1.com and host2.com, but -p should 
only go to these two hosts and foo.com?
Ok, I think this problem may be a bit constructed.
And I am surely beginning to be confused at 3am.
Sorry.

CU
Jens



Re: How does -P work?

2002-01-14 Thread Jens Rösner

Hi Herold!

Thanks for the testing, I must admit, trying -nd did not occur to me :(

I already have implemented a \ to / conversion in my wgetgui, 
but forgot to strip the trailing / (as Hrvoje suggested) *doh*
Anyway, I would of course be happy to see a patch like you proposed, 
but I understand too little to judge where it belongs :}

CU
Jens

http://www.JensRoesner.de/wgetgui/


> Note: tests done on NT4. W9x probably would behave different (even
> worse).
> starting from (for example) c:, with d: being another writable disk of
> some kind, something like
> wget -nd -P d:/dir http://www.previnet.it
> does work as expected.
> wget -nd -P d:\dir http://www.previnet.it
> also does work as expected.
> wget -P d:\dir http://www.previnet.it
> did create a directory d:\@5Cdir and started from there, in other words
> the \ is converted by wget since it doesn't recognize it as a valid
> local directory separator.
> wget -P d:/dir http://www.previnet.it
> failed in a way or another for the impossibility to create the correct
> directory or use it if already present.
[snip]



Re: How does -P work?

2002-01-14 Thread Jens Rösner

Hi Hrvoje!

> > Can I use -P (Directory prefix) to save files in a user-determinded
> > folder on another drive under Windows?
> 
> You should be able to do that.  Try `-P C:/temp/'.  Wget doesn't know
> anything about windows backslashes, so maybe that's what made it fail.
> 
The problem with / and \ was already solved, thx.
The syntax folder/ is incorrect for wget on windows, it will try to save
to 
folder//url :(
Here is what I got with
wget -nc -x -P c:/temp -r -l0 -p -np -t10 -d -o minusp.log
http://www.jensroesner.de

###
DEBUG output created by Wget 1.8.1-pre3 on Windows.

Enqueuing http://www.jensroesner.de/ at depth 0
Queue count 1, maxcount 1.
Dequeuing http://www.jensroesner.de/ at depth 0
Queue count 0, maxcount 1.
--02:23:49--  http://www.jensroesner.de/
   => `c:/temp/www.jensroesner.de/index.html'
Resolving www.jensroesner.de... done.
Caching www.jensroesner.de => 212.227.109.232
Connecting to www.jensroesner.de[212.227.109.232]:80... connected.
Created socket 72.
Releasing 009A0730 (new refcount 1).
---request begin---
GET / HTTP/1.0

User-Agent: Wget/1.8.1-pre3

Host: www.jensroesner.de

Accept: */*

Connection: Keep-Alive



---request end---
HTTP request sent, awaiting response... HTTP/1.1 200 OK
Date: Mon, 14 Jan 2002 13:05:44 GMT
Age: 801
Server: Apache/1.3.22 (Unix)
Last-Modified: Sun, 28 Oct 2001 01:38:38 GMT
Accept-Ranges: bytes
Content-Type: text/html
Content-Length: 5371
Etag: "7f088c-14fb-3bdb619e"
Via: 1.1 NetCache (NetCache 4.1R5D2)


Length: 5,371 [text/html]
c:/temp/www.jensroesner.de: File
existsc:/temp/www.jensroesner.de/index.html: No such file or directory
Closing fd 72

Cannot write to `c:/temp/www.jensroesner.de/index.html' (No such file or
directory).

FINISHED --02:24:00--
Downloaded: 0 bytes in 0 files
###

Of course I have a c:/temp or c:\temp dir, 
but even if not, wget should create one, right?

CU
Jens



Re: Suggestion on job size

2002-01-11 Thread Jens Rösner

Hi Fred!

First, I think this would rather belong in the normal wget list, 
as I cannot see a bug here.
Sorry to the bug tracers, I am posting to the normal wget List and
cc-ing Fred, 
hope that is ok.

To your first request: -Q (Quota) should do precisely what you want.
I used it with -k and it worked very well.
Or am I missing your point here?
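Something along these lines (untested, the URL is a placeholder; I think
-Q accepts k/m suffixes):
wget -r -l0 -k -K -Q600m http://www.example.com/
which should stop starting new downloads once roughly 600 MB have been
fetched.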

Your second wish is AFAIK not possible now.
Maybe in the future wget could write the record 
of downloaded files in the appropriate directory.
After exiting wget, this file could then be used 
to process all the files mentioned in it.
Just an idea, I would normally not think that 
this option is an often requested one.
HOWEVER: 
-K works (if I understand it correctly) on the fly, as it decides during
the run whether the server file is newer, whether a previously converted
file exists, and what to do.
So, only -k would work after the download, right?

CU
Jens

http://www.JensRoesner.de/wgetgui/

> It would be nice to have some way to limit the total size of any job, and
> have it exit gracefully upon reaching that size, by completing the -k -K
> process upon termination, so that what one has downloaded is "useful."  A
> switch that would set the total size of all downloads --total-size=600MB
> would terminate the run when the total bytes downloaded reached 600 MB, and
> process the -k -K.  What one had already downloaded would then be properly
> linked for viewing.
> 
> Probably more difficult would be a way of terminating the run manually
> (Ctrl-break??), but then being able to run the -k -K process on the
> already-downloaded files.
> 
> Fred Holmes



How does -P work?

2002-01-05 Thread Jens Rösner

Hi!

Can I use -P (Directory prefix) to save files in a user-determined
folder on another drive under Windows?
I tried -PC:\temp\ which does not work (I am starting from D:\)
Also -P..\test\ would not save into the dir above the current one.
So I changed the \ into / and it worked.
However, I still could not save to another drive with -Pc:/temp
Any way around this? Bug/Feature? Windows/Unix problem?

CU
Jens



Re: -nh broken 1.8

2001-12-24 Thread Jens Rösner

Hi!

> > 2. Wouldn't it be a good idea to mention the 
> > deletion of the -nh option in a file?
> 
> Maybe.  What file do you have in mind?

First and foremost 
the "news" file, but I think it would also not be misplaced in 
wget.html and/or wget.hlp /.info (whatever it is called on Unix
systems).


> > 3. on a different aspect:
> > All command lines with -nh that were created before 1.8 are now
> > non-functional,
> Those command lines will need to be adjusted for the new Wget.  This
> is sad, but unavoidable.  Wget's command line options don't change
> every day, but they are not guaranteed to be cast in stone either.

I don't expect them to last forever, 
I just meant that simply ignoring -nh in wget 1.8 
would have been an easy way to avoid the hassle.
I am of course thinking of my wGetGUI, where "no host look-up" is an
option.
So, I now either have to explain to every user (who of course reads no
manual) 
that s/he should only use this option if they have an old wget version.
Or I could simply delete the -nh option and say that it is not important
enough 
for all the users of old wget versions.
And then there is the problem if someone upgrades from an old wget to a
new 
one, but keeps his/her old copy of wgetgui, which now of course produces 
invalid 1.8 command lines :(

CU
Jens

http://www.jensroesner.de/wgetgui/



Re: -nh broken 1.8

2001-12-24 Thread Jens Rösner

Hi Hrvoje!

> > -nh does not work in 1.8 latest windows binary.
> > By not working I mean that it is not recognized as a valid parameter.
> > (-nh is no-host look-up and with it on,
> > two domain names pointing to the same IP are treated as different)
> 
> You no longer need `-nh' to get that kind of behavior: it is now the
> default.

Ok, then three questions:

1. Is there then now a way to "turn off -nh"?
So that wget does not distinguish between domain names of the same IP?
Or is this option irrelevant given the net's current structure?

2. Wouldn't it be a good idea to mention the deletion 
of the -nh option in a file? 
Or was it already mentioned and I am too blind/stupid?

3. on a different aspect:
All command lines with -nh that were created before 1.8 are now
non-functional, 
except for the old versions of course.
Would it be possible that new wget versions just ignore it and 
older versions still work?
This would greatly enhance (forward) compatibility between different
versions, 
something I would regard as at least desirable.

CU
Jens



-nh broken 1.8

2001-12-24 Thread Jens Rösner

Hi!

I already posted this on the normal wget list, to which I am subscribed.
Problem:
-nh does not work in 1.8 latest windows binary.
By not working I mean that it is not recognized as a valid parameter.
(-nh is no-host look-up and with it on, 
two domain names pointing to the same IP are treated as different)

I am not sure which version first had this problem, but 1.7 did not show
it.
I really would like to have this option back.
Does anyone know where it is gone to?
Maybe doing holidays?

CU
Jens

http://www.jensroesner.de/wgetgui/



-nh -nH??

2001-12-23 Thread Jens Rösner

Hi wgeteers!

I noticed that -nh (no host look-up) seems to be gone in 1.8.1.
Is that right?
At first I thought, "Oh, you fool, it is -nH, you mixed it up"
But, obviously, these are two different options.
I read the "news" file and the wget.hlp and wget.html but could not find
an answer.
I always thought that this option is quite important nowadays?!

Any help appreciated.

CU and a Merry Christmas
Jens



Re: referer question

2001-09-13 Thread Jens Rösner

Hi Vladi!

If you are using Windows, you might try 
http://www.jensroesner.de/wgetgui/
It is a GUI for wGet written in VB 6.0.
If you click on the checkbox "identify as browser", wGetGUI
will create a command line like you want.
I use it and it works for me.
Hope this helps?
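(Roughly, and from memory, the generated command line looks like this,
with the URL and browser string only as placeholders:
wget -r -U "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)" --referer=http://www.example.com/ http://www.example.com/
That is, one fixed --referer for the whole run rather than a per-URL
hostname, which I hope is close enough to what you want.)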

CU
Jens

Vladi wrote:
>   is it possible (well I mean easy way:)) to make wget pass referer auto?
>   I mean for every url that wget tries to fetch to pass hostname as
> referer.
[snip]





wGetGUI now under the GPL

2001-06-23 Thread Jens Rösner

Hi guys&gals! ;)

I just wanted to let you know, that with v0.5 of wGetGUI, it is now
released under the 
GPL, so if you feel like modifying or laughing at the source code, you
can now do so.

CU
Jens

http://www.jensroesner.de/wgetgui



Re: Download problem

2001-06-14 Thread Jens Rösner

Hi!

For all who cannot download the windows binaries,
they are now available through my site:
http://www.jensroesner.de/wgetgui/data/wget20010605-17b.zip
And while you are there, why not download wGetGUI v0.4?
:) http://www.jensroesner.de/wgetgui 
If Heiko is reading this:
May I just keep the file on my site?
And make it available to the public?

CU
Jens



Re: Download problem

2001-06-14 Thread Jens Rösner

Hi Chad!

Strange, it works for me with this link 
http://space.tin.it/computer/hherold/wget20010605-17b.zip
the old binary "1.6" is not available.
If you cannot download it (have you tried with wGet? :),
I can mail it to you, or if more people have the problem, add it
temporarily to my site.

CU
Jens

(w)get wGetGUI v0.4 at
http://www.jensroesner.de/wgetgui

> I'm still unable to download wget binary from
> http://space.tin.it/computer/hherold/ for either 1.6 or 1.7 .  Anyone have a
> good link?



(complete) GUI for those Windows users

2001-06-12 Thread Jens Rösner

Hi there!

First, let me introduce myself:
I am studying mechanical engineering and for a lecture I am learning
Visual Basic.
I was looking for a non-brain-dead way to get used to it and when a
friend of mine told me that he finds wGet too difficult to use I just
went *bing*
So, look what I have done:
http://www.jensroesner.de/wgetgui
Yes, it is a GUI and yes, it is not as powerful as the command line
execution.
I understand that most people who will read this are Unix/Linux users
and as such might have no use for my programme.
However, I would like to encourage you to send me any tips and bug
reports you find.
As I have not yet subscribed to the wget list, I would appreciate a CC
to my 
e-mail address.

Thanks!
Jens