Re: links conversion; non-existent index.html
> The problem was that that link:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> instead of being properly converted to:
> http://mineraly.feedle.com/Ftp/UpLoad/

Or, in fact, wget's default: http://mineraly.feedle.com/Ftp/UpLoad/index.html

> was left like this on the main mirror page:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> and hence while clicking on it:
> "Not Found
> The requested URL /Mineraly/Ftp/UpLoad/index.html was not found on this
> server."

Yup. So I assume that the problem you see is not one of wget's mirroring, but a combination of saving to a custom dir (with --cut-dirs and the like) and conversion of the links. Obviously, the link to http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html, which would be correct for a standard "wget -m URL", was carried over, while the custom link to http://mineraly.feedle.com/Ftp/UpLoad/index.html was not created.

My test with wget 1.5 was just a simple "wget15 -m -np URL" and it worked. So maybe the convert/rename problem/bug was solved with 1.9.1. This would also explain the "missing" gif file, I think.

Jens

--
+++ GMX - die erste Adresse für Mail, Message, More +++ 10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
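For reference, a rough sketch of the two setups being compared here. The first line is the plain mirror mentioned above; the second is only a guess at the kind of custom-directory invocation that could produce the stale link, since the exact options used on the remote machine are not known:

  # plain mirror, as tested with wget 1.5
  wget -m -np http://znik.wbc.lublin.pl/Mineraly/

  # hypothetical custom layout: drop the host directory and the first path
  # component, save under ./mirror, and convert links afterwards
  wget -m -np -k -nH --cut-dirs=1 -P mirror http://znik.wbc.lublin.pl/Mineraly/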
Re: links conversion; non-existent index.html
> > Wget saves a mirror to your harddisk. Therefore, it cannot rely on an
> > apache server generating a directory listing. Thus, it created an index.html as
> Apparently you have not tried to open that link,

Which link? The non-working one on your incorrect mirror or the working one on my correct mirror on my HDD?

> got it now?

No need to get snappy, Andrzej.

From your other mail:

> No, you did not understand. I run wget on remote machines.

Ah! Sorry, missed that.

> The problem is
> solved though by running the 1.9.1 wget version.

I am still wondering, because even wget 1.5 correctly generates the index.html from the server output when called on my local box. I really do not know what is happening on your remote machine, but my wget 1.5 is able to mirror the site. It creates the Mineraly/Ftp/UpLoad/index.html file and the correct link to it. I understand that it is not what you want (having an index.html), but wget 1.5 creates a working mirror - as it is supposed to do.

CU
Jens

--
+++ Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl
Re: links conversion; non-existent index.html
Do I understand correctly that the mirror at feedle is created by you and wget?

> > Yes, because this is in the HTML file itself:
> > "http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html"
> > It does not work in a browser, so why should it work in wget?
> It works in the browser:
> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/
> There is no index.html and the content of the directory is displayed.

I assume I was confused by the different sites you wrote about. I was sure that both included the same link to ...index.html and the same gif-address.

> http://znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/index.html
> The link was not converted properly, it should be:
> http://mineraly.feedle.com/Ftp/UpLoad/
> and it should be without any index.html, because there is none in the
> original.

Wget saves a mirror to your harddisk. Therefore, it cannot rely on an apache server generating a directory listing. Thus, it created an index.html, as Tony Lewis explained. Now, _you_ uploaded (if I understood correctly) the copy from your HDD but did not save the index.html. Otherwise it would be there and it would work.

Jens

--
+++ GMX - die erste Adresse für Mail, Message, More +++ 10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
Re: links conversion; non-existent index.html
> I know! But that is intentionally left without index.html. It should
> display content of the directory, and I want that wget mirror it
> correctly.
> Similar situation is here:
> http://chemfan.pl.feedle.com/arch/chemfanftp/
> it is left intentionally without index.html so that people could download
> these archives.

Is something wrong with my browser? This does not look like a simple directory listing; the file has formatting and even a background image. http://chemfan.pl.feedle.com/arch/chemfanftp/ looks the same as http://chemfan.pl.feedle.com/arch/chemfanftp/index.html in my Mozilla, and wget downloads it correctly.

> If wget put here index.html in the mirror of such site
> then there will be no access to these files.

IMO, this is not correct. index.html will include the info the directory listing contains at the point of download. This works for me with znik.wbc.lublin.pl/Mineraly/Ftp/UpLoad/ as well - which seemed to be the problem according to your other post.

> Well, if wget "has to" put index.html is such situations then wget is not
> suitable for mirroring such sites,

What exactly do you mean? It seems to work for me, e.g. index.html looks like the apache-generated directory listing. When mirroring, index.html will be re-written if/when it has changed on the server since the last mirroring.

> and I expect that problem to be
> corrected in future wget versions.

You "expect"??

Jens

--
+++ Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl
Re: newbie question
Hi Alan! As the URL starts with https, it is a secure server. You will need to log in to this server in order to download stuff. See the manual for info how to do that (I have no experience with it). Good luck Jens (just another user) > I am having trouble getting the files I want using a wildcard > specifier (-A option = accept list). The following command works fine to get an > individual file: > > wget > https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/160RDTEN_FY06PB.pdf > > However, I cannot get all PDF files this command: > > wget -A "*.pdf" > https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/ > > Instead, I get: > > Connecting to 164.224.25.30:443 . . . connected. > HTTP request sent, awaiting response . . . 400 Bad Request > 15:57:52 ERROR 400: Bad Request. > >I also tried this command without success: > > wget > https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/*.pdf > > Instead, I get: > > HTTP request sent, awaiting response . . . 404 Bad Request > 15:57:52 ERROR 404: Bad Request. > > I read through the manual but am still having trouble. What am I > doing wrong? > > Thanks, Alan > > > -- +++ NEU: GMX DSL_Flatrate! Schon ab 14,99 EUR/Monat! +++ GMX Garantie: Surfen ohne Tempo-Limit! http://www.gmx.net/de/go/dsl
Re: wget 1.9.1 with large DVD.iso files
Hi Sanjay! This is a known issue with wget until 1.9.x. wget 1.10, which is currently in alpha status, fixes this problem. I do not know how much experience you have with this kind of stuff, but you could download the alpha source code and compile&test it. CU Jens (just another user) > wget 1.9.1 fails when trying to download a very large file. > > The download stopped in between and attempting to resume shows a negative > sized balance to be downloaded. > > e.g.ftp://ftp.solnet.ch/mirror/SuSE/i386/9.2/iso/SUSE-Linux-9.2-FTP-DVD.iso > 3284710 KB > -- Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl
Re: File rejection is not working
Hi Jerry! AFAIK, RegExp for (HTML?) file rejection was requested a few times, but is not implemented at the moment. CU Jens (just another user) > The "-R" option is not working in wget 1.9.1 for anything but > specifically-hardcoded filenames.. > > file[Nn]ames such as [Tt]hese are simply ignored... > > Please respond... Do not delete my email address as I am not a > subscriber... Yet > > Thanks > > Jerry > -- Sparen beginnt mit GMX DSL: http://www.gmx.net/de/go/dsl
Re: -X regex syntax? (repost)
Hi Vince!

> So, so far these don't work for me:
>
> --exclude-directories='*.backup*'
> --exclude-directories="*.backup*"
> --exclude-directories="*\.backup*"

Would -X"*backup" be OK for you? If yes, give it a try. If not, I think you'd need the correct escaping for the "."; I have no idea how to do that, but http://mrpip.orcon.net.nz/href/asciichar.html lists %2E as the code. Does this work?

CU
Jens

> I've also tried this on my linux box running v1.9.1 as well. Same results.
> Any other ideas?
>
> Thanks a lot for your tips, and quick reply!
>
> /vjl/

--
Lassen Sie Ihren Gedanken freien Lauf... z.B. per FreeSMS GMX bietet bis zu 100 FreeSMS/Monat: http://www.gmx.net/de/go/mail
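To make the suggestion above concrete, one possible invocation (host and path are invented for illustration; the quotes keep the shell from expanding the pattern before wget sees it, and -X also accepts a comma-separated list of patterns):

  wget -r -np -X "*backup*" http://www.example.com/archive/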
Re: perhaps a Bug? "No such file or directory"
Moin Michael! http://www.heise.de/ttarif/druck.shtml?function=preis&Standort=0251&NEB=354&DAAuswahl=1234&CbCoA=alle&CbCmA=keine&Pre=keine&DA=folgende&Anbietername=False&Tarifnamen=True&Zonennamen=False&Netzzahl=True&Rahmen=True&Berechnung=real&Laenge=3&Abrechnung=False&Tag=1&Easy=True&0190=No&; > This link works in a browser! Ok, I can reproduce this error in Win2000, even with a quoted URL. The reason is -I guess- the filename length limit of 255 characters. The heise link has around 280! Workaround: wget -O tarife.html [Other Options] URL saves the file to tarife.html Good luck Jens (just another user) -- GMX im TV ... Die Gedanken sind frei ... Schon gesehen? Jetzt Spot online ansehen: http://www.gmx.net/de/go/tv-spot
--cache=off: misleading manual?
Hi Wgeteers! I understand that -C off as short for --cache=off was dropped, right? However, the wget.html that comes with Herold's Windows binary mentions only --no-cache and the wgetrc option cache = on/off I just tried 1.9+cvs-dev-200404081407 unstable development version and --cache=off still works. I think this is not the latest cvs version and possibly the manual will be updated accordingly. But I think it would be nice to mention that --cache=off still works for backwards compatibility. I am aware that there are bigger tasks (LFS) currently, I just stumbled over this "issue" and thought I'd mention it. I hope I am not missing something! Jens
Re: wget 1.9.1
Hi Gerriet! > Only three images, which were referenced in styles.css, were missing. Yes, wget does not parse css or javascript. > I thought that the -p option "causes Wget to download all the files > that are necessary to properly display a given HTML page. This includes > such things as inlined images, sounds, and referenced stylesheets." The referenced stylesheet whatever.css should have been saved. > Yes, it does not explicitly mention images referenced in style sheets, > but it does claim to download everything "necessary to properly display > a given HTML page". I think this paragraph is misleading. As soon as JavaScript or CSS are involved in certain ways (like displaying images), -p will not be able to fully display the site. > So - is this a bug, no - it is a missing feature. > did I misunderstand the documentation, somehow > did I use the wrong options? kind of, but as the right options don't exist, you are not to blame ;) > Should I get a newer version of wget? 1.9.1 is the latest stable version according to http://wget.sunsite.dk/ CU Jens (just another user) -- GMX ProMail mit bestem Virenschutz http://www.gmx.net/de/go/mail +++ Empfehlung der Redaktion +++ Internet Professionell 10/04 +++
Re: Recursion limit on 'foreign' host
Hi Manlio!

If I remember correctly, this request was brought up quite some time ago and quite a few people thought it was a good idea. But I think it has not yet been implemented in a patch.

CU
Jens

> Hi.
> wget is very powerfull and well designed, but there is a problem.
> There is no way to limit the recursion depth on foreign host.
> I want to use -m -H options because some sites have resources (binary
> files) hosted on other host,
> it would be nice to say:
>
> wget -m -H --external-level=1 ...
>
> Thanks and regards Manlio Perillo
>

--
GMX ProMail mit bestem Virenschutz http://www.gmx.net/de/go/mail +++ Empfehlung der Redaktion +++ Internet Professionell 10/04 +++
Re: img dynsrc not downloaded?
dynsrc is Microsoft DHTML for IE, if I am not mistaken. As wget is -thankfully- not MS IE, it fails. I just did a quick google and it seems that the use of dynsrc is not recommended anyway.

What you can do is to download http://www.wideopenwest.com/~nkuzmenko7225/Collision.mpg

Jens
(and before you ask, no I am not a developer of wget, just a user)

> Hello.
> Wget could not follow dynsrc tags; the mpeg file was not downloaded:
>
> at
> http://www.wideopenwest.com/~nkuzmenko7225/Collision.htm
>
> Regards,
> Juhana
>

--
GMX ProMail mit bestem Virenschutz http://www.gmx.net/de/go/mail +++ Empfehlung der Redaktion +++ Internet Professionell 10/04 +++
Re: feature request: treating different URLs as equivalent
Hi Seb! I am not sure if I understand your problem completely, but if you don't mind, I'll try to help anyway. Have you tried wget --cut-dirs=1 --directory-prefix=www.foo.com --span-hosts --domains=www.foo.com,foo.com,www.foo.org www.foo.com I think that could work. Maybe you'll need to add --no-clobber? CU Jens > The other day I wanted to use wget to create an archive of the entire > www.cpa-iraq.org website. It turns out that http://www.cpa-iraq.org, > http://cpa-iraq.org, http://www.iraqcoalition.org and > http://iraqcoalition.org all contain identical content. Nastily, absolute > links to sub-URLs of all of those hostnames are sprinkled throughout the > site(s). > > To be sure of capturing the whole site, then, I need to tell wget to > follow links between the four domains I gave above. But because the site > does sometimes use relative links, that ends up with the site content > spread across 4 directories with much (but not complete) duplication > between them. This is wasteful and messy.
Re: Cannot WGet Google Search Page?
Hi Phil!

Without more info (wget's verbose or even debug output, full command line, ...) I find it hard to tell what is happening. However, I have had very good success with wget and google. So, some hints:

1. Protect the google URL by enclosing it in quotes (").
2. Remember to span (and allow only certain) hosts; otherwise wget will only download google pages.

And lastly - but you obviously did so - think about restricting the recursion depth.

Hope that helps a bit
Jens

> I have been trying to wget several levels deep from a Google search page
> (e.g., http://www.google.com/search?=deepwater+oil). But on the very first
> page, wget returns a 403 Forbidden error and stops. Anyone know how I can
> get around this?
>
> Regards, Phil
> Philip E. Lewis, P.E.
> [EMAIL PROTECTED]
>

--
"Sie haben neue Mails!" - Die GMX Toolbar informiert Sie beim Surfen! Jetzt aktivieren unter http://www.gmx.net/info
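Putting those hints together, the command might take roughly this shape (untested; the query is adapted from Phil's example and the domain list is a pure placeholder for whatever external sites the results should be followed to):

  wget -r -l2 -H -Dsomecompany.com,anothersite.org "http://www.google.com/search?q=deepwater+oil"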
Re: Startup delay on Windows
[...]
> Cygwin considers `c:\Documents and Settings\USERNAME' to be the
> home directory. I wonder if that is reachable through registry...
>
> Does anyone have an idea what we should consider the home dir under
> Windows, and how to find it?

Doesn't this depend on each user's personal preference? I think most could live with c:\Documents and Settings\all users (or whatever it is called in each language) or the cygwin approach c:\Documents and Settings\USERNAME, which will be less likely to conflict with security limits on multi-user PCs, I think. I personally would like to keep everything wget-ish in the directory its exe is in and treat that as its home dir.

BTW: Is this bug connected to the bug under Windows, that saving into another directory than wget's starting dir by using the -P (--directory-prefix) option does not work when switching drives?

wget -r -P C:\temp URL      will save to .\C3A\temp\*.*
wget -r -P 'C:\temp\' URL   will save to .\'C3A\temp\'\*.*
wget -r -P "C:\temp\" URL   does not work at all ('Missing URL' error)

however

wget -r -P ..\temp2\ URL    works like a charm.

CU
Jens

--
GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
Re: skip robots
> You're close. You forgot the `-u' option to diff (very important),
> and you snipped the beginning of the `patch' output (also important).

Ok, I forgot the -u switch, which was stupid as I actually read the command line in the patches file :( But concerning the snipping, I just did diff > file.txt, so I cannot have snipped anything. Is my shell (win2000) doing something wrong, or is the missing bit there now (when using the -u switch)?

Jens

Once more:

Patch sum up:
a) Tell users how to --execute more than one wgetrc command
b) Tell about and link to --execute when listing wgetrc commands.
Reason: Better understanding and navigating the manual

ChangeLog entry:
Changed wget.texi concerning --execute switch to facilitate use and user navigation.

Start patch:

--- wget.texi	Sun Nov 09 00:46:32 2003
+++ wget_mod.texi	Sun Feb 08 20:46:07 2004
@@ -406,8 +406,10 @@
 @itemx --execute @var{command}
 Execute @var{command} as if it were a part of @file{.wgetrc}
 (@pxref{Startup File}).  A command thus invoked will be executed
-@emph{after} the commands in @file{.wgetrc}, thus taking precedence over
-them.
+@emph{after} the commands in @file{.wgetrc}, thus taking precedence over
+them. If you need to use more than one wgetrc command in your
+command-line, use -e preceding each.
+
 @end table
 
 @node Logging and Input File Options, Download Options, Basic Startup Options, Invoking
@@ -2147,8 +2149,9 @@
 integer, or @samp{inf} for infinity, where appropriate.  @var{string}
 values can be any non-empty string.
 
-Most of these commands have command-line equivalents (@pxref{Invoking}),
-though some of the more obscure or rarely used ones do not.
+Most of these commands have command-line equivalents (@pxref{Invoking}). Any
+wgetrc command can be used in the command-line by using the -e (--execute) (@pxref{Basic Startup Options}) switch.
+
 
 @table @asis
 @item accept/reject = @var{string}

--
GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
Re: skip robots
Hi Hrvoje!

> In other words, save a copy of wget.texi, make the change, and send the
> output of `diff -u wget.texi.orig wget.texi'. That's it.

Uhm, ok. I found diff for windows among other GNU utilities at http://unxutils.sourceforge.net/ if someone is interested.

> distribution. See
> http://cvs.sunsite.dk/viewcvs.cgi/*checkout*/wget/PATCHES?rev=1.5

Thanks, I tried to understand that. Let's see if I understood it. Sorry if I am not sending this to the patches list; the document above says that it is ok to evaluate the patch with the general list.

CU
Jens

Patch sum up:
a) Tell users how to --execute more than one wgetrc command
b) Tell about and link to --execute when explaining wgetrc commands.
Reason: Better understanding and navigating the manual.

ChangeLog entry:
Changed wget.texi concerning --execute switch to facilitate use and user navigation.

Start patch:

409,410c409,412
< @emph{after} the commands in @file{.wgetrc}, thus taking precedence over
< them.
---
> @emph{after} the commands in @file{.wgetrc}, thus taking precedence over
> them. If you need to use more than one wgetrc command in your
> command-line, use -e preceding each.
>
2150,2151c2152,2154
< Most of these commands have command-line equivalents (@pxref{Invoking}),
< though some of the more obscure or rarely used ones do not.
---
> Most of these commands have command-line equivalents (@pxref{Invoking}). Any
> wgetrc command can be used in the command-line by using the -e (--execute) (@pxref{Basic Startup Options}) switch.
>

--
GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
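For anyone repeating this later, the workflow Hrvoje describes boils down to keeping a pristine copy of the file and diffing against it with -u (file names here are just an example):

  cp wget.texi wget.texi.orig
  # ... edit wget.texi ...
  diff -u wget.texi.orig wget.texi > wget-texi.patch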
Re: Why no -nc with -N?
Hi Dan,

I must admit that I don't fully understand your question.

-nc means no-clobber; that means that files which already exist locally are not downloaded again, independent of their age or size or whatever. -N means that only newer files are downloaded (or files whose size differs). So these two options are mutually exclusive.

I could imagine that you want something like

wget --no-clobber --keep-server-time URL

right? If I understand the manual correctly, this date should normally be kept for http, at least if you specify

wget URL

I just tested this and it works for me. (With -S and/or -s you can print the http headers, if you need to.) However, I noticed that quite many servers do not provide a last-modified header.

Did this answer your question?
Jens

> I'd love to have an option so that, when mirroring, it
> will backup only files that are replaced because they
> are newer on the source system (time-stamping).
>
> Is there a reason these can't be enabled together?
>
> __
> Do you Yahoo!?
> Yahoo! SiteBuilder - Free web site building tool. Try it!
> http://webhosting.yahoo.com/ps/sb/
>

--
GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
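To illustrate the difference described above (the URL is a placeholder):

  wget -N http://www.example.com/file.zip    # re-download only if the remote copy is newer or has a different size
  wget -nc http://www.example.com/file.zip   # never re-download once a local copy exists
  wget -S http://www.example.com/file.zip    # also print the HTTP response headers, e.g. Last-Modified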
Re: skip robots
Use

robots = on/off

in your wgetrc, or

wget -e robots=off URL

(or robots=on) on your command line.

Jens

PS: One note to the manual editor(s?): The -e switch could be (briefly?) mentioned also at the "wgetrc commands" paragraph. I think it would make sense to mention it there again without cluttering the manual too much. Currently it is only mentioned in "Basic Startup Options" (and in an example dealing with robots). Opinions?

> I onced used the "skip robots" directive in the wgetrc file.
> But I can't find it anymore in wget 1.9.1 documentation.
> Did it disapeared from the doc or from the program ?
>
> Please answer me, as I'm not subscribed to this list
>

--
GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
Re: downloading multiple files question...
Hi Ron!

If I understand you correctly, you could probably use the accept-list option, which exists in three forms:

-A acclist
--accept acclist
accept = acclist   (wgetrc)

So, probably (depending on your site), the syntax should be something like:

wget -r -A *.pdf URL
wget -r -A *.pdf -np URL

or, if you have to recurse through multiple html files, it could be necessary/beneficial to use

wget -r -l0 -A *.pdf,*.htm* -np URL

Hope that helps (and is correct ;) )
Jens

> In the docs I've seen on wget, I see that I can use wildcards to
> download multiple files on ftp sites. So using *.pdf would get me all
> the pdfs in a directory. It seems that this isn't possible with http
> sites though. For work I often have to download lots of pdfs when
> there's new info I need, so is there any way to download multiple files
> of the same type from an http web page?
>
> I'd like to be cc'd in replies to my post please as I'm not subscribed
> to the mailing list.
>

--
GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
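One practical note on top of the commands above: depending on the shell, the accept pattern may need quoting so that the shell does not expand *.pdf itself before wget sees it. A hedged example with an invented URL:

  wget -r -l0 -np -A "*.pdf,*.htm*" http://www.example.com/reports/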
RE: apt-get via Windows with wget
Hi Heiko! > > Until now, I linked to your main page. > > Would you mind if people short-cut this? > Linking to the directory is bad since people would download Sorry, I meant linking directly to the "latest" zip. However, I personally prefer to read what the provider (in this case you) has to say about a download anyway. > Do link to the complete url if you prefer to, although I like to keep > some stats. Understood. > for example since start of the year > there have been 7 referrals from www.jensroesner.de/wgetgui Wow, that's massive... ...not! ;-) > Since that is about 0.05% stats shouldn't be > altered too much if you link directly to the archive ;-) Thanks for pointing that out ;-} > > What do you think about adding a "latest-ssl-libraries.zip"? > I don't think so. > If you get the "latest complete wget" archive those are included anyway > and you are sure it will work. Oh I'm very sorry, must have overread/misunderstood that. I thought the "latest" zip would not contain the SSLs. That's great. > I'd prefer to not force a unneeded (admittedly, small) download by bundling > the ssl libraries in every package. Very true. Especially as wget seems to be used by quite some people on slow connections. Kind regards Jens -- GMX ProMail (250 MB Mailbox, 50 FreeSMS, Virenschutz, 2,99 EUR/Monat...) jetzt 3 Monate GRATIS + 3x DER SPIEGEL +++ http://www.gmx.net/derspiegel +++
Re: problem with LF/CR etc.
Hi Hrvoje and all others, > > It would do away with multiple (sometimes obscure) options few > > users use and combine them in one. > You don't need bitfields for that; you can have an option like > `--strict-html=foo,bar,baz' where one or more of "foo", "bar" and > "baz" are recognized and internally converted to the appropriate > bitmasks. That way the user doesn't need to remember the numbers and > provide the sum. #slapsforeheadwithhand# Of course, thanks! > > I meant that in this case the user can still change wget's behaviour > > by using the "strict comment parsing" option. I think that > > contradicts what you said about wget dealing with a situation of bad > > HTML all by itself. > > I don't see the contradiction: handling bad HTML is *on* by default, > no user assistance is required. If the user requests > standard-compliant behavior, then HTML with broken comments will no > longer work, but the person who chose the option is probably well > aware of that. I thought you disliked the idea of a --lax-LFCR switch because of the additionally necessary user interaction _and_ the fact that it would create another option. To argue against the first point, I mentioned the --strict-comments switch (which requires similar user interaction) and for the latter problem I suggested combining several --lax-foo and --strict-foo switches into one. > I think you missed the point that I consider the --strict-foo (or > --lax-foo) switches undesirable, and that the comment parsing switch > is an intentional exception because the issue has been well-known for > years. I am aware that you are hesitating to implement even more options. And I think you are right to do so! In my first mail I just wanted to say, that _if_ it is really, really necessary to have another --lax-foo or --strict-foo switch, it could be combined with the already existing --strict-comments switch to yield a tidy command-line. I can't comment on the necessity of a --lax-LFCR switch in whatever appearance. I also don't know how difficult coding it would be. I just wanted to provide some ideas to implement it, if it turns out to be indeed necessary. Sorry for any turbulence I created :( CU Jens -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: problem with LF/CR etc.
Hi Hrvoje and all others,

> It forces the user to remember what each number means, and then have
> to add those numbers in his head. Unix utilities already have the
> reputation of being user-unfriendly; why make things worse?

It would do away with multiple (sometimes obscure) options few users use and combine them in one. Those users who use these options are (I would think) those who can deal with a bitcode option. All others would (have to) live with the default setting. I don't want to push anyone towards a bitcode by any means!! Just my $0.02 on options.

> > But I think wget is already breaking this rule with the
> > implementation of comment-parsing, or am I mistaken?
> You are mistaken. Lax comment parsing is on by default,

Sorry, I think I did not make myself understood: I know that lax comment parsing is "on" by default. I meant that in this case the user can still change wget's behaviour by using the "strict comment parsing" option. I think that contradicts what you said about wget dealing with a situation of bad HTML all by itself. Also, to my mind, the difference between --stricthtmlrules and --laxhtmlrules is just a negation of the syntax.

> > We could make "full relaxation" the default and use the inverted
> > option --stricthtmlrules, to exclude certain relaxations. This is
> > probably more "automatic downloading"ish.
> I don't like this idea because it means additional code to handle
> something that noone really cares about anyway ("strict HTML").

Again I don't understand: What I suggested is more or less to generalize the "strict html comments" option towards other cases of wrong HTML. I really don't see a fundamental difference here. The better documentation and wider occurrence of bad comments are reasons why "strict html comments" is a more needed option than one to deal with LF/CR. If you say that an option for LF/CR is not needed, I trust you. But I still do not see a fundamental difference in implementation for the user.

CU
Jens

--
GMX Weihnachts-Special: Seychellen-Traumreise zu gewinnen! Rentier entlaufen. Finden Sie Rudolph! Als Belohnung winken tolle Preise. http://www.gmx.net/de/cgi/specialmail/ +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: problem with LF/CR etc.
Hi, just an additional remark to my own posting.

> > There is no easy way to punish the culprit. The only thing you can do
> > in the long run is refuse to interoperate with something that openly
> > breaks applicable standards. Otherwise you're not only rewarding the
> > culprit, but destroying all the other tools because they will sooner
> > or later collapse under the weight of kludges needed to support the
> > broken HTML.
>
> I can't argue with that.
> However, from the _user's_ point of view, _wget_ would seem to be broken,
> as the user's webbrowser probably shows everything correctly.
> If it is decided that wget does not consider links with LF/CR
> in them, then IMHO, the user should get informed what happened.

I just realized that this would mean that wget has to be able to detect the breaking of rules and then give a message to the user. That creates a situation where:

a) wget has to be smart enough to notice that _something_ is broken (and not just a 404)
b) wget would ideally be smart enough to know _what_ is broken
c) the user thinks: Well, if wget knows what is wrong, why doesn't wget correct it?

On the other hand, not giving even a brief message like "Invalid HTML code found, downloaded files may be unwanted." would leave the user in the dark. I don't know how to balance that :(

CU
Jens

--
GMX Weihnachts-Special: Seychellen-Traumreise zu gewinnen! Rentier entlaufen. Finden Sie Rudolph! Als Belohnung winken tolle Preise. http://www.gmx.net/de/cgi/specialmail/ +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: problem with LF/CR etc.
Hi all!

> There is no easy way to punish the culprit. The only thing you can do
> in the long run is refuse to interoperate with something that openly
> breaks applicable standards. Otherwise you're not only rewarding the
> culprit, but destroying all the other tools because they will sooner
> or later collapse under the weight of kludges needed to support the
> broken HTML.

I can't argue with that. However, from the _user's_ point of view, _wget_ would seem to be broken, as the user's webbrowser probably shows everything correctly. If it is decided that wget does not consider links with LF/CR in them, then IMHO, the user should get informed what happened.

> > But if (for whatever reasons) an option is unavoidable, I would
> > suggest something like
> > --relax_html_rules #integer
> > where integer is a bit-code (I hope that's the right term).
>
> This is not what GNU options usually look like and how they work
> (underscores in option name, bitfields).

underscores: Sorry, I just gave an example, I'm not a GNUer ;)
bitfields: Ok. Any (short) reason for that? Is it considered not transparent, or ugly?

> But more importantly, I
> really don't think this kind of option is appropriate. Wget should
> either detect the brokenness and handle it automatically, or refuse to
> acknowledge it altogether. The worst thing to do is require the user
> to investigate why the HTML didn't parse, only to discover that Wget
> in fact had the ability to process it, but didn't bother to do so by
> default.

Hm, well, I can see your point there. But I think wget is already breaking this rule with the implementation of comment-parsing, or am I mistaken? We could make "full relaxation" the default and use the inverted option --stricthtmlrules, to exclude certain relaxations. This is probably more "automatic downloading"ish.

CU
Jens

--
GMX Weihnachts-Special: Seychellen-Traumreise zu gewinnen! Rentier entlaufen. Finden Sie Rudolph! Als Belohnung winken tolle Preise. http://www.gmx.net/de/cgi/specialmail/ +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: problem with LF/CR etc.
Hi!

> Do you propose that squashing newlines would break legitimate uses of
> unescaped newlines in links?

I personally think that this is the main question. If it doesn't break other things, implement "squashing newlines" as the default behaviour.

> Or are you arguing on principle that
> such practices are too heinous to cater to by default?

Well, if I may speak openly, I don't think wget should be a moralist here. If the fix is easy to implement and doesn't break things, let's do it. After all, ignoring these links does not punish the culprit (the HTML coder) but the innocent user, who expects that wget will download the site.

> IMHO we should either cater to this by default or not at all.

Agreed. But if (for whatever reasons) an option is unavoidable, I would suggest something like

--relax_html_rules #integer

where integer is a bit-code (I hope that's the right term). For example:

0 = off
1 (2^0) = smart comment checking
2 (2^1) = smart line-break checking
4 (2^2) = option to come
8 (2^3) = another option to come

So specifying

wget -m --relax_html_rules 0 URL

would ensure strict adherence to the HTML rules, while

wget -m --relax_html_rules 15 URL

would relax the above-mentioned rules. By using this bit-code, one integer is able to represent all combinations of relaxations by summing up the individual options. One could even think about

wget -m --relax_html_rules inf URL

to ensure that _all_ rules are relaxed, to be upward compatible with future wget versions. Whether --relax_html_rules inf or --relax_html_rules 0 or --relax_html_rules another-combination-that-makes-most-sense should be the default is up to negotiation. However, I would vote for complete relaxation.

I hope that made a bit of sense
Jens

--
GMX Weihnachts-Special: Seychellen-Traumreise zu gewinnen! Rentier entlaufen. Finden Sie Rudolph! Als Belohnung winken tolle Preise. http://www.gmx.net/de/cgi/specialmail/ +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: How to send line breaks for textarea tag with wget
Hi Jing-Shin! > Thanks for the pointers. Where can I get a version that support > the --post-data option? My newest version is 1.8.2, but it doesn't > have this option. -JS Current version is 1.9.1. The wget site lists download options on http://wget.sunsite.dk/#downloading Good luck Jens -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: Web page "source" using wget?
Hi Hrvoje!

> > retrieval, eventhough the cookie is there. I think that is a
> > correct behaviour for a secure server, isn't it?
> Why would it be correct?

Sorry, I seem to have been misled by my own (limited) experience: From the few secure sites I use, most will not let you log in again after you closed and restarted your browser or redialed your connection. That's what reminded me of Suhas' problem.

> Even if it were the case, you could tell Wget to use the same
> connection, like this:
> wget http://URL1... http://URL2...

Right, I always forget that, thanks!

Cya
Jens

--
NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: Web page "source" using wget?
Hi Suhas! Well, I am by no means an expert, but I think that wget closes the connection after the first retrieval. The SSL server realizes this and decides that wget has no right to log in for the second retrieval, eventhough the cookie is there. I think that is a correct behaviour for a secure server, isn't it? Does this make sense? Jens > A slight correction the first wget should read: > > wget --save-cookies=cookies.txt > http://customer.website.com/supplyweb/general/default.asp?UserAccount=U > SER&AccessCode=PASSWORD&Locale=en-us&TimeZone=EST:-300&action-Submi > t=Login > > I tried this link in IE, but it it comes back to the same login screen. > No errors messages are displayed at this point. Am I missing something? > I have attached the "source" for the login page. > > Thanks, > Suhas > > > - Original Message - > From: "Suhas Tembe" <[EMAIL PROTECTED]> > To: "Hrvoje Niksic" <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Sent: Monday, October 13, 2003 11:53 AM > Subject: Re: Web page "source" using wget? > > > I tried, but it doesn't seem to have worked. This what I did: > > wget --save-cookies=cookies.txt > http://customer.website.com?UserAccount=USER&AccessCode=PASSWORD&Loca > le=English (United States)&TimeZone=(GMT-5:00) Eastern Standard Time > (USA & Canada)&action-Submit=Login > > wget --load-cookies=cookies.txt > http://customer.website.com/supplyweb/smi/inventorystatus.asp?cboSupplier > =4541-134289&status=all&action-select=Query > --http-user=4542-134289 > > After executing the above two lines, it creates two files: > 1). "[EMAIL PROTECTED]" : I can see that > this file contains a message (among other things): "Your session has > expired due to a period of inactivity" > 2). "[EMAIL PROTECTED]" > > Thanks, > Suhas > > > - Original Message - > From: "Hrvoje Niksic" <[EMAIL PROTECTED]> > To: "Suhas Tembe" <[EMAIL PROTECTED]> > Cc: <[EMAIL PROTECTED]> > Sent: Monday, October 13, 2003 11:37 AM > Subject: Re: Web page "source" using wget? > > > > "Suhas Tembe" <[EMAIL PROTECTED]> writes: > > > > > There are two steps involved: > > > 1). Log in to the customer's web site. I was able to create the > following link after I looked at the section in the "source" as > explained to me earlier by Hrvoje. > > > wget > http://customer.website.com?UserAccount=USER&AccessCode=PASSWORD&Loca > le=English (United States)&TimeZone=(GMT-5:00) Eastern Standard Time > (USA & Canada)&action-Submit=Login > > > > Did you add --save-cookies=FILE? By default Wget will use cookies, > > but will not save them to an external file and they will therefore be > > lost. > > > > > 2). Execute: wget > > > > http://customer.website.com/InventoryStatus.asp?cboSupplier=4541-134289 > &status=all&action-select=Query > > > > For this step, add --load-cookies=FILE, where FILE is the same file > > you specified to --save-cookies above. > > -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
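A condensed sketch of the cookie handling discussed in this thread (the URLs, form fields and file name are taken loosely from the quoted commands and are only placeholders; the main addition is that the URLs are quoted so the shell does not mangle the & characters):

  wget --save-cookies=cookies.txt "http://customer.website.com/supplyweb/general/default.asp?UserAccount=USER&AccessCode=PASSWORD&action-Submit=Login"
  wget --load-cookies=cookies.txt "http://customer.website.com/supplyweb/smi/inventorystatus.asp?cboSupplier=4541-134289&status=all&action-select=Query"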
Re: Error: wget for Windows.
Hi Suhas! > I am trying to use wget for Windows & get this message: "The ordinal 508 > could not be located in the dynamic link library LIBEAY32.dll". You are very probably using the wrong version of the SSL files. Take a look at http://xoomer.virgilio.it/hherold/ Herold has nicely rearranged the links to wget binaries and the SSL binaries. As you can see, different wget versions need different SSL versions- Just download the matching SSL, everything else should then be easy :) Jens > > This is the command I am using: > wget http://www.website.com --http-user=username > --http-passwd=password > > I have the LIBEAY32.dll file in the same folder as the wget. What could > be wrong? > > Thanks in advance. > Suhas > -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: no-clobber add more suffix
Hi Sergey!

-nc does not only apply to .htm(l) files. All files are considered. At least in all wget versions I know of.

I cannot comment on your suggestion to restrict -nc to a user-specified list of file types. I personally don't need it, but I could imagine certain situations where this could indeed be helpful. Hopefully someone with more knowledge than me can elaborate a bit more on this :)

CU
Jens

> `--no-clobber' is very usfull option, but i retrive document not only with
> .html/.htm suffix.
>
> Make addition option that like -A/-R define all allowed/rejected rules
> for -nc option.
>

--
NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
Bug in Windows binary?
Hi! I downloaded wget 1.9 beta 2003/09/29 from Heiko http://xoomer.virgilio.it/hherold/ along with the SSL binaries. wget --help and wget --version will work, but any downloading like wget http://www.google.com will immediately fail. The debug output is very brief as well: wget -d http://www.google.com DEBUG output created by Wget 1.9-beta on Windows. set_sleep_mode(): mode 0x8001, rc 0x8000 I disabled my wgetrc as well and the output was exactly the same. I then tested wget 1.9 beta 2003/09/18 (earlier build!) from the same place and it works smoothly. Can anyone reproduce this bug? System is Win2000, latest Service Pack installed. Thanks for your assistance and sorry if I missed an earlier report of this bug, I know a lot has been done over the last weeks and I may have missed something. Jens -- NEU FÜR ALLE - GMX MediaCenter - für Fotos, Musik, Dateien... Fotoalbum, File Sharing, MMS, Multimedia-Gruß, GMX FotoService Jetzt kostenlos anmelden unter http://www.gmx.net +++ GMX - die erste Adresse für Mail, Message, More! +++
Re: The Dynamic link Library LIBEAY32.dll
Hi Stacee, a quick cut'n'paste into google revealed the following page: http://curl.haxx.se/mail/archive-2001-06/0017.html Hope that helps Jens > Stacee Kinney wrote: > > Hello, > > I installed Wget.exe on a Windows 2000 system and has setup Wget.exe > to run a maintenance file on an hourly bases. However, I am getting > the following error. > > wget.exe - Unable to Locate DLL > > The dynamic link library LIBEAY32.dll could not be found in the > specified path > >C:\WINNT;,;C:\WINNT\System32;C:\WINNT\system;c:\WINNT;C:\Perl\bin;C\WINNT\system32;C;WINNT;C:\WINNT\system32\WBEM. > > I am not at all knowledgeable about Wget and just tried to follow > instructions for its installation to run the maintenance program. > Could you please help me with this problem and the DLL file Wget is > looking for? > > Regards > Stacee
Re: wget -m imply -np?
Hi Karl!

From my POV, the current set-up is the best solution. Of course, I am also no developer, but an avid user. Sometimes you just don't know the structure of the website in advance, so using -m as a trouble-free no-brainer will get you the complete site neatly done with timestamps.

BTW, -m is an abbreviation: -m = -r -l0 -N, IIRC.

If you _know_ that you don't want to grab upwards, just add -np and you're done. Otherwise someone would have to come up with a switch to disable the default -np that you suggested, or the user would have to rely on the single options that -m is made of - hassle without benefit.

You furthermore said: "generally, that leads to the whole Internet". That is wrong, if I understand you correctly. Wget will always stay at the start-host, except when you allow different hosts via a smart combination of -D -H -I switches.

H2H
Jens

Karl Berry wrote:
>
> I wonder if it would make sense for wget -m (--mirror) to imply -np
> (--no-parent). I know that I, at least, have no interest in ever
> mirroring anything "above" the target url(s) -- generally, that leads to
> the whole Internet. An option to explicitly include the parents could
> be added.
>
> Just a suggestion. Thanks for the great software.
>
> [EMAIL PROTECTED]
Re: Improvement: Input file Option
Hi Pi!

Copied straight from the wget.hlp:

#
-i file
--input-file=file
    Read URLs from file, in which case no URLs need to be on the command line. If there are URLs both on the command line and in an input file, those on the command lines will be the first ones to be retrieved. The file need not be an HTML document (but no harm if it is)--it is enough if the URLs are just listed sequentially. However, if you specify --force-html, the document will be regarded as html. In that case you may have problems with relative links, which you can solve either by adding <base href="url"> to the documents or by specifying --base=url on the command line.

-F
--force-html
    When input is read from a file, force it to be treated as an HTML file. This enables you to retrieve relative links from existing HTML files on your local disk, by adding <base href="url"> to HTML, or using the --base command-line option.

-B URL
--base=URL
    When used in conjunction with -F, prepends URL to relative links in the file specified by -i.
#

I think that should help, unless I am missing your point.

CU
Jens

Thomas Otto wrote:
>
> Hi!
>
> I miss an option to use wget with a local html file that I have
> downloaded and maybe already edited. Wget should take this file plus the
> option where this file originally came from and take this file instead
> of the first document it gets after connecting.
>
> -Thomas
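In other words, the two typical invocations would look roughly like this (file names and the base URL are made up for the example):

  wget -i urls.txt
  wget -F -B http://www.example.com/ -i saved-page.html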
Re: -p is not respected when used with --no-parent
Hi Dominic!

Since wget 1.8, the following should be the case:

"*** When in page-requisites (-p) mode, no-parent (-np) is ignored when retrieving for inline images, stylesheets, and other documents needed to display the page."

(Taken from the included NEWS file of wget 1.8.1.)

I however remember that I once had the same problem, that -p -np will only get page requisites under or at the current directory. I currently run wget 1.9-beta and haven't seen the problem yet.

CU
Jens

Dominic Chambers wrote:
>
> Hi again,
>
> I just noticed that one of the inline images on one of the jobs I did
> was not included. I looked into this, and it was because it was it
> outside the scope that I had asked to remain within using --no-parent.
> So I ran the job again, but using the -p option to ensure that I kept
> to the right pages, but got all the page requesites regardless.
>
> However, this had no effect, and I therefore assume that this option
> is not compatible with the --no-parent option:
>
> wget -p -r -l0 -A htm,html,png,gif,jpg,jpeg
> --no-parent http://java.sun.com/products/jlf/ed2/book/index.html
>
> Hope that helps, Dominic.
Re: user-agent string
Hi Jakub!

"But I get the same files as running this coomand without using user-agent string."

What is wrong with the files you get? Do you not get all the files?

Many servers (sites) do not care which user-agent accesses them. So the files will not differ. If you know that you don't get all the files (or the wrong ones), it may be that you should ignore robots via the wgetrc command

robots = on/off

or you need a special referrer if you want to start in the "middle" of the site.

CU
Jens

Jakub Grosman wrote:
>
> Hi all,
> I am using wget a long time ago and it is realy great utility.
> I run wget 1.8.22 on redhat 7.3 and my problem is concerning the user agent
> string.
> I run this command:
> wget -m --user-agent="Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)" -l0 -H
> -Dsite -nd -np -Pdirectory http://site
>
> But I get the same files as running this coomand without using user-agent
> string.
> Could someone explain me, what I am making not correct?
>
> Thanks
> Jakub
Re: getting the correct links
Hi! Max' hint is incorrect I think, as -m includes -N (timestamps) and -r (recursive) Furthermore, I remember that wget http://www.host.com automatically defaults to recursive, not sure at the moment, sorry. I think Christopher's problem is -nd This means "no directories" and results in all files being written to the directory wget is started from (or via -P told to save to). So, if I am right, all files, even from the server subdirectories are there, Chris, just not neatly saved to local subdirs. Could you confirm this? If so, just leave out -nd and it should work. A single file is per default saved into the wget dir, with -x (force dirs) you can save it to the full path locally. wget offers numerous ways to cut the path, please look it up in the manual, if interested. CU Jens > Christopher Stone wrote: > > Thank you all. > > > > Now the issue seems to be that it only gets the root > > directory. > > > > I ran 'wget -km -nd http://www.mywebsite.com > > -r > > Max. > -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
Re: getting the correct links
Hi Chris! Using the -k switch (convert local files to relative links) should do what you want. CU Jens Christopher Stone wrote: > > Hi. > > I am new to wget, and although it doesn't seem to > difficult, I am unable to get the desired results that > I am looking for. > > I currently have a web site hosted by a web hosting > site. I would like to take this web site as is and > bring it to my local web server. Obviously, the ip > address and all the links point back to this web > server. > > When I ran wget and sucked the site to my local box, > it pulled all the pages down and the index page comes > up fine, but when I click on a link, it goes back to > the remote server. > > What switch(s) do I use, so that when I pull the pages > to my box, that all of the links are changed also? > > Thank you. > > Chris > > please cc to me, as i am not a list subscriber. > > __ > Do You Yahoo!? > Yahoo! Finance - Get real-time stock quotes > http://finance.yahoo.com
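In command form, the suggestion above might look like this (www.mywebsite.com is the placeholder from the original question; -r is assumed because the whole site is wanted):

  wget -r -k http://www.mywebsite.com/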
Re: I'm Sorry But...
I am not sure if I understand you correctly, but wget (in http mode) cannot view the directory listing. This is only possible in ftp mode, but of course you can only use ftp on an ftp-server. On an http server, wget can only follow from link to link or download a "hidden" file if you enter its URL. CU Jens À̺´Èñ wrote: > > I'm sorry, but I have no community to know about my question. > > I want to know ... > > How can I scan the Web Server's all data using the wget. > > I want to get all files under "DocumentRoot". (ex. when using apache Web Server ... >httpd.conf) > > I want to get all files under "DocumentRoot" ALL FILES that not linked and just >exist under "DocumentRoot" > > How can I do?? > > wget -r http://www.aaa.org was not scan all files under "DocumentRoot"... > > Best Regard
-r -R.exe downloads exe file
Hi! With wget 1.9-beta, wget will download .exe files although they should be rejected (-r -R.exe). After the download, wget removes the local file. I understand that html files are downloaded even if -R.html,.htm is specified as the links that may be included in them have to be parsed. However, I think, this makes no sense for .exe files and wanted to ask if this behaviour of wget maybe could get reconsidered. Kind regards Jens
Syntax for "exclude_directories"?
Hi guys!

Could someone please explain to me how to use -X (exclude_directories; --exclude-directories) correctly on Windows machines? I tried

wget -X"/html" -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X"html" -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X html -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X/html -x -k -r -l0 http://home.arcor.de/???/versuch.html
wget -X'/html' -x -k -r -l0 http://home.arcor.de/???/versuch.html

All will traverse into the http://home.arcor.de/???/html folder. I also tried the wgetrc version with either quotes, slashes or combinations. I had a look into the wget documentation html file, but could not find my mistake. I tried both wget 1.5 and 1.9-beta.

Kind regards
Jens
Re: speed units
Hi Joonas!

There was a lengthy discussion about this topic a few months ago. I am pretty sure (= I hope) that no one wants to revamp this (again). I personally think that if people start regarding this as a "bug", wget is damn close to absolute perfection. (Yes, I know, perfection is per definitionem "complete", so that is a pleonasm.)

If you are really interested, do
a) a search in Google
b) a search in the wget mailing list archive

CU
Jens

Joonas Kortesalmi wrote:
>
> Wget seems top repots speeds with wrong units. It uses for example "KB/s"
> rather than "kB/s" which would be correct. Any possibility to fix that? :)
>
> K = Kelvin
> k = Kilo
>
> Propably you want to use small k with download speeds, right?
>
> Thanks a lot anyways for such a great tool!
> Keep up the good work, free software rules the world!
>
> --
> Joonas Kortesalmi <[EMAIL PROTECTED]>
Re: robots.txt
Hi Pike and the list! > >> > or your indexing mech might loop on it, or crash the server. who knows. > >> I have yet to find a site which forces wGet into a "loop" as you said. > I have a few. And I have a few java servers on linux that really hog the > machine when requested. They're up for testing. Ok, I am sorry, I always thought that when something like this happens the person causing the "loop" would suffer most and therefore be punished directly. I did not imagine that the server could really go down in case of that constellation. > >> If the robots.txt said that no user-agent may access the page, you would > >> be right. > right. or if it says some page is only ment for one specific bot. > these things have a reason. Yes, you are right. I concluded from my own experience that most robots.txt say: "If you are a browser or google (and so on), go ahead, if you are anything else, stop." Allowing a certain bot to a bot-specific page was outside my scope. CU Jens
Re: robots.txt
Hi!

> >>> Why not just put "robots=off" in your .wgetrc?
> hey hey
> the "robots.txt" didn't just appear in the website; someone's
> put it there and thought about it. what's in there has a good reason.

Well, from my own experience, the #1 reason is that webmasters do not want webgrabbers of any kind to download the site, in order to force the visitor to interactively browse the site and thus click advertisement banners.

> The only reason is
> you might be indexing old, doubled or invalid data,

That is cute, someone who believes that all people in the internet do what they do to make life easier for everyone. If you said "one reason is" or even "one reason might be", I would not be that cynical, sorry.

> or your indexing mech might loop on it, or crash the server. who knows.

I have yet to find a site which forces wGet into a "loop" as you said. Others on the list probably can estimate the theoretical likelihood of such events.

> ask the webmaster or sysadmin before you 'hack' the site.

LOL! hack! Please provide a serious definition of "to hack" that includes "automatically downloading pages that could be downloaded with any interactive web-browser". If the robots.txt said that no user-agent may access the page, you would be right. But then: How would anyone know of the existence of this page?

[rant]
Then again, maybe the page has a high percentage of cgi, JavaScript, iFrames and thus only allows IE 6.0.123b to access the site. Then wget could maybe slow down the server, especially as it is probably a w-ows box :> But I ask: Is this a bad thing? Whuahaha!
[/rant]

Ok, sorry for my sarcasm, but I think you overestimate the benefits of robots.txt for mankind.

CU
Jens
Re: A couple of wget newbie questions (proxy server and multiplefiles)
Hi Dale!

> Do I have to do 4 separate logins passing my username/passowrd each time?
> If not, how do I list the 4 separate directories I need to pull files from
> without performing 4 logins?

You should be able to put the four URLs into a .txt file and then use this txt-file with -i filename.txt. I use windows, so if you run Linux your file extension may differ (right?). Also please note that I have neither used wget on a password-protected site nor on ftp, so I may be wrong here.

> We are behind a firewall, I can't see how to pass the proxy server IP
> address to wget. And unfortunately, our IT group will not open up a hole for
> me to pull these files.

No problem, use

proxy = on
http_proxy = IP/URL
ftp_proxy = IP/URL
proxy_user = username
proxy_password = proxypass

in your wgetrc. This is also included in the wget manual, but I, too, was too dumb to find it. ;)

CU
Jens
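Putting both answers together, a rough sketch (host names, paths and credentials are pure placeholders; wget also accepts credentials directly in ftp:// URLs, which is one way to avoid four interactive logins):

  # urls.txt
  ftp://user:password@ftp.example.com/dir1/
  ftp://user:password@ftp.example.com/dir2/
  ftp://user:password@ftp.example.com/dir3/
  ftp://user:password@ftp.example.com/dir4/

  wget -i urls.txt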
Re: Following absolute links wite wget
Hi S.G., If I am not completely asleep after yesterday, -np means --no-parent and restricts recursion to go ONLY down the directory tree, whereas ../dir1/file.ext means in HTML: Go UP one level and then into the directory dir1. So, remove the -np and you should be fine. CU Jens "S.G" wrote: > > I am having trouble instructing wget to follow links > such as ../dir1/file.ext. > > Is there a certain way to go about following such > links. > > currently i am trying ./wget -r -np -L > http://www.addr.com/filelist.htm > > Thanks for any and all help on this.
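A minimal sketch of the suggested fix, using the same URL from the question:

  # without -np, wget is allowed to ascend to the parent directory that ../dir1/file.ext points to
  wget -r -L http://www.addr.com/filelist.htm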
Re: Using Wget in breadth-first search (BFS) mode
Hi Evgeniy! > ==> My question is whether this change is indeed operation and > stable for deployment. Has BFS indeed replaced the DFS > as the primary recursion mechanism in Wget ? From my own personal experience, BFS is stable and the only recursion mechanism for wget 1.8+. The last version with DFS is 1.7.x, which is also stable but lacks some other features of 1.8.x. I don't know the scope of your project, but if you want to compare DFS and BFS, I think comparing 1.7.1 and 1.8.1 is as close to ideal as it gets (maybe Prolog would be another, less useful but truer, way of doing this ;) CU Jens -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
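If you wanted to see the two retrieval orders in practice, one hedged way would be to run both binaries side by side and compare the logs; the binary names and URL below are placeholders for locally installed 1.7.1 and 1.8.1 builds:

  # the order of the "--URL" lines in each log shows depth-first (1.7.x) vs breadth-first (1.8.x)
  ./wget-1.7.1 -r -l2 -nv -o dfs.log http://example.com/
  ./wget-1.8.1 -r -l2 -nv -o bfs.log http://example.com/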
Re: Ftp user/passwd
Hi Jim! This is no bug, should have been sent to the normal wget-List, CC changed. If I remember correctly, you can specify all .wgetrc commands on the command line with the -e option. I hope this is valid for "sensitive" data like logins, too. I read that using -e "wgetrc switch" works on windows, note the "", don't know about correct *nix syntax. CU Jens > Hello, > > I was wondering if it's possible to specify ftp-login and password on > the commandline as well, instead of putting it in .wgetrc? > > Thanks > Jim > [snip] -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
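Two hedged sketches of how FTP credentials might be passed without a .wgetrc (host, user and password are placeholders; the .wgetrc command names for FTP logins have changed between wget versions, so double-check yours):

  # credentials embedded in the URL (beware: visible in shell history and process lists)
  wget "ftp://myuser:mypass@ftp.example.com/pub/file.tar.gz"

  # or via -e, assuming the old-style .wgetrc commands "login" and "passwd"
  wget -e "login=myuser" -e "passwd=mypass" "ftp://ftp.example.com/pub/file.tar.gz"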
Re: Feature request
Hi Brix! > >> It also seems these options are incompatible: > >> --continue with --recursive [...] > JR> How should wget decide if it needs to "re-get" or "continue" the file? [...] Brix: > Not wanting to repeat my post from a few days ago (but doing so nevertheless) the one way > without checking all files online is to have wget write the downloaded > file into a temp file (like *.wg! or something) and renaming it only > after completing the download. Sorry for not paying attention. It sounds like a good idea :) But I am no coder... CU Jens
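The proposed behaviour can be approximated today with a small wrapper; this is only a rough stand-in (file and URL are made up), since it renames on completion but does not add any resume logic:

  # download to a temporary name, rename only if wget exits successfully
  wget -O big.iso.part http://example.com/big.iso && mv big.iso.part big.iso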
Re: Feature request
Hi Frederic! > I'd like to know if there is a simple way to 'mirror' only the images > from a galley (ie. without thumbnails). [...] I won't address the options you suggested, because I think they should be evaluated by a developer/coder. However, as I often download galleries (and have some myself), I might be able to give you a few hints: restricting files to be downloaded by a) file name and b) the directory they are in. To a): -R*.gif,*tn*,*thumb*,*_jpg*,*small* you get the picture I guess (pun not intended, but funny nevertheless). Works quite well. To b): --reject-dir *thumb* (I am not sure about the correct spelling/syntax, I currently have neither wget nor winzip -or similar- on this machine, sorry!) A sketch combining both follows below. > It also seems these options are incompatible: > --continue with --recursive > This could be useful, imho. IIRC, you are correct, but this is intentional. (right?) You probably think of the case where during a recursive download, the connection breaks and a large file is only partially downloaded. I could imagine that this might be useful. However, I see a problem when using timestamps, which normally require that a file be re-downloaded if the local and server sizes do not match, or if the date on the server is newer. How should wget decide if it needs to "re-get" or "continue" the file? You could probably do some "smart guessing", but the chance of false decisions persists. As a matter of fact, the problem also exists when using --continue on a single file, but then it is the user's decision and the story is therefore quite different (I think). CU Jens -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
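A sketch that combines both hints (the reject patterns and URL are examples only, and the directory filter is spelled -X/--exclude-directories in the manual, which is worth verifying against your wget version):

  # skip thumbnails both by file name and by directory name
  wget -r -l1 -R "*.gif,*tn*,*thumb*,*small*" -X "*thumb*" http://example.com/gallery/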
Re: Validating cookie domains
Hi Ian! > > This is amazingly stupid. > It seems to make more sense if you subtract one from the number of > periods. That was what I thought, too. > Could you assume that all two-letter TLDs are country-code TLDs and > require one more period than other TLDs (which are presumably at > least three characters long)? No, I don't think so. Take my sites, for example http://www.ichwillbagger-ladida.de http://ichwillbagger-ladida.de (remove the -ladida) both work. Or - as another phenomenon I found - take http://www.uvex-ladida.de and http://uvex-ladida.de (remove the -ladida) They are different... I hope I did not miss your point. CU Jens -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
Re: feature wish: switch to disable robots.txt usage
Hi! Just to be complete, thanks to Hrvoje's tip, I was able to find -e command --execute command Execute command as if it were a part of .wgetrc (see Startup File.). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. I always wondered about that. *sigh* I can now think about changing my wgetgui in this aspect :) Thanks again Jens Hrvoje Niksic wrote: > > Noel Koethe <[EMAIL PROTECTED]> writes: > > > Ok got it. But it is possible to get this option as a switch for > > using it on the command line? > > Yes, like this: > > wget -erobots=off ...
Re: feature wish: switch to disable robots.txt usage
Hi Noel! Actually, this is possible. I think at least since 1.7, probably even longer. Cut from the doc: robots = on/off Use (or not) /robots.txt file (see Robots.). Be sure to know what you are doing before changing the default (which is on). Please note: This is a (.)wgetrc-only command. You cannot use it on the command line, if I am not mistaken. CU Jens Noel Koethe wrote: > > Hello, > > is it possible to get a new option to disable the usage > of robots.txt (--norobots)? > > for example: > I want to mirror some parts of http://ftp.de.debian.org/debian/ > but the admin have a robots.txt > > http://ftp.de.debian.org/robots.txt > User-agent: * > Disallow: / > > I think he want to protect his machine from searchengine > spiders and not from users want to download files.:) > > it would be great if I could use wget for this task but > now its not possible.:( > > Thanks alot. > > -- > Noèl Köthe
Re: LAN with Proxy, no Router
Hi Ian! > > wgetrc works fine under windows (always has) > > however, .wgetrc is not possible, but > > maybe . does mean "in root dir" under Unix? > > The code does different stuff for Windows. Instead of looking for > '.wgetrc' in the user's home directory, it looks for a file called > 'wget.ini' in the directory that contains the executable. This does > not seemed to be mentioned anywhere in the documentation. From my own experience, you are right concerning the location wget searches for wgetrc on Windows. However, a file called "wgetrc" is sufficient. In fact, wgetrc.ini will not be found and thus its options are ignored. CU Jens -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
Re: LAN with Proxy, no Router
Hi! Someone please slap me with a gigantic sledgehammer?! *whump* Thanks! Oh man, how could I not see it? I mean, I used the "index" search function in the wget.hlp file. I should have searched the whole text. Even with index search "proxies" is just one line above "proxy". Oh well. Here is the report: Result: Works fine under windows with firewall and proxy over LAN into the www. How: Just put http_proxy = http://proxy.server.com:1234/ into the wgetrc file. Addition: wgetrc works fine under windows (always has) however, .wgetrc is not possible, but maybe . does mean "in root dir" under Unix? Thanks anyway, I think I'll go to bed now, oh boy... CU Jens Hrvoje Niksic wrote: > > Jens Rösner <[EMAIL PROTECTED]> writes: > > > Could someone please tell me, what > > "the appropriate environmental variable" is > > and how do I change it in Windows > > or what else I need to do? > > The variables are listed in the manual under Various->Proxies. Here > is the relevant part: > > `http_proxy' > This variable should contain the URL of the proxy for HTTP > connections. > > `ftp_proxy' > This variable should contain the URL of the proxy for FTP > connections. It is quite common that HTTP_PROXY and FTP_PROXY are > set to the same URL. > > `no_proxy' > This variable should contain a comma-separated list of domain > extensions proxy should _not_ be used for. For instance, if the > value of `no_proxy' is `.mit.edu', proxy will not be used to > retrieve documents from MIT. > > I'm no Windows expert, so someone else will need to explain how to set > them up. > > Another way is to tell Wget where the proxies are in its own config > file, `.wgetrc'. I'm not entirely sure how that works under Windows, > but you should be able to create a `.wgetrc' file in your home > directory and insert something like this: > > use_proxy = on > http_proxy = http://proxy.server.com:1234/ > ftp_proxy = http://proxy.server.com:1234/ > proxy_user = USER > proxy_passwd = PASSWD
LAN with Proxy, no Router
Hi! I recently managed to get my "big" machine online using a two PC (Windows boxes) LAN. A PI is the server, running both Zonealaram and Jana under Win98. The first one a firewall, the second one a proxy programme. On my client, an Athlon 1800+ with Windows 2000 I want to work with wget and download files over http from the www. For Netscape, I need to specify the LAN IP of the server as Proxy address. Setting up LeechFTP works similarly, IE is also set up (all three work). But wget does not work the way I "tried". I just basically started it, it failed (of course) and I searched the wget help and the www with google. However, the only thing that looks remotely like what I need is '' -Y on/off --proxy=on/off Turn proxy support on or off. The proxy is on by default if the appropriate environmental variable is defined. '' Could someone please tell me, what "the appropriate environmental variable" is and how do I change it in Windows or what else I need to do? I'd expect something like --proxy=on/off --proxy-address --proxy-user --proxy-passwd as a collection of proxy-related commands. All except --proxy-address=IP exist, so it is apparently not necessary. Kind regards Jens
Re: wget usage
Hi Gérard! I think you should have a look at the -p option. It stands for "page requisites" and should do exactly what you want. If I am not mistaken, -p was introduced in wget 1.8 and improved for 1.8.1 (the current version). CU Jens > I'd like to download a html file with its embedded > elements (e.g. .gif files). [PS: CC changed to the normal wget list]
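A minimal example of the kind of call meant here (the URL is a placeholder; adding -k is optional but makes the saved page viewable offline):

  # fetch one page plus the images, stylesheets etc. it needs to display
  wget -p -k http://example.com/page.html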
Re: cuj.com file retrieving fails -why?
Hello Markus! This is not a bug (I reckon) and should therefore have been sent to the normal wget list. Using both wget 1.7.1 and 1.8.1 on Windows the file is downloaded with wget -d -U "Mozilla/5.0 (compatible; Konqueror/2.2.1; Linux)" -r http://www.cuj.com/images/resource/experts/alexandr.gif as well as with wget http://www.cuj.com/images/resource/experts/alexandr.gif So, I do not know what your problem is, but it is neither wget's nor cuj's fault, AFAICT. CU Jens > > This problem is independent on whether a proxy is used or not: > The download hangs, though I can read the content using konqueror. > So what do cuj people do to inhibit automatic download and how can > I circumvent it? > > > wget --proxy=off -d -U "Mozilla/5.0 (compatible; Konqueror/2.2.1; Linux)" > -r http://www.cuj.com/images/resource/experts/alexandr.gif > DEBUG output created by Wget 1.7 on linux. > > parseurl ("http://www.cuj.com/images/resource/experts/alexandr.gif";) -> > host www.cuj.com -> opath images/resource/experts/alexandr.gif -> dir > images/resource/experts -> file alexandr.gif -> ndir > images/resource/experts > newpath: /images/resource/experts/alexandr.gif > Checking for www.cuj.com in host_name_address_map. > Checking for www.cuj.com in host_slave_master_map. > First time I hear about www.cuj.com by that name; looking it up. > Caching www.cuj.com <-> 66.35.216.85 > Checking again for www.cuj.com in host_slave_master_map. > --14:32:35-- http://www.cuj.com/images/resource/experts/alexandr.gif >=> `www.cuj.com/images/resource/experts/alexandr.gif' > Connecting to www.cuj.com:80... Found www.cuj.com in > host_name_address_map: 66.35.216.85 > Created fd 3. > connected! > ---request begin--- > GET /images/resource/experts/alexandr.gif HTTP/1.0 > User-Agent: Mozilla/5.0 (compatible; Konqueror/2.2.1; Linux) > Host: www.cuj.com > Accept: */* > Connection: Keep-Alive > > ---request end--- > HTTP request sent, awaiting response... > > nothing happens > > > Markus > -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
Re: option changed: -nh -> -nH
Hi Noèl! -nh and -nH are totally different. from wget 1.7.1 (I think the last version to offer both): `-nh' `--no-host-lookup' Disable the time-consuming DNS lookup of almost all hosts (*note Host Checking::). `-nH' `--no-host-directories' Disable generation of host-prefixed directories. By default, invoking Wget with `-r http://fly.srk.fer.hr/' will create a structure of directories beginning with `fly.srk.fer.hr/'. This option disables such behavior. For wget 1.8.x -nh became the default behavior. Switching back to host-look-up is not possible. I already complained that many old scripts now break and suggested that entering -nh at the command line would either be completely ignored or the user would be informed and wget executed nevertheless. Apparently this was not regarded as useful. CU Jens > > The option --no-host-directories > changed from -nh to -nH (v1.8.1). > > Is there a reason for this? > It breaks a lot of scripts when upgrading, > I think. > > Could this be changed back to -nh? > > Thank you. > > -- > Noèl Köthe > -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
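To illustrate the option that survived (-nH), a small sketch using the host from the quoted documentation:

  # default: files land under fly.srk.fer.hr/...
  wget -r http://fly.srk.fer.hr/
  # with -nH: no host-prefixed top-level directory is created
  wget -r -nH http://fly.srk.fer.hr/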
Re: spanning hosts
Hi! Ian wrote: > Well I only said the URLs specified on the command line or by the > --include-file option are always downloaded. I didn't intend this > to be interpreted as also applying to URLs which Wget finds while > examining the contents of the downloaded html files. At the moment, > the domain acceptance/rejection checks are only performed when > downloaded html files are examined for further URLs to be > downloaded (for the --recursive and --page-requisites options), > which is why it behaves as it does. Ah! Now I understand, thanks for explaining again. [host wildcards] > > -Dbar.com behaves strictly: www.bar.com, www2.bar.com > > -D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com > > -D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com, > > sex-bar.computer-dating.com [...] > It sounds like it should work okay. I'd prefer to let -Dbar.com > also match fubar.com for compatibility's sake. If you wanted to > match www.bar.com and www2.bar.com, but not www.fubar.com you > could use -D.bar.com, but that wouldn't work if you wanted to > match bar.com without the www (well, a leading . could be treated > as a special case). Sounds a bit more complicated to programme (that's why I did not suggest it), but I must admit I am a fan of backwards compatibility :) so your version sounds like a good idea. > It would be easiest and more consistent (currently) to use > "shell-globbing" wildcards (as used for the file-acceptance > rules) rather than grep/egrep-style wildcards. Well, you got me once again. Google found this page: http://www.mkssoftware.com/docs/man1/grep.1.asp Do I understand correctly that grep/egrep enables the user/programme to search files (strings/records?) for a string expression? While it appears (to me) to be more powerful than the mentioned wildcards, I do not see the compelling reason to use it, as I think that wildcard matching will work as well (apart from the consistency reason you mentioned). CU Jens
Re: spanning hosts
Howdy! > > I came across a crash caused by a cookie > > two days ago. I disabled cookies and it worked. > I'm hoping you had debug output on when it crashed, otherwise this > is a different crash to the one I already know about. Can you > confirm this, please? Yes, I had debug output on. > > wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d > > -R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html > > > > Result with 1.8.1 and 1.7.1 with -nh: > > audistory.com: Only index.html > > audistory.de: Everything > > audi100-online: only the first page > > kolaschnik.de: only the first page > > Yes, that's how I thought it would behave. Any URLs specified on > the command line or in a --include-file file are always downloaded > irregardless of the domain acceptance rules. Well, one page of a rejected URL is downloaded, not more. Whereas the only accepted domain audistory.de gets downloaded completely. Doesn't this differ from what you just said? > One of the changes you > desire is that the domain acceptance rules should apply to these > too, which sounds like a reasonable thing to expect. That's my impression, too (obviously ;) > > What I would have liked and expected: > > audistory.com: Everything > > audistory.de: Everything > > audi100-online: Everything > > kolaschnik.de: nothing > > That requires the first change and also different domain matching > rules. I don't think that should be changed without adding extra > options to do so. The --domains and --exclude-domains lists are > meant to be actual domain names. I.e. -Dbar.com is meant to match > bar.com and foo.bar.com, and it's just a happy accident that it > also matches fubar.com (perhaps it shouldn't, really). I think if > someone specified -Dbar.com and it matched > sex-bar.computer-dating.com, they might be a bit surprised! Agreed! How about introducing "wildcards" like -Dbar.com behaves strictly: www.bar.com, www2.bar.com -D*bar.com behaves like now: www.bar.com, www2.bar.com, www.foobar.com -D*bar.com* gets www.bar.com, www2.bar.com, www.foobar.com, sex-bar.computer-dating.com That would leave current command lines operational and introduce many possibilities without (too much) fuss. Or have I overlooked anything here? > > Independent from the the question how the string "audi" > > should be matched within the URL, I think rejected URLs > > should not be parsed or be retrieved. > > Well they are all parsed before it is decided whether to retrieve > them or not! Ooopsie again. /me looks up "parse" parse=analyse Yes, I understand now! Kind regards Jens
Re: spanning hosts: 2 Problems
Hi again, Ian and fellow wgeteers! > A debug log will be useful if you can produce one. Sure I (or wget) can and did. It is 60kB of text. Zipping? Attaching? > Also note that if receive cookies that expire around 2038 with > debugging on, the Windows version of Wget will crash! (This is a > known bug with a known fix, but not yet finalised in CVS.) Funny you mention that! I came across a crash caused by a cookie two days ago. I disabled cookies and it worked. Should have traced this a bit more. > > I just installed 1.7.1, which also works breadth-first. > (I think you mean depth-first.) *doh* /slaps forehead Of course, thanks. > used depth-first retrieval. There are advantages and disadvantages > with both types of retrieval. I understand, I followed (but not totally understood) the discussion back then. > > Of course, this is possible. > > I just had hoped that by combining > > -F -i url.html > > with domain acceptance would save me a lot of time. > > Oh, I think I see what your first complaint is now. I initially > assumed that your local html file was being served by a local HTTP > server rather than being fed to the -F -i options. Is your complaint really that URLs > supplied on the command line or via the > -i option are not subjected to the acceptance/rejection rules? That > does indeed seem to be the current behavior, but there is not > particular reason why we couldn't apply the tests to these URLs as > well as the URLs obtained through recursion. Well, you are confusing me a bit ;} Assume a file like <a href="http://www.audistory-nospam.com">1</a> <a href="http://www.audistory-nospam.de">2</a> <a href="http://www.audi100-online-nospam.de">3</a> <a href="http://www.kolaschnik-nospam.de">4</a> and a command line like wget -nc -x -r -l0 -t10 -H -Dstory.de,audi -o example.log -k -d -R.gif,.exe,*tn*,*thumb*,*small* -F -i example.html Result with 1.8.1 and 1.7.1 with -nh: audistory.com: Only index.html audistory.de: Everything audi100-online: only the first page kolaschnik.de: only the first page What I would have liked and expected: audistory.com: Everything audistory.de: Everything audi100-online: Everything kolaschnik.de: nothing Independent from the question how the string "audi" should be matched within the URL, I think rejected URLs should not be parsed or be retrieved. I hope I could articulate what I wanted to say :) CU Jens
Re: spanning hosts: 2 Problems
Hi Ian! [..] > It's probably worth noting that the comparisons between the -D > strings and the domains being followed (or not) is anchored at > the ends of the strings, i.e. "-Dfoo" matches "bar.foo" but not > "foo.bar". *doh* Thanks for the info. I thought it would work similarly to the acceptance of files. > > The first page of even the rejected hosts gets saved. > That sounds like a bug. Should I try to get a useful debug log? (It is Windows, so I do not know if it is helpful.) [depth first] > > Now, with downloading from many (20+) different servers, this is a bit > > frustrating, > > as I will probably have the first completely downloaded site in a few > > days... > > Would that be less of a problem if the first problem (first page > >from rejected domains) was fixed? Not really, the problems are quite different for me. > > > Is there any other way to work around this besides installing wget 1.6 > > (or even 1.5?) > No, I just installed 1.7.1, which also works breadth-first. I now have two wget versions, no problem for me. > but note that if you pass several starting URLs to Wget, it > will complete the first before moving on to the second. That also > works for the URLs in the file specified by the --input-file > parameter. I know, I have used --input-file a great deal over the last few days. It's great, but not really applicable in this circumstance, as I did/do not want to manually extract the links from the html page. > The other alternative is to run wget > several times in sequence with different starting URLs and restrictions, perhaps >using > the --timestamping or --no-clobber > options to avoid downloading things more than once. Of course, this is possible. I just had hoped that by combining -F -i url.html with domain acceptance would save me a lot of time. It now works okay -with 1.7.1- but a domain acceptance/rejection like I said would be helpful for me. But I reckon not for many other users (right?). CU Jens
spanning hosts: 2 Problems
Hi wgeteers! I am using wget to parse a local html file which has numerous links into the www. Now, I only want hosts that include certain strings like -H -Daudi,vw,online.de Two things I don't like in the way wget 1.8.1 works on windows: The first page of even the rejected hosts gets saved. This messes up my directory structure as I force directories (which is my default and normally useful) I am aware that wget has switched to breadth first (as opposed to depth-first) retrieval. Now, with downloading from many (20+) different servers, this is a bit frustrating, as I will probably have the first completely downloaded site in a few days... Is there any other way to work around this besides installing wget 1.6 (or even 1.5?) Thanks Jens
Re: (Fwd) Proposed new --unfollowed-links option for wget
Hi List! As a non-wget-programmer I also think that this option may be very useful. I'd be happy to see it wget soon :) Just thought to drop in some positive feedback :) CU Jens > > -u, --unfollowed-links=FILE log unfollowed links to FILE. > Nice. It sounds useful.
Re: maybe code from pavuk would help
Hi Noèl! (message CC changed to normal wget list) Rate-limiting is possible since wget 1.7.1 or so, please correct me if it was 1.8! Requests for "http post" pop up occasionally, but as far as I am concerned, I don't need it and I think it is not in the scope of wget currently. Filling out forms could probably be very useful for some users I guess. If it were possible without too much fuss, I would encourage this, even though I would not need it. BTW: Could you elaborate a bit more on the "..." part of your mail? BTW2: Why did you send this to the bug list? (insert multiple question marks here) CU Jens Noel Koethe wrote: > > Hello, > > I tested pavuk (http://www.pavuk.org/, GPL) and there are some features > I miss in wget: > > -supports HTTP POST requests > -can automaticaly fill forms from HTML documents and make POST or GET > requestes based on user input and form content > -you can limit transfer rate over network (speed throttling) > ... > > Maybe there is some code which could be used in wget.:) > So the wheel wouldn't invented twice. > > -- > Noèl Köthe
RE: Accept list
Hi Peter! > I was using 1.5.3 > I am getting 1.8.1 now... Good idea, but... > > --accept="patchdiag.xref,103566*,103603*,103612*,103618*" > > 112502.readme > > 112504-01.zip > > 112504.readme > > 112518-01.zip > > 112518.readme [snip] ...look at the file names you want: none of them includes 103*, they all start with 112*. So, wget works absolutely OK, I think. Or am I missing something here? A corrected accept list is sketched below. CU Jens -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
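For comparison, an accept list that would actually match the file names shown above might look like this (the URL is a placeholder and the patterns are only a sketch):

  # patterns now match the 112* files actually offered by the server
  wget -r -nd --accept="patchdiag.xref,112502*,112504*,112518*" ftp://patches.example.com/pub/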
Re: -H suggestion
Hi Hrvoje! > > requisites could be wherever they want. I mean, -p is already > > ignoring -np (since 1.8?), what I think is also very useful. > Since 1.8.1. I considered it a bit more "dangerous" to allow > downloading from just any host if the user has not allowed it > explicitly. You are of course right. > In either way, I was presented with a user interface problem. I > couldn't quite figure out how to arrange the options to allow for > three cases: Please don't hit me, if my suggestions do not follow wget naming conventions > * -p gets stuff from this host only, including requisites. I would make this the default -p behaviour, as it is the status quo. (Great band btw) > * -p gets stuff from this host only, but requisites may span hosts. How about --page-requisites-relaxed Too long? Do I understand correctly: --page-requisites-relaxed would let wget traverse only the base host, while the page requisites would travel to hosts specified after -H -Dhost1.com,host2.com right? > * everything may span hosts. Let -H ignore -p? Ah, no *doh* doesn't work. --page-requisites-open ? But how would wget --page-requisites-open -H -Dhost1.com,host2.com URL then be different from wget -H -Dhost1.com,host2.com URL ? And what if wget should travel host1.com and host2.com, but -p should only go to these two hosts and foo.com? Ok, I think this problem may be a bit constructed. And I am surely beginning to be confused at 3am. Sorry. CU Jens
Re: How does -P work?
Hi Herold! Thanks for the testing, I must admit, trying -nd did not occur to me :( I already have implemented a \ to / conversion in my wgetgui, but forgot to strip the trailing / (as Hrvoje suggested) *doh* Anyway, I would of course be happy to see a patch like you proposed, but I understand too little to judge where it belongs :} CU Jens http://www.JensRoesner.de/wgetgui/ > Note: tests done on NT4. W9x probably would behave different (even > worse). > starting from (for example) c:, with d: being another writable disk of > some kind, something like > wget -nd -P d:/dir http://www.previnet.it > does work as expected. > wget -nd -P d:\dir http://www.previnet.it > also does work as expected. > wget -P d:\dir http://www.previnet.it > did create a directory d:\@5Cdir and started from there, in other words > the \ is converted by wget since it doesn't recognize it as a valid > local directory separator. > wget -P d:/dir http://www.previnet.it > failed in a way or another for the impossibility to create the correct > directory or use it if already present. [snip]
Re: How does -P work?
Hi Hrvoje! > > Can I use -P (Directory prefix) to save files in a user-determinded > > folder on another drive under Windows? > > You should be able to do that. Try `-P C:/temp/'. Wget doesn't know > anything about windows backslashes, so maybe that's what made it fail. > The problem with / and \ was already solved, thx. The syntax folder/ is incorrect for wget on windows, it will try to save to folder//url :( Here is what I got with wget -nc -x -P c:/temp -r -l0 -p -np -t10 -d -o minusp.log http://www.jensroesner.de ### DEBUG output created by Wget 1.8.1-pre3 on Windows. Enqueuing http://www.jensroesner.de/ at depth 0 Queue count 1, maxcount 1. Dequeuing http://www.jensroesner.de/ at depth 0 Queue count 0, maxcount 1. --02:23:49-- http://www.jensroesner.de/ => `c:/temp/www.jensroesner.de/index.html' Resolving www.jensroesner.de... done. Caching www.jensroesner.de => 212.227.109.232 Connecting to www.jensroesner.de[212.227.109.232]:80... connected. Created socket 72. Releasing 009A0730 (new refcount 1). ---request begin--- GET / HTTP/1.0 User-Agent: Wget/1.8.1-pre3 Host: www.jensroesner.de Accept: */* Connection: Keep-Alive ---request end--- HTTP request sent, awaiting response... HTTP/1.1 200 OK Date: Mon, 14 Jan 2002 13:05:44 GMT Age: 801 Server: Apache/1.3.22 (Unix) Last-Modified: Sun, 28 Oct 2001 01:38:38 GMT Accept-Ranges: bytes Content-Type: text/html Content-Length: 5371 Etag: "7f088c-14fb-3bdb619e" Via: 1.1 NetCache (NetCache 4.1R5D2) Length: 5,371 [text/html] c:/temp/www.jensroesner.de: File existsc:/temp/www.jensroesner.de/index.html: No such file or directory Closing fd 72 Cannot write to `c:/temp/www.jensroesner.de/index.html' (No such file or directory). FINISHED --02:24:00-- Downloaded: 0 bytes in 0 files ### Of course I have a c:/temp or c:\temp dir, but even if not, wget should create one, right? CU Jens
Re: Suggestion on job size
Hi Fred! First, I think this would rather belong in the normal wget list, as I cannot see a bug here. Sorry to the bug tracers, I am posting to the normal wget List and cc-ing Fred, hope that is ok. To your first request: -Q (Quota) should do precisely what you want. I used it with -k and it worked very well. Or am I missing your point here? Your second wish is AFAIK not possible now. Maybe in the future wget could write the record of downloaded files in the appropriate directory. After exiting wget, this file could then be used to process all the files mentioned in it. Just an idea, I would normally not think that this option is an often requested one. HOWEVER: -K works (when I understand it correctly) on the fly, as it decides on the run, if the server file is newer, if a previously converted file exists and what to do. So, only -k would work after the download, right? CU Jens http://www.JensRoesner.de/wgetgui/ > It would be nice to have some way to limit the total size of any job, and > have it exit gracefully upon reaching that size, by completing the -k -K > process upon termination, so that what one has downloaded is "useful." A > switch that would set the total size of all downloads --total-size=600MB > would terminate the run when the total bytes downloaded reached 600 MB, and > process the -k -K. What one had already downloaded would then be properly > linked for viewing. > > Probably more difficult would be a way of terminating the run manually > (Ctrl-break??), but then being able to run the -k -K process on the > already-downloaded files. > > Fred Holmes
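A hedged sketch of the quota approach mentioned above (size and URL are placeholders; as far as I understand it, -Q is checked between files during recursive retrieval, so the file that crosses the limit is still finished before wget stops and the -k/-K link conversion runs at the end as usual):

  # stop recursing once roughly 600 MB have been fetched, then convert links for local viewing
  wget -r -k -K -Q 600m http://example.com/archive/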
How does -P work?
Hi! Can I use -P (Directory prefix) to save files in a user-determined folder on another drive under Windows? I tried -PC:\temp\ which does not work (I am starting from D:\) Also -P..\test\ would not save into the dir above the current one. So I changed the \ into / and it worked. However, I still could not save to another drive with -Pc:/temp Any way around this? Bug/Feature? Windows/Unix problem? CU Jens
Re: -nh broken 1.8
Hi! > > 2. Wouldn't it be a good idea to mention the > > deletion of the -nh option in a file? > > Maybe. What file do you have in mind? First and foremost the "news" file, but I think it would also not be misplaced in wget.html and/or wget.hlp /.info (whatever it is called on Unix systems). > > 3. on a different aspect: > > All command lines with -nh that were created before 1.8 are now > > non-functional, > Those command lines will need to be adjusted for the new Wget. This > is sad, but unavoidable. Wget's command line options don't change > every day, but they are not guaranteed to be cast in stone either. I don't expect them to last forever, I just meant that simply ignoring -nh in wget 1.8 would have been an easy way to avoid the hassle. I am of course thinking of my wGetGUI, where "no host look-up" is an option. So, I now either have to explain to every user (who of course won't read any manual) that s/he should only use this option with an old wget version, or I could simply delete the -nh option and say that it is not important enough for all the users of old wget versions. And then there is the problem if someone upgrades from an old wget to a new one, but keeps his/her old copy of wgetgui, which now of course produces invalid 1.8 command lines :( CU Jens http://www.jensroesner.de/wgetgui/
Re: -nh broken 1.8
Hi Hrvoje! > > -nh does not work in 1.8 latest windows binary. > > By not working I mean that it is not recognized as a valid parameter. > > (-nh is no-host look-up and with it on, > > two domain names pointing to the same IP are treated as different) > > You no longer need `-nh' to get that kind of behavior: it is now the > default. Ok, then three questions: 1. Is there now a way to "turn off -nh", so that wget does not distinguish between domain names of the same IP? Or is this option irrelevant given the net's current structure? 2. Wouldn't it be a good idea to mention the deletion of the -nh option in a file? Or was it already mentioned and I am too blind/stupid? 3. On a different aspect: all command lines with -nh that were created before 1.8 are now non-functional, except for the old versions of course. Would it be possible that new wget versions just ignore it while older versions still work? This would greatly enhance (forward) compatibility between different versions, something I would regard as at least desirable. CU Jens
-nh broken 1.8
Hi! I already posted this on the normal wget list, to which I am subscribed. Problem: -nh does not work in 1.8 latest windows binary. By not working I mean that it is not recognized as a valid parameter. (-nh is no-host look-up and with it on, two domain names pointing to the same IP are treated as different) I am not sure which version first had this problem, but 1.7 did not show it. I really would like to have this option back. Does anyone know where it is gone to? Maybe doing holidays? CU Jens http://www.jensroesner.de/wgetgui/
-nh -nH??
Hi wgeteers! I noticed that -nh (no host look-up) seems to be gone in 1.8.1. Is that right? At first I thought, "Oh, you fool, it is -nH, you mixed it up" But, obviously, these are two different options. I read the "news" file and the wget.hlp and wget.html but could not find an answer. I always thought that this option is quite important nowadays?! Any help appreciated. CU and a Merry Christmas Jens
Re: referer question
Hi Vladi! If you are using windows, you might try http://www.jensroesner.de/wgetgui/ it is a GUI for wGet written in VB 6.0. If you click on the checkbox "identify as browser", wGetGUI will create a command line like you want. I use it and it works for me. Hope this helps? CU Jens Vladi wrote: > is it possible (well I mean easy way:)) to make wget pass referer auto? > I mean for every url that wget tries to fetch to pass hostname as > referer. [snip] -- GMX - Die Kommunikationsplattform im Internet. http://www.gmx.net
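On the command line, something similar can be approximated with --referer plus a browser-like user agent (hedged sketch with placeholder URL; note that --referer sends one fixed Referer header for the whole run rather than a per-fetched-URL one):

  # pretend to be a browser: fixed Referer plus a browser-style User-Agent
  wget -r --referer="http://www.example.com/" -U "Mozilla/4.0 (compatible; MSIE 6.0)" http://www.example.com/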
wGetGUI now under the GPL
Hi guys&gals! ;) I just wanted to let you know, that with v0.5 of wGetGUI, it is now released under the GPL, so if you feel like modifying or laughing at the source code, you can now do so. CU Jens http://www.jensroesner.de/wgetgui
Re: Download problem
Hi! For all who cannot download the Windows binaries, they are now available through my site: http://www.jensroesner.de/wgetgui/data/wget20010605-17b.zip And while you are there, why not download wGetGUI v0.4? :) http://www.jensroesner.de/wgetgui If Heiko is reading this: May I just keep the file on my site? And make it available to the public? CU Jens
Re: Download problem
Hi Chad! Strange, it works for me with this link: http://space.tin.it/computer/hherold/wget20010605-17b.zip The old binary "1.6" is not available. If you cannot download it (have you tried with wGet? :), I can mail it to you, or if more people have the problem, add it temporarily to my site. CU Jens (w)get wGetGUI v0.4 at http://www.jensroesner.de/wgetgui > I'm still unable to download wget binary from > http://space.tin.it/computer/hherold/ for either 1.6 or 1.7 . Anyone have a > good link?
(complete) GUI for those Windows users
Hi there! First, let me introduce myself: I am studying mechanical engineering and for a lecture I am learning Visual Basic. I was looking for a non-brain-dead way to get used to it, and when a friend of mine told me that he finds wGet too difficult to use I just went *bing* So, look what I have done: http://www.jensroesner.de/wgetgui Yes, it is a GUI and yes, it is not as powerful as the command line execution. I understand that most people who will read this are Unix/Linux users and as such might have no use for my programme. However, I would like to encourage you to send me any tips and bug reports you find. As I have not yet subscribed to the wget list, I would appreciate a CC to my e-mail address. Thanks! Jens