robots.txt

2002-06-08 Thread rsync
Hi All, I use wget often. On some sites I cannot download files. On one such site I found the file "robots.txt". Is this file the cause of wget not downloading the files? Is there any way I could circumvent this and download this website? Best regards, Mettavihari

Re: robots.txt

2002-06-08 Thread Amy Rupp
> On some sites I cannot download files. > On one such site I found the file "robots.txt". > Is this file the cause of wget not downloading the files? > > Is there any way I could circumvent this and download this website? Yep, you have to set the User-Agent value (which i
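
A minimal sketch of that suggestion; the URL and the browser-style User-Agent string below are placeholders, not values from the thread:

  # Present a browser-like identity so the server does not treat the request as a robot
  wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" \
       -r http://example.com/some/page.html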

Re: robots.txt

2002-06-08 Thread Alan E
On Sat, Jun 08, 2002 at 10:30:02PM -0500, Amy Rupp wrote: >> On some sites I cannot download files. >> On one such site I found the file "robots.txt". >> Is this file the cause of wget not downloading the files? Why not just put "robots=off" in your .wgetrc?
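
For reference, the .wgetrc setting being suggested looks like this (a sketch; the file normally lives in the user's home directory):

  # ~/.wgetrc
  # Do not fetch or honor /robots.txt (the default is "on")
  robots = off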

Re: robots.txt

2002-06-08 Thread Amy Rupp
> >> On some sites I cannot download files. > >> On one such site I found this file "robots.txt". > >> Is this file the cause for wget not downloading the files. > > Why not just put "robots=off" in your .wgetrc? > > Read the docs, peop

Re: robots.txt

2002-06-08 Thread Alan E
...original question, and *MY* question still stands: robots=off does just that; it neither reads nor honors robots.txt. As for the User-Agent, most sites like to see a string with WinXX or IE or Explorer in it. If you are using Windows, go to a friend's web site and then ask them to mail y

Re: robots.txt

2002-06-09 Thread Pike
Hi >>> Why not just put "robots=off" in your .wgetrc? hey hey, the "robots.txt" didn't just appear on the website; someone put it there and thought about it. What's in there is there for a good reason. You might be indexing old, duplicated or invalid data, or y

Re: robots.txt

2002-06-09 Thread Jens Rösner
Hi! > >>> Why not just put "robots=off" in your .wgetrc? > hey hey > the "robots.txt" didn't just appear in the website; someone's > put it there and thought about it. what's in there has a good reason. Well, from my own experience, the

Re: robots.txt

2002-06-09 Thread rsync
Hi! > > > >>> Why not just put "robots=off" in your .wgetrc? > > hey hey > > the "robots.txt" didn't just appear in the website; someone's > > put it there and thought about it. what's in there has a good reason. > Wee

RE: robots.txt

2002-06-10 Thread Herold Heiko
> From: Jens Rösner [mailto:[EMAIL PROTECTED]] > Subject: Re: robots.txt > > or your indexing mech might loop on it, or crash the > server. who knows. > I have yet to find a site which forces wget into a "loop" as you said. > Others on the list probably can esti

Re: robots.txt

2002-06-10 Thread Pike
Hi A little OT, just a last response >> > >>> Why not just put "robots=off" in your .wgetrc? >> > hey hey >> > the "robots.txt" didn't just appear in the website [snip] >> > The only reason is >> > you might be inde

Re: robots.txt

2002-06-10 Thread Jens Rösner
...machine when requested. They're up for testing. OK, I am sorry; I always thought that when something like this happens, the person causing the "loop" would suffer most and therefore be punished directly. I did not imagine that the server could really go down in case of that constellat

ignoring robots.txt

2007-07-18 Thread Josh Williams
Is there any particular reason we don't have an option to ignore robots.txt?

Re: ignoring robots.txt

2007-07-18 Thread Maciej W. Rozycki
On Wed, 18 Jul 2007, Josh Williams wrote: > Is there any particular reason we don't have an option to ignore robots.txt? There is no particular reason, so we do. Maciej

Re: ignoring robots.txt

2007-07-18 Thread Josh Williams
On 7/18/07, Maciej W. Rozycki <[EMAIL PROTECTED]> wrote: There is no particular reason, so we do. As far as I can tell, there's nothing in the man page about it.

Re: ignoring robots.txt

2007-07-18 Thread Steven M. Schweda
From: Josh Williams > As far as I can tell, there's nothing in the man page about it. It's pretty well hidden. -e robots=off At this point, I normally just grind my teeth instead of complaining about the differences between the command-line options and the commands in the ".wgetrc" sta

Re: ignoring robots.txt

2007-07-18 Thread Micah Cowan
Steven M. Schweda wrote: > From: Josh Williams > >> As far as I can tell, there's nothing in the man page about it. > >It's pretty well hidden. > > -e robots=off > > At this point, I normally just grind my teeth instead of complaining >

RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
...? No, I don't think we should, nor do I think use of those features is "sneaky". With regard to robots.txt, people use it when they don't want *automated* spiders crawling through their sites. A well-crafted wget command that downloads selected information from a site without re

Re: ignoring robots.txt

2007-07-18 Thread Micah Cowan
nt? And what should >> those warnings be? Something like, "Use of this feature may help >> you download files from which wget would otherwise be blocked, but >> it's kind of sneaky, and web site administrators may get upset and >> block your IP address if they discov

RE: ignoring robots.txt

2007-07-18 Thread Tony Lewis
Micah Cowan wrote: > Don't we already follow typical etiquette by default? Or do you mean > that to override non-default settings in the rcfile or whatnot? We don't automatically use a --wait time between requests. I'm not sure what other "nice" options we'd want to make easily available, but th
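
A sketch of the kind of "nice" options under discussion; the URL and the exact delay and rate values are illustrative only:

  # Recursive download that paces and throttles its requests instead of hammering the server
  wget --wait=2 --random-wait --limit-rate=50k -r http://example.com/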

Re: ignoring robots.txt

2007-07-18 Thread Micah Cowan
Tony Lewis wrote: > Micah Cowan wrote: > >> Don't we already follow typical etiquette by default? Or do you >> mean that to override non-default settings in the rcfile or >> whatnot? > > We don't automatically use a --wait time between requests. I'

Re: ignoring robots.txt

2007-07-18 Thread Hrvoje Niksic
Micah Cowan <[EMAIL PROTECTED]> writes: > I think we should either be a "stub", or a fairly complete "manual" > (and agree that the latter seems preferable); nothing half-way > between: what we have now is a fairly incomplete manual. Converting from Info to man is harder than it may seem. The sc

Re: ignoring robots.txt

2007-07-19 Thread Daniel Stenberg
On Wed, 18 Jul 2007, Micah Cowan wrote: The manpage doesn't need to give as detailed explanations as the info manual (though, as it's auto-generated from the info manual, this could be hard to avoid); but it should fully describe essential features. I know GNU projects for some reason go with

Re: ignoring robots.txt

2007-07-19 Thread Andreas Pettersson
Daniel Stenberg wrote: On Wed, 18 Jul 2007, Micah Cowan wrote: The manpage doesn't need to give as detailed explanations as the info manual (though, as it's auto-generated from the info manual, this could be hard to avoid); but it should fully describe essential features. I know GNU project

WGET and the robots.txt file...

2002-09-11 Thread Jon W. Backstrom
Dear Gnu Developers, We just ran into a situation where we had to "spider" a site of our own on an outsourced service because the company was going out of business. Because wget respects the robots.txt file, however, we could not get an archive made until we had the outsourced comp

robots.txt takes precedence over -p

2005-07-10 Thread Thomas Boerner
Hi, I am using wget (1.10, Debian Sarge) with the -p option to download web pages for offline reading, e.g. wget -p -k www.heise.de/index.html This site uses icons on that page from the /icons path, which is also listed in robots.txt. Browsers don't care, but wget therefor

Man pages [Re: ignoring robots.txt]

2007-07-18 Thread Micah Cowan
Hrvoje Niksic wrote: > Micah Cowan <[EMAIL PROTECTED]> writes: > >> I think we should either be a "stub", or a fairly complete "manual" >> (and agree that the latter seems preferable); nothing half-way >> between: what we have now is a fairly incomp

Man pages [Re: ignoring robots.txt]

2007-07-19 Thread Micah Cowan
Daniel Stenberg wrote: > On Wed, 18 Jul 2007, Micah Cowan wrote: > >> The manpage doesn't need to give as detailed explanations as the info >> manual (though, as it's auto-generated from the info manual, this >> could be hard to avoid); but it shoul

Re: WGET and the robots.txt file...

2002-09-11 Thread Max Bowsher
-e robots=off Jon W. Backstrom wrote: > Dear Gnu Developers, > > We just ran into a situation where we had to "spider" a site of our > own on an outsourced service because the company was going out of > business. Because wget respects the robots.txt file, however, we

RE: robots.txt takes precedence over -p

2005-07-10 Thread Tony Lewis
Thomas Boerner wrote: > Is this behaviour: "robots.txt takes precedence over -p" a bug or > a feature? It is a feature. If you want to ignore robots.txt, use this command line: wget -p -k www.heise.de/index.html -e robots=off Tony

Re: robots.txt takes precedence over -p

2005-08-08 Thread Mauro Tortonesi
On Sunday 10 July 2005 09:52 am, Tony Lewis wrote: > Thomas Boerner wrote: > > Is this behaviour: "robots.txt takes precedence over -p" a bug or > > a feature? > > It is a feature. If you want to ignore robots.txt, use this command line: > > wget -p -k

RE: robots.txt takes precedence over -p

2005-08-08 Thread Post, Mark K
I hope that doesn't happen. While respecting robots.txt is not an absolute requirement, it is considered polite. I would not want the default behavior of wget to be considered impolite. Mark Post

Re: robots.txt takes precedence over -p

2005-08-08 Thread Mauro Tortonesi
On Monday 08 August 2005 07:30 pm, Post, Mark K wrote: > I hope that doesn't happen. While respecting robots.txt is not an > absolute requirement, it is considered polite. I would not want the > default behavior of wget to be considered impolite. IMVHO, Hrvoje has a good point whe

RE: robots.txt takes precedence over -p

2005-08-08 Thread Post, Mark K
I would say the analogy is closer to a very rabid person operating a web browser. I've never been greatly inconvenienced by having to re-run a download while ignoring the robots.txt file. As I said, respecting robots.txt is not a requirement, but it is polite. I prefer my tools to be p

Re: robots.txt takes precedence over -p

2005-08-09 Thread Frank McCown
Ignoring robots.txt may help reduce frustration when users who aren't familiar with robots.txt can't figure out why the pages they want aren't downloading. The problem with trying to define a default behavior for wget is that it lies somewhere between a web crawler

Reject Option - especially for robots.txt files

2005-10-04 Thread Rahul Joshi
Hi: I have used the reject = option successfully in the startup file. This was for files like .js (JavaScript) etc. However, it seems that I am unable to make it work for rejecting robots.txt files. I used: reject = "robots*" I even tried reject = "robots.txt" Well...he

Re: Man pages [Re: ignoring robots.txt]

2007-07-18 Thread Hrvoje Niksic
Micah Cowan <[EMAIL PROTECTED]> writes: >> Converting from Info to man is harder than it may seem. The script >> that does it now is basically a hack that doesn't really work well >> even for the small part of the manual that it tries to cover. > > I'd noticed. :) > > I haven't looked at the scri

RE: Man pages [Re: ignoring robots.txt]

2007-07-19 Thread Christopher G. Lewis
> Subject: Man pages [Re: ignoring robots.txt] > > Daniel Stenberg wrote: > > On Wed, 18 Jul 2007, Micah Cowan wrote: > > > >> The manpage doesn't need to give as detailed explanations >

Re: Man pages [Re: ignoring robots.txt]

2007-07-19 Thread Micah Cowan
Christopher G. Lewis wrote: > Micah et al. - > > Just for an FYI - the whole texi->info, texi->html and > (texi->rtf->hlp) is *very* fragile in the windows world. You actually > have to download a *very* old version of makeinfo (1.68, not even o

feature wish: switch to disable robots.txt usage

2002-04-10 Thread Noel Koethe
Hello, is it possible to get a new option to disable the usage of robots.txt (--norobots)? For example: I want to mirror some parts of http://ftp.de.debian.org/debian/ but the admin has a robots.txt http://ftp.de.debian.org/robots.txt User-agent: * Disallow: / I think he wants to protect his
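
As the follow-ups below point out, the existing wgetrc "robots" command already allows this from the command line; a sketch only, using the URL from the message:

  # Mirror part of the archive without ascending to the parent directory,
  # overriding the site's blanket "Disallow: /" for this run
  wget -r -np -e robots=off http://ftp.de.debian.org/debian/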

wget bug (?): --page-requisites should supercede robots.txt

2002-09-22 Thread Jamie Flournoy
Using wget 1.8.2: $ wget --page-requisites http://news.com.com ...fails to retrieve most of the files that are required to properly render the HTML document, because they are forbidden by http://news.com.com/robots.txt. I think that use of --page-requisites implies that wget is being used
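
Unless -p is changed to imply this, a workaround along these lines should fetch the requisites; a sketch, not a statement on etiquette:

  # Fetch the page plus the images/CSS it needs, ignoring robots.txt for this run
  wget --page-requisites -e robots=off http://news.com.com/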

using user-agent to identify for robots.txt

2003-05-28 Thread Christian von Ferber
Hi, I am mirroring a friendly site that excludes robots in general but is supposed to allow my "FriendlyMirror", which uses wget. For this purpose I asked the webadmin to set up his robots.txt as follows: User-agent: FriendlyMirror Disallow: User-agent: * Disallow: / Starting Wget by w
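
The invocation presumably looks roughly like the sketch below (example.com stands in for the unnamed site; note that whether wget matches robots.txt rules against a custom agent string depends on the version, which is what the patch later in this thread addresses):

  # Identify as "FriendlyMirror" so the permissive robots.txt entry should apply
  wget --user-agent="FriendlyMirror" -r http://example.com/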

wget PHP / hang on https://www.cybersitter.com/robots.txt

2004-11-22 Thread johannes . schwabe
1. Web sites sometimes use .php or other crap (.asp) instead of .htm or .html. They should all be hurt with a cluestick, but for the moment wget -nc doesn't cooperate with them. Maybe rely on Content-Type instead. 2. When running wget -r -H and hitting https://www.cybersitter.com/robots.txt

Ignoring robots.txt [was Re: wget default behavior...]

2007-10-17 Thread Tony Godshall
> ... Perhaps it should be one of those things that one can do > oneself if one must but is generally frowned upon (like making a > version of wget that ignores robots.txt). Damn. I was only joking about ignoring robots.txt, but now I'm thinking[1] there may be good reasons to do s

Re: feature wish: switch to disable robots.txt usage

2002-04-10 Thread Jens Rösner
Hi Noel! Actually, this is possible. I think at least since 1.7, probably even longer. Cut from the doc: robots = on/off Use (or not) /robots.txt file (see Robots.). Be sure to know what you are doing before changing the default (which is on). Please note: This is a (.)wgetrc-only command. You

Re: feature wish: switch to disable robots.txt usage

2002-04-10 Thread Noel Koethe
On Wed, 10 Apr 2002, Jens Rösner wrote: Hello Jens, > > is it possible to get a new option to disable the usage > > of robots.txt (--norobots)? > I think at least since 1.7, probably even longer. > Cut from the doc: > robots = on/off > Use (or not) /robots.txt file (

Re: feature wish: switch to disable robots.txt usage

2002-04-10 Thread Hrvoje Niksic
Noel Koethe <[EMAIL PROTECTED]> writes: > Ok got it. But it is possible to get this option as a switch for > using it on the command line? Yes, like this: wget -erobots=off ...

Re: feature wish: switch to disable robots.txt usage

2002-04-10 Thread Jens Rösner
Hi! Just to be complete, thanks to Hrvoje's tip, I was able to find -e command --execute command Execute command as if it were a part of .wgetrc (see Startup File.). A command thus invoked will be executed after the commands in .wgetrc, thus taking precedence over them. I always wondered ab
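
An illustration of that precedence, assuming a .wgetrc that keeps the default robots setting:

  # ~/.wgetrc contains:  robots = on   (the default)
  # The -e command is executed after .wgetrc, so it takes precedence for this run:
  wget -e robots=off -r http://example.com/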

Re: using user-agent to identify for robots.txt

2003-05-30 Thread Aaron S. Hawley
This patch seems to do user-agent checks correctly (it might have been broken previously) with a correction to a string comparison macro. The patch also uses the value of the --user-agent option when enforcing robots.txt rules. This patch is against CVS; more on that here: http://www.gnu.org

Re: Ignoring robots.txt [was Re: wget default behavior...]

2007-10-17 Thread Micah Cowan
Tony Godshall wrote: >> ... Perhaps it should be one of those things that one can do >> oneself if one must but is generally frowned upon (like making a >> version of wget that ignores robots.txt). > > Damn. I was on

Re: Ignoring robots.txt [was Re: wget default behavior...]

2007-10-17 Thread Tony Godshall
> Tony Godshall wrote: > >> ... Perhaps it should be one of those things that one can do > >> oneself if one must but is generally frowned upon (like making a > >> version of wget that ignores robots.txt). > > > > Damn. I was only joking about ignor

URLs such as "site.com/folder/bob.php?a=1" and robots.txt problem

2006-06-08 Thread henka
Hello, Given the following robots.txt file: User-agent: * Disallow: /folder/bob.php? ... One would expect that if wget tries to download a link to /folder/bob.php?a=1, it would exclude it because of the robots rule line - but it doesn't (my reading of the RFC indicates that it s

URLs such as "site.com/folder/bob.php?a=1" and robots.txt problem

2006-07-05 Thread henka
Hello, Given the following robots.txt file: User-agent: * Disallow: /folder/bob.php? ... One would expect that if wget tries to download a link to /folder/bob.php?a=1, it would exclude it because of the robots rule line - but it doesn't (my reading of the RFC indicates that it s