Hi All,
I use wget often.
On some sites I cannot download files.
On one such site I found the file "robots.txt".
Is this file the cause of wget not downloading the files?
Is there any way I could circumvent this and download the website?
best regards
Mettavihari
--
> On some sites I cannot download files.
> On one such site I found the file "robots.txt".
> Is this file the cause of wget not downloading the files?
>
> Is there any way I could circumvent this and download the website?
Yep, you have to set the User-Agent value (which identifies the client to the server).
On Sat, Jun 08, 2002 at 10:30:02PM -0500, Amy Rupp wrote:
>> On some sites I cannot download files.
>> On one such site I found the file "robots.txt".
>> Is this file the cause of wget not downloading the files?
Why not just put "robots=off" in your .wgetrc?
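For anyone following along, a minimal sketch of what that looks like as a ~/.wgetrc entry (the comment is mine, not from the thread):

# ~/.wgetrc: do not fetch or honor robots.txt
robots = off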
> >> On some sites I cannot download files.
> >> On one such site I found the file "robots.txt".
> >> Is this file the cause of wget not downloading the files?
>
> Why not just put "robots=off" in your .wgetrc?
>
> Read the docs, people.
…the original question, and *MY* question still stands: if you set robots=off, does wget still read robots.txt?
robots=off does just that: wget neither reads nor honors robots.txt.
As for the User-Agent, most sites like to see a string with WinXX or IE or
Explorer in it.
If you are using Windows, go to a friend's web site and then ask them
to mail you the User-Agent string that shows up in their server logs.
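A concrete sketch of that advice, combining the robots and User-Agent suggestions (the User-Agent string and URL below are illustrative examples, not values posted in this thread):

wget -r -e robots=off \
     --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" \
     http://example.com/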
Hi
>>> Why not just put "robots=off" in your .wgetrc?
hey hey
the "robots.txt" didn't just appear in the website; someone's
put it there and thought about it. what's in there has a good reason.
you might be indexing old, doubled or invalid data, or y
Hi!
> >>> Why not just put "robots=off" in your .wgetrc?
> hey hey
> the "robots.txt" didn't just appear in the website; someone's
> put it there and thought about it. what's in there has a good reason.
Wll, from my own experience, the
Hi!
>
> > >>> Why not just put "robots=off" in your .wgetrc?
> > hey hey
> > the "robots.txt" didn't just appear in the website; someone's
> > put it there and thought about it. what's in there has a good reason.
> Wee
> From: Jens Rösner [mailto:[EMAIL PROTECTED]]
> Subject: Re: robots.txt
> > or your indexing mech might loop on it, or crash the
> > server. Who knows.
> I have yet to find a site which forces wGet into a "loop" as you said.
> Others on the list can probably estimate…
Hi
A little OT, just a last response
>> > >>> Why not just put "robots=off" in your .wgetrc?
>> > hey hey
>> > the "robots.txt" didn't just appear in the website
[snip]
>> > The only reason is
>> > you might be inde
…machine when requested. They're up for testing.
OK, I am sorry, I always thought that when something like this happens,
the person causing the "loop" would suffer most and therefore be punished
directly.
I did not imagine that the server could really go down in case of that
constellation.
Is there any particular reason we don't have an option to ignore robots.txt?
On Wed, 18 Jul 2007, Josh Williams wrote:
> Is there any particular reason we don't have an option to ignore robots.txt?
There is no particular reason, so we do.
Maciej
On 7/18/07, Maciej W. Rozycki <[EMAIL PROTECTED]> wrote:
There is no particular reason, so we do.
As far as I can tell, there's nothing in the man page about it.
From: Josh Williams
> As far as I can tell, there's nothing in the man page about it.
It's pretty well hidden.
-e robots=off
At this point, I normally just grind my teeth instead of complaining
about the differences between the command-line options and the commands
in the ".wgetrc" startup file.
Steven M. Schweda wrote:
> From: Josh Williams
>
>> As far as I can tell, there's nothing in the man page about it.
>
> It's pretty well hidden.
>
> -e robots=off
>
> At this point, I normally just grind my teeth instead of complaining
>
No, I don't think we should nor do I think use of those features is "sneaky".
With regard to robots.txt, people use it when they don't want *automated*
spiders crawling through their sites. A well-crafted wget command that
downloads selected information from a site without re
>> …? And what should
>> those warnings be? Something like, "Use of this feature may help
>> you download files from which wget would otherwise be blocked, but
>> it's kind of sneaky, and web site administrators may get upset and
>> block your IP address if they discover it"?
Micah Cowan wrote:
> Don't we already follow typical etiquette by default? Or do you mean
> that to override non-default settings in the rcfile or whatnot?
We don't automatically use a --wait time between requests. I'm not sure what
other "nice" options we'd want to make easily available, but that is one.
Tony Lewis wrote:
> Micah Cowan wrote:
>
>> Don't we already follow typical etiquette by default? Or do you
>> mean that to override non-default settings in the rcfile or
>> whatnot?
>
> We don't automatically use a --wait time between requests. I'
Micah Cowan <[EMAIL PROTECTED]> writes:
> I think we should either be a "stub", or a fairly complete "manual"
> (and agree that the latter seems preferable); nothing half-way
> between: what we have now is a fairly incomplete manual.
Converting from Info to man is harder than it may seem. The script
that does it now is basically a hack that doesn't really work well
even for the small part of the manual that it tries to cover.
On Wed, 18 Jul 2007, Micah Cowan wrote:
The manpage doesn't need to give as detailed explanations as the info manual
(though, as it's auto-generated from the info manual, this could be hard to
avoid); but it should fully describe essential features.
I know GNU projects for some reason go with Info as the primary documentation format…
Daniel Stenberg wrote:
On Wed, 18 Jul 2007, Micah Cowan wrote:
The manpage doesn't need to give as detailed explanations as the info
manual (though, as it's auto-generated from the info manual, this
could be hard to avoid); but it should fully describe essential
features.
I know GNU project
Dear Gnu Developers,
We just ran into a situation where we had to "spider" a site of our
own on an outsourced service because the company was going out of
business. Because wget respects the robots.txt file, however, we
could not get an archive made until we had the outsourced company
change its robots.txt.
Hi,
I am using wget (1.10, Debian Sarge) with the -p option to download web
pages for offline reading, e.g.
wget -p -k www.heise.de/index.html
This site uses icons on that page from the /icons path, which is also
listed in robots.txt. Browsers don't care, but wget therefore skips them.
Hrvoje Niksic wrote:
> Micah Cowan <[EMAIL PROTECTED]> writes:
>
>> I think we should either be a "stub", or a fairly complete "manual"
>> (and agree that the latter seems preferable); nothing half-way
>> between: what we have now is a fairly incomplete manual.
Daniel Stenberg wrote:
> On Wed, 18 Jul 2007, Micah Cowan wrote:
>
>> The manpage doesn't need to give as detailed explanations as the info
>> manual (though, as it's auto-generated from the info manual, this
>> could be hard to avoid); but it should fully describe essential features.
-e robots=off
Jon W. Backstrom wrote:
> Dear Gnu Developers,
>
> We just ran into a situation where we had to "spider" a site of our
> own on an outsourced service because the company was going out of
> business. Because wget respects the robots.txt file, however, we
> could not get an archive made […]
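For anyone in the same archive-before-shutdown situation, a sketch of a full mirroring run built around that option (placeholder URL; the exact option mix is a suggestion, not something prescribed in this thread):

wget --mirror --page-requisites --convert-links -e robots=off http://example.com/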
Thomas Boerner wrote:
> Is this behaviour: "robots.txt takes precedence over -p" a bug or
> a feature?
It is a feature. If you want to ignore robots.txt, use this command line:
wget -p -k www.heise.de/index.html -e robots=off
Tony
On Sunday 10 July 2005 09:52 am, Tony Lewis wrote:
> Thomas Boerner wrote:
> > Is this behaviour: "robots.txt takes precedence over -p" a bug or
> > a feature?
>
> It is a feature. If you want to ignore robots.txt, use this command line:
>
> wget -p -k
I hope that doesn't happen. While respecting robots.txt is not an
absolute requirement, it is considered polite. I would not want the
default behavior of wget to be considered impolite.
Mark Post
On Monday 08 August 2005 07:30 pm, Post, Mark K wrote:
> I hope that doesn't happen. While respecting robots.txt is not an
> absolute requirement, it is considered polite. I would not want the
> default behavior of wget to be considered impolite.
IMVHO, Hrvoje has a good point when…
I would say the analogy is closer to a very rabid person operating a web
browser. I've never been greatly inconvenienced by having to re-run a
download while ignoring the robots.txt file. As I said, respecting
robots.txt is not a requirement, but it is polite. I prefer my tools to
be polite.
Ignoring robots.txt may help reduce frustration for users who
aren't familiar with robots.txt and can't figure out why the pages they want
aren't downloading.
The problem with trying to define a default behavior for wget is that
it lies somewhere between a web crawler and an interactive browser.
Hi:
I have used the reject = option successfully in the
startup file. This was for files like .js (JavaScript)
etc. However, it seems that I am unable to make it
work for rejecting robots.txt files. I used:
reject = "robots*"
I even tried reject = "robots.txt"
Well...he
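If the goal is simply to keep wget from requesting robots.txt at all, the robots=off approach discussed elsewhere in these threads applies here as well; a sketch with a placeholder URL:

wget -r -e robots=off http://example.com/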
Micah Cowan <[EMAIL PROTECTED]> writes:
>> Converting from Info to man is harder than it may seem. The script
>> that does it now is basically a hack that doesn't really work well
>> even for the small part of the manual that it tries to cover.
>
> I'd noticed. :)
>
> I haven't looked at the script…
> Subject: Man pages [Re: ignoring robots.txt]
> Daniel Stenberg wrote:
> > On Wed, 18 Jul 2007, Micah Cowan wrote:
> >
> >> The manpage doesn't need to give as detailed explanations
>
Christopher G. Lewis wrote:
> Micah et al. -
>
> Just for an FYI - the whole texi->info, texi->html and
> (texi->rtf->hlp) is *very* fragile in the windows world. You actually
> have to download a *very* old version of makeinfo (1.68, not even o
Hello,
is it possible to get a new option to disable the use
of robots.txt (--norobots)?
for example:
I want to mirror some parts of http://ftp.de.debian.org/debian/
but the admin has a robots.txt:
http://ftp.de.debian.org/robots.txt
User-agent: *
Disallow: /
I think he wants to protect his…
Using wget 1.8.2:
$ wget --page-requisites http://news.com.com
...fails to retrieve most of the files that are required to properly
render the HTML document, because they are forbidden by
http://news.com.com/robots.txt .
I think that use of --page-requisites implies that wget is being used
to fetch a page for viewing rather than to crawl a site, so robots.txt
arguably should not block the page's requisites.
Hi,
I am mirroring a friendly site that excludes robots in general but
is supposed to allow my "FriendlyMirror" using wget.
For this purpose I asked the webadmin to set up his robots.txt as follows:
User-agent: FriendlyMirror
Disallow:
User-agent: *
Disallow: /
Starting Wget by
w
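Presumably the invocation looked something like the sketch below (placeholder URL). Whether wget actually matches its --user-agent value against the User-agent records in robots.txt is exactly what a patch mentioned later in this archive addresses, so this may not behave as hoped on older versions:

wget --mirror --user-agent=FriendlyMirror http://example.com/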
1. Web sites sometimes use .php or other crap (.asp) instead of .htm or .html.
They should all be hurt with a cluestick, but for the moment wget -nc
doesn't cooperate with them. Maybe rely on Content-Type instead.
2. when wget -r -H and hitting https://www.cybersitter.com/robots.txt
> ... Perhaps it should be one of those things that one can do
> oneself if one must but is generally frowned upon (like making a
> version of wget that ignores robots.txt).
Damn. I was only joking about ignoring robots.txt, but now I'm
thinking[1] there may be good reasons to do so.
Hi Noel!
Actually, this is possible.
I think at least since 1.7, probably even longer.
Cut from the doc:
robots = on/off
Use (or not) /robots.txt file (see Robots.). Be sure to know what you
are doing before changing the default (which is on).
Please note:
This is a (.)wgetrc-only command.
You have to put it into your .wgetrc file.
On Wed, 10 Apr 2002, Jens Rösner wrote:
Hello Jens,
> > is it possible to get a new option to disable the usage
> > of robots.txt (--norobots)?
> I think at least since 1.7, probably even longer.
> Cut from the doc:
> robots = on/off
> Use (or not) /robots.txt file (
Noel Koethe <[EMAIL PROTECTED]> writes:
> OK, got it. But is it possible to get this option as a switch for
> using it on the command line?
Yes, like this:
wget -erobots=off ...
Hi!
Just to be complete, thanks to Hrvoje's tip,
I was able to find
-e command
--execute command
Execute command as if it were a part of .wgetrc (see Startup File.).
A command thus invoked will be executed after the
commands in .wgetrc, thus taking precedence over them.
I always wondered about that.
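A small illustration of that precedence, with purely hypothetical settings: if your .wgetrc contains

robots = on

then a single run can still override it from the command line:

wget -e robots=off -r http://example.com/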
This patch seems to do user-agent checks correctly (it might have been
broken previously) with a correction to a string comparison macro.
The patch also uses the value of the --user-agent option when enforcing
robots.txt rules.
This patch is against CVS; more on that here:
http://www.gnu.org
Tony Godshall wrote:
>> ... Perhaps it should be one of those things that one can do
>> oneself if one must but is generally frowned upon (like making a
>> version of wget that ignores robots.txt).
>
> Damn. I was only joking about ignoring robots.txt […]
> Tony Godshall wrote:
> >> ... Perhaps it should be one of those things that one can do
> >> oneself if one must but is generally frowned upon (like making a
> >> version of wget that ignores robots.txt).
> >
> > Damn. I was only joking about ignor
Hello,
Given the following robots.txt file:
User-agent: *
Disallow: /folder/bob.php?
...
One would expect that if wget tries to download a link to
/folder/bob.php?a=1 it would exclude it because of that robots rule
- but it doesn't (my reading of the RFC indicates that it should).
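A quick illustration of the prefix-match reading of the spec, using the paths from this report:

Disallow: /folder/bob.php?   (the rule is a path prefix)
/folder/bob.php?a=1          starts with that prefix, so it should be excluded
/folder/bob.html             does not start with it, so it may be fetched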