Thank you. Updating to 1.19 fixed the problem. Version 1.12 came from the Scientific Linux 6 repository; I didn't realize it was so old. Installing 1.19 was easy: just ./configure; make; make install.
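For the record, here is roughly what that build looked like. This is a sketch, not a recipe: the tarball name and download URL are from memory, so check https://ftp.gnu.org/gnu/wget/ for the current release, and the last step needs root if you keep the default /usr/local prefix.

  # download and unpack the source (version/file name assumed; verify on the GNU ftp site)
  wget https://ftp.gnu.org/gnu/wget/wget-1.19.tar.gz
  tar xzf wget-1.19.tar.gz
  cd wget-1.19

  # standard autoconf build and install
  ./configure
  make
  make install    # run as root for the default /usr/local prefix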
Thanks again.

Daniel Feenberg

On Tue, May 15, 2018 at 5:34 AM, Darshit Shah <[email protected]> wrote:

> Hi,
>
> You are using a very old version of Wget. v1.12 was released in 2009 if I
> remember correctly.
>
> The current version of Wget doesn't seem to have any issues with the
> parsing of that robots.txt. I just tried it locally and it downloads no
> files at all.
>
> Please update your version of Wget.
>
> * Daniel Feenberg <[email protected]> [180514 16:51]:
> >
> > I have the following wget command line:
> >
> > wget -r http://wwwdev.nber.org/
> >
> > http://wwwdev.nber.org/robots.txt is:
> >
> > User-agent: *
> > Disallow: /
> >
> > User-Agent: W3C-checklink
> > Disallow:
> >
> > However wget fetches thousands of pages from wwwdev.nber.org. I would
> > have thought nothing would be found. (This is a demonstration, obviously
> > in real life I'd have a more detailed robots.txt to control the process).
> >
> > Obviously too, I don't understand something about wget or robots.txt.
> > Can anyone help me out?
> >
> > This is GNU Wget 1.12 built on linux-gnu.
> >
> > Thank you
> > Daniel Feenberg
> >
>
> --
> Thanking You,
> Darshit Shah
> PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
>
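PS: for anyone who wants to confirm that the newer build honors robots.txt, a quick check. The -d/--debug flag and the robots setting are standard wget options; the exact wording of the debug output (and hence the grep pattern below) is my assumption, so adjust as needed.

  # with robots.txt honored, the recursive crawl should stop almost immediately;
  # the debug output mentions the robots.txt fetch and the paths it excludes
  wget -d -r http://wwwdev.nber.org/ 2>&1 | grep -i robots

  # the opposite behavior, if you ever need to ignore robots.txt deliberately
  wget -e robots=off -r http://wwwdev.nber.org/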
