Hi Tejas,

Right, this was caused by the backup file. Thank you very much for
the support.
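
For anyone who hits the same problem, here is a minimal sketch of the cleanup. It assumes the seed directory is called urls and sits under the current directory (NUTCH_HOME in this thread), and that backups follow the gedit convention of a trailing "~". The function names and the SEED_DIR variable are made up for illustration; they are not part of Nutch.

```shell
#!/bin/sh
# Assumption: the seed directory is "urls" relative to where this is run.
# Override it with SEED_DIR=/some/other/dir if your layout differs.
SEED_DIR="${SEED_DIR:-urls}"

# Print every regular file directly inside directory $1 whose name ends in "~".
list_seed_backups() {
    for f in "$1"/*~; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Delete those same backup files; other files in $1 are left untouched.
remove_seed_backups() {
    for f in "$1"/*~; do
        if [ -f "$f" ]; then
            rm -f -- "$f"
        fi
    done
}

# By default only list the backups, so nothing is deleted by accident.
if [ -d "$SEED_DIR" ]; then
    list_seed_backups "$SEED_DIR"
fi
```

After checking the listed names, calling remove_seed_backups urls deletes them; a fresh crawl should then see only the URLs in the current seed file.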


On Thu, Dec 27, 2012 at 3:27 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:

> This might be the reason: You are using GEdit to edit the seeds file. It
> creates a backup of the old version of the file when changes are made to
> it. The backup file is hidden.
>
> Check the contents of the urls directory using this command: *ls -a urls*
> (to be executed from NUTCH_HOME. In your setup it's ~/nutch_new_setup)
>
> This might give you:
> *.  ..  seed.txt  seed.txt~*
>
> seed.txt, the updated version, will have
> http://localhost:8080/nutch-test-site/chi.html  while the backup version,
> seed.txt~ will have the sony.com and usc.edu urls. The second file is a
> hidden file.
>
> Nutch scans the "urls" directory and gets *all* the files inside it. Both
> files are getting picked up by Nutch, and hence you see the old URLs too.
> Delete the hidden file urls/seed.txt~ and try a fresh crawl.
>
> Thanks,
>  Tejas Patil
>
> On Wed, Dec 26, 2012 at 8:54 PM, Rajani Maski <rajinima...@gmail.com>
> wrote:
>
> >  http://localhost:8080/nutch-test-site/chi.html
> >
>
