Hi Tejas,

Right, this was caused by the backup files. Thank you very much for the support.
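
For anyone who finds this thread later, the check and cleanup described in the quoted message can be sketched roughly as below. This is only a sketch under assumptions: it rebuilds the situation in a scratch directory rather than touching a real Nutch setup, the `work` directory and the two old seed URLs are placeholders (the thread only names sony.com and usc.edu, not the exact URLs), and it assumes a `find` that supports `-maxdepth` (GNU and BSD both do).

```shell
#!/bin/sh
# Scratch directory standing in for NUTCH_HOME (hypothetical; not a real setup).
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Recreate the layout from the thread: the current seed file plus the
# "~" backup that the editor left behind with the old URLs.
mkdir "$work/urls"
echo "http://localhost:8080/nutch-test-site/chi.html" > "$work/urls/seed.txt"
# Placeholder URLs; the thread only says "the sony.com and usc.edu urls".
printf '%s\n' "http://www.sony.com/" "http://www.usc.edu/" > "$work/urls/seed.txt~"

# List everything in the seed directory, including editor leftovers.
ls -a "$work/urls"

# Delete editor backup files (names ending in "~") at the top level only.
find "$work/urls" -maxdepth 1 -type f -name '*~' -exec rm -- {} +

# Only seed.txt should remain in the seed directory now.
remaining=$(ls -A "$work/urls")
echo "$remaining"
```

On a real setup the equivalent would be to run `ls -a urls` from NUTCH_HOME, delete `urls/seed.txt~` if it shows up, and then start a fresh crawl.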

On Thu, Dec 27, 2012 at 3:27 PM, Tejas Patil <tejas.patil...@gmail.com> wrote:

> This might be the reason: You are using GEdit to edit the seeds file. It
> creates a backup of the old version of the file when changes are made to
> it. The backup file is hidden.
>
> Check the contents of the urls directory using this command: *ls -a urls*
> (to be executed from NUTCH_HOME; in your setup it's ~/nutch_new_setup).
>
> This might give you:
> *. .. seed.txt seed.txt~*
>
> seed.txt, the updated version, will have
> http://localhost:8080/nutch-test-site/chi.html while the backup version,
> seed.txt~, will have the sony.com and usc.edu urls. The second file is a
> hidden file.
>
> Nutch scans the "urls" directory and reads *all* the files inside it, so
> both files are getting picked up by Nutch and hence you see the old urls
> too. Delete the hidden file urls/seed.txt~ and try a fresh crawl.
>
> Thanks,
> Tejas Patil
>
> On Wed, Dec 26, 2012 at 8:54 PM, Rajani Maski <rajinima...@gmail.com>
> wrote:
>
> > http://localhost:8080/nutch-test-site/chi.html