Hi David, the easiest way is to delete the CrawlDb and start the crawl from scratch. Since it's a site crawl, this should be feasible, at least from time to time. Then delete from the index any documents which haven't been updated by the new crawl.
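A minimal sketch of that workflow, assuming a standard Nutch 1.x layout with a seed list in urls/, crawl data under crawl/, a Solr core named "nutch", and an indexed timestamp field "tstamp" (all of these names are illustrative, not taken from your setup):

```shell
#!/bin/sh
# Record when the fresh crawl starts, so we can later remove
# anything in Solr that was NOT (re-)indexed by this crawl.
CRAWL_START=$(date -u +%Y-%m-%dT%H:%M:%SZ)

# 1. Drop the old CrawlDb (and segments/LinkDb) so Nutch forgets
#    the stale /tespage/ link entirely.
rm -rf crawl/crawldb crawl/segments crawl/linkdb

# 2. Re-crawl the site from the seed list and index into Solr
#    (exact bin/crawl arguments vary slightly between Nutch versions;
#    check bin/crawl --help for yours).
bin/crawl -i urls/ crawl/ 3

# 3. Delete documents the new crawl did not touch, i.e. whose
#    tstamp predates the crawl start.
curl "http://localhost:8983/solr/nutch/update?commit=true" \
  -H "Content-Type: text/xml" \
  --data-binary "<delete><query>tstamp:[* TO ${CRAWL_START}]</query></delete>"
```

Since the old /tespage/ URL is never re-fetched in the fresh crawl, its tstamp stays older than CRAWL_START and the delete-by-query removes it from the index.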
A more sophisticated solution is not yet ready, see https://issues.apache.org/jira/browse/NUTCH-1932

Best,
Sebastian

On 07/27/2017 10:11 AM, d.ku...@technisat.de wrote:
> Hey,
>
> currently I'm working on Nutch with Solr for our company pages.
>
> Assume the following situation:
> We have a website:
>
> www.mysite.lol
>
> On this site there is a link:
> www.mysite.lol/tespage/3512-1564/
>
> As you can see there is a typo; it should be /testpage/:
>
> www.mysite.lol/testpage/3512-1564/
>
> As our framework doesn't care about the text before the ID, we could type
> anything we want and the page would still be displayed because of the ID. That is
> why both links are fine and there is no 404.
> If I change the link on the main page to the correct one, let Nutch crawl
> the site again, and send it to Solr, the old one is still found.
>
> So the link
> www.mysite.lol/tespage/3512-1564/
> is still in the Nutch db, because the link is valid --> no 404. But there is
> no main page pointing to this URL anymore. How do I tell Nutch to ignore pages
> which no longer have any links pointing to them?
> Basically --> revalidating links and removing pages nothing links to?
>
>
>
> Best regards
> David Kumar
>
> Senior Software Engineer Java, B. Sc.
> Projektmanager PIM
> Abteilung Infotech
> TechniSat Digital GmbH
> Julius-Saxler-Straße 3
> TechniPark
> D-54550 Daun / Germany
>
> Tel.: + 49 (0) 6592 / 712 -2826
> Fax: + 49 (0) 6592 / 712 -2829
>
> www.technisat.com/de_DE/
> www.facebook.com/technisat