Hi David,

The easiest way is to delete the CrawlDb and to start the crawl from scratch.
Since it's a site crawl, this should be feasible at least from time to time.
Then delete the documents from the index which haven't been updated by the re-crawl.
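
For example, something like this (just a sketch - it assumes the default
crawl directory layout, a Solr core named "nutch", the "tstamp" field
written by the index-basic plugin, and the bin/crawl options of a recent
1.x release, so adjust names, paths and the cutoff to your setup):

  # drop the old CrawlDb (or the whole crawl dir) and re-crawl from the seeds
  rm -r crawl/crawldb
  bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch -s urls/ crawl/ 2

  # then delete everything from Solr that the fresh crawl didn't update,
  # e.g. by timestamp (field name and time window are only examples)
  curl 'http://localhost:8983/solr/nutch/update?commit=true' \
    -H 'Content-Type: text/xml' \
    --data-binary '<delete><query>tstamp:[* TO NOW-7DAYS]</query></delete>'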

A more sophisticated solution is not yet ready, see
  https://issues.apache.org/jira/browse/NUTCH-1932

Best,
Sebastian

On 07/27/2017 10:11 AM, d.ku...@technisat.de wrote:
> Hey,
> 
> Currently I'm working on Nutch with Solr for our company pages.
> 
> Assuming the following situation:
> We have a website:
> 
> www.mysite.lol
> 
> On this site there is a link:
> www.mysite.lol/tespage/3512-1564/
> 
> As you can see there is a typo; it should be /testpage/:
> 
> www.mysite.lol/testpage/3512-1564/
> 
> As our framework doesn't care about the text before the ID, we could type
> anything we want and the page would still be displayed because of the ID. That is
> why both links are fine and there is no 404.
> If I change the link on the main page to the correct one, let Nutch crawl
> the site again, and send it to Solr, the old one is still found.
> 
> So the link
> www.mysite.lol/tespage/3512-1564/
> is still in the Nutch db, because the link is valid --> no 404. But there is
> no main page pointing to it anymore. How do I tell Nutch to ignore pages
> which don't have any link pointing to them?
> Basically --> revalidating links and removing pages without links to them?
> 
> 
> 
> Kind regards
> David Kumar
> 
> Senior Software Engineer Java, B. Sc.
> Project Manager PIM
> Infotech Department
> TechniSat Digital GmbH
> Julius-Saxler-Straße 3
> TechniPark
> D-54550 Daun / Germany
> 
> Tel.: + 49 (0) 6592 / 712 -2826
> Fax: + 49 (0) 6592 / 712 -2829
> 
> www.technisat.com/de_DE/
> www.facebook.com/technisat
> 
> 
