Crawling with nutch, check Links

d.ku...@technisat.de Thu, 27 Jul 2017 01:12:08 -0700

Hey,

I'm currently working with Nutch and Solr for our company pages.


Assuming the following situation:
We have a website:

www.mysite.lol

On this site there is a link:
www.mysite.lol/tespage/3512-1564/

As you can see, there is a typo; it should be /testpage/:

www.mysite.lol/testpage/3512-1564/

As our framework doesn't care about the text before the ID, we could type 
anything we want and the page would still be displayed because of the ID. That is 
why both links work and there is no 404.
If I change the link on the main page to the correct one, let Nutch crawl the 
site again, and send it to Solr, the old URL is still found.
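
To make the ID behaviour concrete, here is a minimal sketch in Python. It is not our framework's code and not Nutch code; the function name `page_id` and the assumed ID format (two digit groups joined by a hyphen) are only for illustration. It shows that both URLs resolve to the same ID, which is why neither returns a 404.

```python
import re
from urllib.parse import urlsplit

# Assumed ID format: two digit groups joined by a hyphen, as in "3512-1564".
ID_PATTERN = re.compile(r"^\d+-\d+$")


def page_id(url):
    """Return the trailing ID segment of a URL path, or None if there is none."""
    # urlsplit only separates host and path reliably when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if segments and ID_PATTERN.match(segments[-1]):
        return segments[-1]
    return None


old = page_id("www.mysite.lol/tespage/3512-1564/")
new = page_id("www.mysite.lol/testpage/3512-1564/")
# Both spellings carry the same ID, so the framework would serve the same page.
print(old, new, old == new)
```

A key like this could in principle be used to treat the two URLs as duplicates, but whether that fits a given Nutch setup is an open question.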

So the link
www.mysite.lol/tespage/3512-1564/
is still in the Nutch db, because the link is valid --> no 404. But the 
main page no longer points to this URL. How do I tell Nutch to ignore pages 
that have no links pointing to them?
Basically --> revalidating links and removing pages without inbound links?
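
What I mean by "pages without inbound links" can be sketched as follows. This is plain Python over an in-memory link graph, not the Nutch API; the names `reachable_from` and `find_orphans` and the sample data are hypothetical. The idea is to walk the links from the seed URLs and treat every stored URL that is never reached as a candidate for removal.

```python
from collections import deque


def reachable_from(seeds, outlinks):
    """Breadth-first walk over the link graph, starting at the seed URLs."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in outlinks.get(url, ()):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def find_orphans(known_urls, seeds, outlinks):
    """Stored URLs that can no longer be reached from any seed."""
    return set(known_urls) - reachable_from(seeds, outlinks)


# Hypothetical state after the link on the main page has been corrected.
seeds = ["http://www.mysite.lol/"]
outlinks = {
    "http://www.mysite.lol/": ["http://www.mysite.lol/testpage/3512-1564/"],
}
known_urls = [
    "http://www.mysite.lol/",
    "http://www.mysite.lol/testpage/3512-1564/",
    "http://www.mysite.lol/tespage/3512-1564/",  # stale entry from the earlier crawl
]

orphans = find_orphans(known_urls, seeds, outlinks)
print(sorted(orphans))
```

In this example only the misspelled /tespage/ URL is reported, because nothing links to it any more even though it still answers without a 404.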



Kind regards
David Kumar

Senior Software Engineer Java, B. Sc.
Project Manager PIM
Infotech Department
TechniSat Digital GmbH
Julius-Saxler-Straße 3
TechniPark
D-54550 Daun / Germany

Tel.: + 49 (0) 6592 / 712 -2826
Fax: + 49 (0) 6592 / 712 -2829

www.technisat.com/de_DE/
www.facebook.com/technisat
