Hi, Can anyone tel me how to crawl all other pages of same domain. For example i'm feeding a website http://www.techcrunch.com/ in seed.txt.
Following property is added in nutch-site.xml <property> <name>db.ignore.internal.links</name> <value>false</value> <description>If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links. </description> </property> And following is added in regex-urlfilter.txt # accept anything else +. Note: if i add http://www.tutorialspoint.com/ in seed.txt, I'm able to crawl all other pages but not techcrunch.com's pages though it has got many other pages too. Please help..? Thanks, Vivek