crawling all links of same domain in nutch in solr

Vivekanand Ittigi Mon, 28 Jul 2014 22:19:25 -0700

Hi,

Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.


I have added the following property to nutch-site.xml:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

And I have added the following to regex-urlfilter.txt:

# accept anything else
+.
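
For reference, my understanding is that the rules in regex-urlfilter.txt are tried from top to bottom, that the first pattern matching a URL decides whether it is accepted (+) or rejected (-), and that a URL matching no rule is dropped. Below is a minimal Python sketch of that behaviour, not Nutch's actual code. The RULES list and the accepts() helper are hypothetical names made up for illustration, and the techcrunch.com pattern is only an assumed example of a stricter same-domain rule, not something I have in my file.

```python
import re

# Hypothetical rule list mirroring regex-urlfilter.txt syntax:
# '+' accepts, '-' rejects, and the first pattern that matches decides.
RULES = [
    ("+", r"^https?://([a-z0-9-]+\.)*techcrunch\.com/"),  # assumed same-domain rule
    ("-", r"."),                                          # reject everything else
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern matches is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched, so the URL is dropped

if __name__ == "__main__":
    for candidate in ("http://www.techcrunch.com/", "http://www.example.org/"):
        print(candidate, accepts(candidate))
```

With only the catch-all "+." rule from my file, accepts() returns True for every URL, so the filter alone should not be what keeps the techcrunch.com pages out.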

Note: if I add http://www.tutorialspoint.com/ to seed.txt, I'm able to
crawl all of its other pages, but with techcrunch.com none of the other
pages are crawled, even though the site has many of them.

Please help.

Thanks,
Vivek
