Hi,

Can anyone tell me how to crawl all the other pages of the same domain?
For example, I'm feeding the website http://www.techcrunch.com/ in seed.txt.

The following property is added to nutch-site.xml:

<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

And the following is added to regex-urlfilter.txt:

# accept anything else
+.
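
To show what I expect this rule to do, here is a minimal Python sketch of
how I understand the filter to work. It is not Nutch code: the accepts()
helper and the example URL are made up for illustration, and it assumes
that rules are checked top to bottom, that the first matching pattern
decides, and that a URL matching no rule is dropped.

```python
import re

def accepts(rules, url):
    """Hypothetical helper: True if the first matching rule is a '+' rule."""
    for line in rules:
        line = line.strip()
        # Skip blank lines and comments, as in regex-urlfilter.txt.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        # Assumption: the pattern may match anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is ignored.
    return False

# The rules exactly as I have them in regex-urlfilter.txt.
RULES = ["# accept anything else", "+."]

print(accepts(RULES, "http://www.techcrunch.com/about/"))
```

If that reading is right, the filter should let every techcrunch.com URL
through, so I don't think the URL filter is what is blocking the crawl.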

Note: if I add http://www.tutorialspoint.com/ to seed.txt, I'm able to
crawl all of its other pages, but not techcrunch.com's pages, even though
that site has many other pages too.

Please help.

Thanks,
Vivek
