Hi,
I'm testing Nutch and so far everything works fine (OK, some hours spent
reading and testing, testing, testing, but that's normal).
I have a noob question: I have to crawl websites only within a ccTLD.

In crawl-urlfilter.txt, should I write it like this:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*.ch/


or like this:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*ch/


The difference is the dot before the "ch" ccTLD. I mean, does the dot before
the closing bracket already separate the ccTLD from the name (or the root
from a subdomain), or should I add one as in the first example? In the
installation guide I can see:

+^http://([a-z0-9]*\.)*apache.org/

Does this crawl every subdomain of apache.org (xxx.apache.org), or does it
crawl apache.org itself?
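
One way to check the candidate patterns above is to run them against a few
sample URLs. The sketch below uses Python's re module only as a stand-in
(Nutch applies Java regular expressions, but patterns this simple should
behave the same in both). The sample URLs and the helper name accepts are
made up for illustration, and the leading "+" is left out because it is the
accept marker of the filter file, not part of the regex.

```python
import re

# Candidate patterns, without the leading "+" accept marker.
PATTERNS = {
    "extra dot": r"^http://([a-z0-9]*\.)*.ch/",
    "no extra dot": r"^http://([a-z0-9]*\.)*ch/",
    "apache.org": r"^http://([a-z0-9]*\.)*apache.org/",
}

# Made-up sample URLs, for illustration only.
SAMPLE_URLS = [
    "http://example.ch/",
    "http://www.example.ch/",
    "http://my-site.ch/",
    "http://www.example.com/",
    "http://apache.org/",
    "http://lucene.apache.org/",
]


def accepts(name, url):
    """Hypothetical helper: True if the named pattern matches the URL."""
    return re.search(PATTERNS[name], url) is not None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        matched = [name for name in PATTERNS if accepts(name, url)]
        print(url, "->", matched or "no match")
```

In this sketch the variant without the extra dot accepts both
http://example.ch/ and http://www.example.ch/, because the escaped dot inside
the group already separates each label from the next. The variant with the
extra dot rejects both, since the unescaped "." demands one more arbitrary
character directly before "ch". The apache.org pattern accepts the bare domain
as well as its subdomains, because the group may repeat zero or more times.
Note that [a-z0-9] does not cover hyphens, so a host like my-site.ch is not
matched by either .ch variant.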

Many thanks for any help
Mauro
