You can use the prefix and suffix URL filters by making sure the plugin.includes property in your nutch-*.xml file enables both urlfilter plugins, like so:

urlfilter-(prefix|suffix)...

You will then need prefix-urlfilter.txt and suffix-urlfilter.txt files in the conf directory. The configuration below crawls only URLs that begin with http and skips many file types by suffix. The prefix file is an accept list: only URLs that start with one of the listed prefixes are crawled. The suffix file starts by allowing everything (the +I line) and then specifically denies the listed file types.

Dennis

# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
# suffix-urlfilter.txt file ends here
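
To make the combined effect of the two files easier to see, here is a rough Python sketch of the behaviour described above. It is not Nutch's own code: the function names are made up for illustration, the suffix list is shortened, a URL must pass both filters, and the suffix test is a plain "ends with" check on the whole URL. Rejecting unknown suffixes when no +/- mode line is present is an assumption of this sketch only.

```python
# Illustrative model of the two filter files; not the Nutch implementation.

PREFIX_FILE = """
# prefix-urlfilter.txt
http
"""

SUFFIX_FILE = """
# suffix-urlfilter.txt (shortened)
# case-insensitive, allow unknown suffixes
+I
.gif
.jpg
.zip
.pdf
"""


def load_rules(text):
    """Return the non-empty, non-comment lines of a filter file."""
    lines = (ln.strip() for ln in text.splitlines())
    return [ln for ln in lines if ln and not ln.startswith("#")]


def prefix_accepts(url, prefixes):
    """Accept only URLs that start with one of the listed prefixes."""
    return any(url.startswith(p) for p in prefixes)


def suffix_accepts(url, rules):
    """Apply the suffix rules; the first rule may be a mode line like '+I'."""
    flags, suffixes = "", rules
    if rules and rules[0][0] in "+-":
        flags, suffixes = rules[0], rules[1:]
    # '+' allows unknown suffixes and denies the listed ones.
    # With no mode line this sketch rejects unknown suffixes (assumption).
    allow_unknown = flags.startswith("+")
    if "I" in flags[1:]:
        # 'I' makes the comparison case-insensitive.
        url = url.lower()
        suffixes = [s.lower() for s in suffixes]
    listed = any(url.endswith(s) for s in suffixes)
    return not listed if allow_unknown else listed


def accepts(url):
    """A URL is crawled only if both filters accept it."""
    return (prefix_accepts(url, load_rules(PREFIX_FILE))
            and suffix_accepts(url, load_rules(SUFFIX_FILE)))


if __name__ == "__main__":
    for u in ("http://example.com/index.html",
              "http://example.com/logo.GIF",
              "ftp://example.com/index.html"):
        print(u, accepts(u))
```

In this model the first URL is accepted, the second is dropped by the suffix filter (the match ignores case), and the third is dropped by the prefix filter. Note that a prefix of http also lets https URLs through, since the test is a simple string prefix match.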

[EMAIL PROTECTED] wrote:
hi

I want to know whether nutch can be set to crawl specified type files and 
specified name files?

for example: If I crawl a website that contains many document files , and I 
want nutch only crawl pdf and doc files but not html files,how to do?

and another question is can I want nutch only to crawl specified name files 
like index.htm or so ?

thanks in advance

