I use the prefix and suffix url filters instead of the regex url filter.

To enable them, make sure the plugin.includes property in your nutch-*.xml
file lists the prefix and suffix filters in place of urlfilter-regex, which
you currently have:

urlfilter-(prefix|suffix)...
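
As a rough sketch, assuming the rest of your plugin list stays exactly as in
the message quoted below and that the override goes in conf/nutch-site.xml
(an assumption on my part, any nutch-*.xml that Nutch loads should do), the
property might then look like this:

```xml
<!-- Sketch only: the plugin list is copied from the quoted message, -->
<!-- with urlfilter-regex swapped for the prefix and suffix filters. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-(prefix|suffix)|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
</property>
```

Adjust the other plugins to match your own setup; only the urlfilter part
matters here.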


You will then need prefix-urlfilter.txt and suffix-urlfilter.txt files in
the conf directory. Below is a configuration that crawls only URLs starting
with http and skips those ending in certain suffixes. The suffix file starts
by allowing everything, case-insensitively (the +I line), and then denies
the listed file types.

Dennis Kubes

# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
.js
# suffix-urlfilter.txt file ends here

cybercouf wrote:
> [nutch 0.8.1]
> 
> I want to crawl only web (HTML) content. In all xxx-urlfilter.txt files I have:
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$
> 
> and I load only the plugins I need (I think)
> <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
> 
> But when I dump a segment or the linkdb, I can see I have a lot of outlinks
> to things that are not web pages, like:
> 
>   outlink: toUrl: http://domain.com/images/img.gif anchor: 
>   outlink: toUrl: http://domain.com/images/img.jpg anchor: 
>   outlink: toUrl: http://domain.com/style.css anchor: 
> 
> Where can I configure Nutch to keep only links to web pages?
> I saw this in the Java code:
> 
> DOMContentUtils.setConf(Configuration conf)
> ...
> linkParams.put("img", new LinkParams("img", "src", 0));
> 
> Maybe I can just comment out that line, but that doesn't look like the right
> way to do it. A configuration file would be better.

Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
