[nutch 0.8.1]

I want to crawl only HTML web pages. In every xxx-urlfilter.txt file I have:
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$

and, as far as I can tell, I load only the plugins I need:
<value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>

But when I dump a segment or the linkdb, I see a lot of outlinks to
resources that are not web pages, such as:

  outlink: toUrl: http://domain.com/images/img.gif anchor: 
  outlink: toUrl: http://domain.com/images/img.jpg anchor: 
  outlink: toUrl: http://domain.com/style.css anchor: 
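
To rule out a mistake in the pattern itself, here is a minimal Java sketch (plain java.util.regex, not Nutch code) that tests the exclusion rule against the URLs from the dump. The class name UrlFilterRuleCheck and the index.html URL are made up for illustration, and it assumes a rule is tested with partial-match (find) semantics. It only checks the regex; it says nothing about the stage at which Nutch applies the filter.

```java
import java.util.regex.Pattern;

public class UrlFilterRuleCheck {
    // Suffix pattern copied from the '-' rule in the filter file quoted above.
    static final Pattern EXCLUDED = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz"
        + "|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$");

    // True when the URL matches the exclusion rule.
    // Assumption: the rule is evaluated with find(), not a full match.
    static boolean isExcluded(String url) {
        return EXCLUDED.matcher(url).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://domain.com/images/img.gif",
            "http://domain.com/images/img.jpg",
            "http://domain.com/style.css",
            "http://domain.com/index.html" // hypothetical page URL for contrast
        };
        for (String url : urls) {
            System.out.println(url + " excluded=" + isExcluded(url));
        }
    }
}
```

Under that assumption the three dumped URLs match the rule and the HTML page does not, so the pattern itself looks correct.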

Where can I configure Nutch so that it records outlinks only to web pages?
I found this in the Java code:

DOMContentUtils.setConf(Configuration conf)
...
linkParams.put("img", new LinkParams("img", "src", 0));

Maybe I could just comment out that line, but that does not seem like the
right way to do it. A setting in a configuration file would be better.
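
To make the idea concrete, here is a rough standalone sketch of what I mean by a configuration-driven approach. It is not Nutch code: the class LinkTagConfigSketch, the method names, and the comma-separated ignore list are all hypothetical, and apart from the img/src pair shown above, the tag entries are assumptions for illustration.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class LinkTagConfigSketch {
    // Hypothetical defaults: tag name -> attribute that holds the link.
    // Only the img/src pair comes from the code quoted above.
    static Map<String, String> defaultLinkTags() {
        Map<String, String> tags = new LinkedHashMap<>();
        tags.put("a", "href");
        tags.put("img", "src");
        tags.put("link", "href");
        return tags;
    }

    // Returns a copy of the tag map without the tags named in a
    // comma-separated ignore list (e.g. a value read from a config file).
    static Map<String, String> applyIgnoreList(Map<String, String> tags,
                                               String ignoreList) {
        Set<String> ignored = new HashSet<>();
        for (String name : ignoreList.split(",")) {
            String trimmed = name.trim().toLowerCase();
            if (!trimmed.isEmpty()) {
                ignored.add(trimmed);
            }
        }
        Map<String, String> result = new LinkedHashMap<>(tags);
        result.keySet().removeAll(ignored);
        return result;
    }

    public static void main(String[] args) {
        // With "img, link" ignored, only the anchor tag remains.
        System.out.println(applyIgnoreList(defaultLinkTags(), "img, link"));
    }
}
```

The point is that the set of link-producing tags would come from a setting instead of being edited in the source.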