[nutch 0.8.1] I want to crawl only web (HTML) content. In every xxx-urlfilter.txt file I have:

    -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$

and I load only the plugins I need (I think):

    <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>

But when I dump a segment or the linkdb, I see a lot of outlinks to resources that are not web pages, such as:

    outlink: toUrl: http://domain.com/images/img.gif anchor:
    outlink: toUrl: http://domain.com/images/img.jpg anchor:
    outlink: toUrl: http://domain.com/style.css anchor:

Where can I configure Nutch to record outlinks only to web pages? I looked at the Java code and found this in DOMContentUtils.setConf(Configuration conf):

    ...
    linkParams.put("img", new LinkParams("img", "src", 0));

I could simply comment out that line, but that does not seem like the right way to do it; a setting in a configuration file would be better.

--
View this message in context: http://www.nabble.com/How-to-avoid-outlinks-on-jpg-css-...---tf3374937.html#a9391965
Sent from the Nutch - User mailing list archive at Nabble.com.
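
For illustration only, here is a minimal standalone sketch in plain Java (not Nutch code) of what applying the same suffix rule to a list of outlinks would do. The class name OutlinkSuffixFilter, its two methods, and the http://domain.com/index.html URL are hypothetical and exist only for this example; it does not show whether or where Nutch 0.8.1 applies its URL filters to outlinks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: applies the suffix rule from the
// urlfilter file (without its leading "-") to a list of outlink URLs.
public class OutlinkSuffixFilter {

    // Same alternation as the "-" (reject) rule quoted in the post.
    private static final Pattern REJECTED_SUFFIX = Pattern.compile(
        "\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz"
        + "|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$");

    // True when the URL ends with one of the rejected suffixes.
    static boolean isRejected(String url) {
        return REJECTED_SUFFIX.matcher(url).find();
    }

    // Returns only the outlinks that the suffix rule does not reject.
    static List<String> keepPageLinks(List<String> outlinks) {
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (!isRejected(url)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = Arrays.asList(
            "http://domain.com/images/img.gif",
            "http://domain.com/images/img.jpg",
            "http://domain.com/style.css",
            "http://domain.com/index.html"); // hypothetical page URL
        for (String url : keepPageLinks(outlinks)) {
            System.out.println(url); // prints http://domain.com/index.html
        }
    }
}
```

The three URLs from the dump all match the reject pattern, so the rule itself covers them; the open question is only whether it is consulted when outlinks are recorded.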
