I use the prefix and suffix URL filters instead of the regex URL filter. To enable them, make sure the plugin.includes property in your nutch-*.xml file lists the prefix and suffix URL filter plugins. Where you currently have urlfilter-regex, use:
urlfilter-(prefix|suffix)...

Then you will need the prefix-urlfilter.txt and suffix-urlfilter.txt files in the conf directory. Below is a configuration that crawls only http URLs and filters them by suffix. The suffix filter starts by allowing everything and then specifically denies certain file types.

Dennis Kubes

# prefix-urlfilter.txt file starts here
http
# prefix-urlfilter.txt file ends here

# suffix-urlfilter.txt file starts here
# case-insensitive, allow unknown suffixes
+I
# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
.js
# suffix-urlfilter.txt file ends here

cybercouf wrote:
> [nutch 0.8.1]
>
> I want to crawl only web 'html' content, in all xxx-urlfilter.txt I have:
> -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png|swf|iso|pdf|PDF|js|avi|doc)$
>
> and I load only the plugins I need (I think)
> <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
>
> But when I dump a segment or the linkdb, I can see I have lot of outlinks on
> non web-page, like:
>
> outlink: toUrl: http://domain.com/images/img.gif anchor:
> outlink: toUrl: http://domain.com/images/img.jpg anchor:
> outlink: toUrl: http://domain.com/style.css anchor:
>
> Where can I configure nutch to link only web-page?
> I saw the java code in this function:
>
> DOMContentUtils.setConf(Configuration conf)
> ...
> linkParams.put("img", new LinkParams("img", "src", 0));
>
> Maybe I can just comment the line, but it looks not the good way to do it,
> it's better with a configuration file.
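To make the effect of the two filter files in the reply concrete, here is a minimal Python sketch of the rules as described there: a URL must start with a listed prefix, and it is rejected if it ends with a prohibited suffix, compared case-insensitively. This is not Nutch code. The function names are hypothetical, and matching the suffix against the end of the whole URL string is an assumption of this sketch.

```python
# Illustrative model of the prefix/suffix rules from the reply; not Nutch's implementation.

# Contents of prefix-urlfilter.txt: a URL must start with one of these.
ALLOWED_PREFIXES = ("http",)

# Contents of suffix-urlfilter.txt under "+I": unknown suffixes are allowed,
# the listed ones are prohibited, and comparison ignores case.
DENIED_SUFFIXES = (
    ".gif", ".jpg", ".jpeg", ".bmp", ".png", ".ico", ".css", ".sit", ".eps",
    ".wmf", ".zip", ".ppt", ".mpg", ".xls", ".gz", ".tar", ".rpm", ".rm",
    ".tgz", ".mov", ".exe", ".vid", ".ai", ".pdf", ".txt", ".psd", ".js",
)


def passes_prefix(url, prefixes=ALLOWED_PREFIXES):
    """Return True if the URL starts with any allowed prefix."""
    return url.startswith(tuple(prefixes))


def passes_suffix(url, denied=DENIED_SUFFIXES, ignore_case=True):
    """Return True unless the URL ends with a prohibited suffix.

    Assumption: the suffix is matched against the end of the full URL string.
    """
    candidate = url.lower() if ignore_case else url
    suffixes = tuple(s.lower() if ignore_case else s for s in denied)
    return not candidate.endswith(suffixes)


def accept(url):
    """A URL is kept only if both filters let it through."""
    return passes_prefix(url) and passes_suffix(url)


if __name__ == "__main__":
    for url in (
        "http://domain.com/index.html",
        "http://domain.com/images/img.gif",
        "http://domain.com/style.css",
    ):
        print(url, accept(url))
```

Because the prefix file contains only "http", any URL beginning with that string passes the prefix check in this sketch, which includes https URLs.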
