1. Try "+\.(html|htm)$"

    You can use this command to test your settings:

    bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
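
    If it helps, here is a small standalone Java sketch of what that rule
accepts. It uses plain java.util.regex rather than Nutch itself, and it
assumes a rule applies when its pattern is found anywhere in the URL. The
class name and the sample URLs are made up for illustration.

```java
import java.util.regex.Pattern;

final class UrlRuleSketch {
    // Same expression as the rule above, without the leading "+" sign.
    private static final Pattern HTML_ONLY = Pattern.compile("\\.(html|htm)$");

    // Assumption: the rule applies when the pattern is found anywhere in the URL.
    static boolean accepts(String url) {
        return HTML_ONLY.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, only for illustration.
        String[] urls = {
            "http://example.com/index.html",
            "http://example.com/page.htm",
            "http://example.com/logo.png",
            "http://example.com/page.html?id=1",
            "http://example.com/"
        };
        for (String url : urls) {
            // Prefix each URL with "+" (accepted) or "-" (rejected).
            System.out.println((accepts(url) ? "+" : "-") + url);
        }
    }
}
```

    Only the URLs that end in .html or .htm are accepted. A URL with a
query string after the extension, or one with no extension at all (such as
a bare host name), does not match this expression.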

2. To crawl only content containing the keyword "semantic":

    Nutch does not currently support this through configuration, but you
can extend HttpBase, as the protocol-httpclient plugin does, and filter out
any content that does not contain the keyword "semantic".

    You can check this doc to see how to write a plugin for Nutch [0].

[0] http://wiki.apache.org/nutch/WritingPluginExample
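
    The keyword check itself could look roughly like the sketch below.
This is not Nutch API: the class and method names are hypothetical, the
content is assumed to be UTF-8 text, and where you call it inside your
plugin is up to you.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class KeywordCheckSketch {
    // Hypothetical helper: true if the fetched bytes contain the keyword,
    // ignoring case. Assumes the content is UTF-8 encoded text.
    static boolean containsKeyword(byte[] content, String keyword) {
        if (content == null || keyword == null || keyword.isEmpty()) {
            return false;
        }
        String text = new String(content, StandardCharsets.UTF_8);
        return text.toLowerCase(Locale.ROOT)
                   .contains(keyword.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        byte[] page = "<html><body>The Semantic Web</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(containsKeyword(page, "semantic"));
    }
}
```

    A plain substring match like this also hits the keyword inside markup
or scripts, so you may want to run it on the extracted text instead.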


On Wed, Oct 16, 2013 at 10:51 PM, ozzy19 <[email protected]> wrote:

> Hi all, I would like to use Nutch crawler to only get pages with extension
> html-htm with keyword: "semantic".
> How can I configure it?
> I set the file nutch-site.xml the property "urlfilter.regex.file" and value
> "regex-urlfilter.txt" and the file "regex-urlfilter.txt" I left it as it
> is,
> by changing only the last line, precisely I deleted "+." and in its place I
> added "+\.(html|htm)" but it seems that does not work!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/help-me-with-nutch-tp4095914.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Don't Grow Old, Grow Up... :-)
