Proper regex-urlfilter syntax to filter out certain numbers in urls

fxmy wang Mon, 12 Jan 2015 21:20:05 -0800

Hi Nutch users,


We are trying to crawl a forum site with the help of Nutch-2.2.1.

The URLs look like far.boo.com/f?kw=SomeTopic&pn=150
where pn is the page number.

The goal is to filter out old posts: say, I want every page with pn>1000
filtered out.

So in conf/regex-urlfilter.txt I added this above the '# accept anything
else' line.

        -[*!@]                    # skip certain queries
        -pn=[0-9]{4,}$       # filter out pn>1000
        +.                          # accept anything else

And... no effect :(
After several generate-fetch-parse-updatedb cycles, the URL
far.boo.com/f?kw=SomeTopic&pn=649800 still got fetched.

To verify further, I ran the command below [0]
        bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
and pasted 'far.boo.com/f?kw=SomeTopic&pn=649800' in. The output was
        +far.boo.com/f?kw=SomeTopic&pn=649800
so it seems Nutch didn't filter it out.
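
One thing that can be checked outside Nutch is whether the rule line still matches once the inline comment is taken into account. The sketch below uses Python's `re` module as a stand-in for the Java regex engine Nutch uses (the two should agree for a pattern this simple). It assumes, and this is an assumption worth confirming against the Nutch source, that the filter treats everything after the leading `+`/`-` on a rule line as the pattern, trailing spaces and `#` comment included, and that the rule sits in the file exactly as shown above. The helper name `rule_matches` is made up for illustration.

```python
import re

URL = "far.boo.com/f?kw=SomeTopic&pn=649800"


def rule_matches(rule_line: str, url: str) -> bool:
    """Return True if the pattern part of a rule line is found in url.

    Assumption: the first character is the +/- sign and the rest of the
    line, taken verbatim, is the regular expression.
    """
    pattern = rule_line[1:]
    return re.search(pattern, url) is not None


# Rule with the comment on the same line, as in the snippet above.
with_comment = "-pn=[0-9]{4,}$       # filter out pn>1000"
# Same rule with nothing after the pattern.
bare = "-pn=[0-9]{4,}$"

print(rule_matches(with_comment, URL))  # False
print(rule_matches(bare, URL))          # True
```

If that assumption holds, the spaces and comment text after `$` become part of the pattern and can never match, so the `-` rule is skipped and `+.` accepts the URL. Moving each comment onto its own line starting with `#` would be the thing to try.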

What is the proper way to deal with numbers in URLs?
Did I do something wrong?
Any advice would be much appreciated.
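
On the numbers themselves: `[0-9]{4,}` counts digits rather than comparing values, so it also catches pn=1000 itself and zero-padded values such as pn=0150. If the cut-off has to be strictly pn>1000, one possible approach is to spell out the ranges in the pattern. The sketch below is again Python standing in for Java regex, and `PN_OVER_1000` and `is_old_page` are names made up here, not anything from Nutch.

```python
import re

# Matches a URL that ends in pn=<number> where number > 1000.
# Leading zeros are allowed and ignored.
PN_OVER_1000 = re.compile(
    r"pn=0*(?:"
    r"100[1-9]"          # 1001-1009
    r"|10[1-9][0-9]"     # 1010-1099
    r"|1[1-9][0-9]{2}"   # 1100-1999
    r"|[2-9][0-9]{3}"    # 2000-9999
    r"|[1-9][0-9]{4,}"   # 10000 and up
    r")$"
)


def is_old_page(url: str) -> bool:
    """Return True if the URL ends in a pn value greater than 1000."""
    return PN_OVER_1000.search(url) is not None


for pn in (150, 1000, 1001, 649800):
    url = f"far.boo.com/f?kw=SomeTopic&pn={pn}"
    print(pn, is_old_page(url))
# 150 False
# 1000 False
# 1001 True
# 649800 True
```

Written as a single rule line, the same pattern would be
`-pn=0*(?:100[1-9]|10[1-9][0-9]|1[1-9][0-9]{2}|[2-9][0-9]{3}|[1-9][0-9]{4,})$`.
For a rough cut-off the simpler `-pn=[0-9]{4,}$` is probably good enough.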

----------------------------------------------------------------------
[0]http://www.mail-archive.com/user%40nutch.apache.org/msg09536.html
----------------------------------------------------------------------

BR, fxmy
