Re: need help about urlfilter

Jason Tsai Wed, 15 Jan 2014 22:10:21 -0800

nobody here?...


On Tue, Jan 14, 2014 at 10:02 AM, Jason Tsai <[email protected]> wrote:

> Hi there,
> I'm new to Nutch and have built Nutch 2.2.1 with Hadoop 1.0.4, including
> HBase.
> I understand that Nutch can filter URLs through regex-urlfilter.txt, but
> every attempt has failed no matter what I try.
> My seed.txt in HDFS has only one URL: http://www.cancer.gov/
> and I'm trying to fetch only pages whose URLs look like
> http://www.cancer.gov/cancertopics/druginfo/example
> I have tried
> +^http://www.cancer.gov/cancertopics/druginfo.*
> or
> +^http://www.cancer.gov/cancertopics/druginfo/([a-z0-9]*\.)
> and so on in regex-urlfilter.txt, but none of them work.
> Can anyone give me a hand?
> Thanks!
>
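
For reference, here is a minimal sketch of how the rules quoted above might be evaluated. It assumes the filter reads rules top to bottom, applies the first rule whose regex is found anywhere in the URL, and rejects a URL that matches no rule. The helpers `load_rules` and `accepts` are hypothetical names used for illustration only and are not part of Nutch. The rule set is one possible configuration to experiment with, not a confirmed fix.

```python
import re


def load_rules(text):
    """Parse filter lines into (accept, compiled_regex) pairs.

    Assumed line format: '+' or '-' followed by a regex.
    Blank lines and lines starting with '#' are skipped.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule (assumed semantics)."""
    for accept, regex in rules:
        if regex.search(url):  # unanchored search; '^' and '$' anchor explicitly
            return accept
    return False  # no rule matched: treat the URL as rejected


# Illustrative rule set; dots are escaped so they match a literal '.'.
RULES = load_rules(r"""
# keep the seed URL itself so that it is not filtered out
+^http://www\.cancer\.gov/$
# keep everything under the drug information section
+^http://www\.cancer\.gov/cancertopics/druginfo/
# reject everything else
-.
""")

for url in (
    "http://www.cancer.gov/",
    "http://www.cancer.gov/cancertopics/druginfo/lungcancer",
    "http://www.cancer.gov/cancertopics",
):
    print(url, "->", "accept" if accepts(RULES, url) else "reject")
```

Under these assumptions, rule order matters: if a catch-all accept line such as `+.` appears before the more specific rules, or remains at the end of the file with no reject rule ahead of it, every URL is still accepted. The seed URL also has to pass some accept rule, otherwise there is nothing to start the crawl from.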
