Hi there
I'm new to Nutch and have built Nutch 2.2.1 with Hadoop 1.0.4, including HBase.
I understand that Nutch can filter URLs by editing regex-urlfilter.txt, but
my attempts have failed every time, no matter what I try.
My seed.txt in HDFS contains only one URL: http://www.cancer.gov/
and I'm trying to fetch only pages whose URLs look like
http://www.cancer.gov/cancertopics/druginfo/example
I have tried
+^http://www.cancer.gov/cancertopics/druginfo.*
or
+^http://www.cancer.gov/cancertopics/druginfo/([a-z0-9]*\.)
and so on in regex-urlfilter.txt, but none of them work.
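
To make the two attempts easier to reason about outside Nutch, here is a minimal sketch of how such rules might be evaluated. It assumes that rules are applied top to bottom, that the first matching rule decides, that a pattern counts as a match if it is found anywhere in the URL, and that a URL matching no rule is rejected. These are assumptions about regex-urlfilter.txt, not confirmed behaviour. The function name `accepts` and the rule lists are hypothetical, and Python's `re` module stands in for Java's regex engine.

```python
import re

def accepts(rules, url):
    """Return True if the first matching rule starts with '+'.

    Assumption: first match wins, and a URL that matches no rule is rejected.
    """
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        # re.search looks for the pattern anywhere in the URL (assumed semantics)
        if re.search(pattern, url):
            return sign == "+"
    return False

# The two rules tried above, each on its own (hypothetical rule lists)
RULES_A = ["+^http://www.cancer.gov/cancertopics/druginfo.*"]
RULES_B = [r"+^http://www.cancer.gov/cancertopics/druginfo/([a-z0-9]*\.)"]

SEED = "http://www.cancer.gov/"
TARGET = "http://www.cancer.gov/cancertopics/druginfo/example"

for name, rules in (("A", RULES_A), ("B", RULES_B)):
    print(name, "seed:", accepts(rules, SEED), "target:", accepts(rules, TARGET))
```

Under those assumptions, neither rule accepts the seed URL itself, and the second rule also rejects the target URL because it requires a literal dot after the last path segment.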
Can anyone give me a hand?
Thanks!