Nobody here?
On Tue, Jan 14, 2014 at 10:02 AM, Jason Tsai <[email protected]> wrote:
> Hi there,
>
> I'm new to Nutch and I have built Nutch 2.2.1 with Hadoop 1.0.4,
> including HBase.
>
> I understand that Nutch can filter URLs by editing regex-urlfilter.txt,
> but it fails every time no matter what I try.
>
> My seed.txt in HDFS has only one URL: http://www.cancer.gov/
> and I'm trying to get only the pages whose URLs look like
> http://www.cancer.gov/cancertopics/druginfo/example
>
> I have tried
> +^http://www.cancer.gov/cancertopics/druginfo.*
> or
> +^http://www.cancer.gov/cancertopics/druginfo/([a-z0-9]*\.)
> and so on in regex-urlfilter.txt, but none of them work.
>
> Can anyone give me a hand?
> Thanks!
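
One way to narrow this down is to test the two rules outside Nutch. The sketch below is not Nutch code. It assumes the filter tries rules in order, lets the first matching rule decide, drops any URL that matches no rule, and uses "find"-style (unanchored) matching; the helper name accepts() is hypothetical, and Python's re module only approximates Java regex. The lungcancer URL is used purely as a sample target.

```python
import re

def accepts(rules, url):
    """Return True if url passes the rules (first matching rule decides).

    rules is a list of (sign, pattern) pairs, sign being "+" or "-".
    A URL that matches no rule is rejected (assumed behaviour).
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

# The two rules from the message, each tried on its own.
rule_a = [("+", r"^http://www.cancer.gov/cancertopics/druginfo.*")]
rule_b = [("+", r"^http://www.cancer.gov/cancertopics/druginfo/([a-z0-9]*\.)")]

seed = "http://www.cancer.gov/"
target = "http://www.cancer.gov/cancertopics/druginfo/lungcancer"

for name, rules in (("rule_a", rule_a), ("rule_b", rule_b)):
    print(name, "seed:", accepts(rules, seed), "target:", accepts(rules, target))
```

Under those assumptions neither rule accepts the seed URL itself, so if the filter is also applied to the seed, nothing would be fetched at all. The second rule additionally requires a literal "." after the path segment, so it would not accept a URL such as the sample target either.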

