help, src modify to optimize the crawl

2011-07-20 Thread Cheng Li
Hi, I tried to use Nutch to crawl Craigslist. The seeds I use are:
http://losangeles.craigslist.org/wst/ctd/
http://losangeles.craigslist.org/sfv/ctd/
http://losangeles.craigslist.org/lac/ctd/
http://losangeles.craigslist.org/sgv/ctd/
http://losangeles.craigslist.org/lgb/ctd/

Re: help, src modify to optimize the crawl

2011-07-20 Thread lewis john mcgibbney
I don't think this has anything to do with modifying the crawl source. In fact, it doesn't have anything to do with optimization either. Try using your URLFilters, e.g. the regex filter. It is important to understand which types of pages can be filtered out of a Nutch crawl using the filters provided. HTH
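
To make the URL filter suggestion concrete, here is a rough sketch of how regex filter rules behave. It is not Nutch code. It is a small Python emulation of the rule style used in Nutch's conf/regex-urlfilter.txt, where each rule is a "+" (accept) or "-" (reject) followed by a regular expression and the first rule that matches decides. The rules themselves are hypothetical, written only to fit the five seed URLs above, and the names RULES_TEXT, parse_rules, and accepts are made up for this illustration. Python's re module is also not identical to Java's regex engine, so treat this as a way to reason about rules before putting them in the real config file.

```python
import re

# Hypothetical rules in the style of conf/regex-urlfilter.txt.
# They are an assumption for illustration, not taken from the thread.
RULES_TEXT = r"""
# skip common image and style resources
-\.(gif|jpg|png|css|js|ico)$
# keep pages under the five seed sections
+^http://losangeles\.craigslist\.org/(wst|sfv|lac|sgv|lgb)/ctd/
# reject everything else
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return the verdict of the first matching rule; reject if none match."""
    for accept, pattern in rules:
        if pattern.search(url):  # partial match anywhere in the URL
            return accept
    return False


RULES = parse_rules(RULES_TEXT)

if __name__ == "__main__":
    for url in (
        "http://losangeles.craigslist.org/wst/ctd/",
        "http://losangeles.craigslist.org/about/",
    ):
        print(url, "->", "keep" if accepts(RULES, url) else "drop")
```

Rule order matters here: the reject rule for image suffixes comes first so that it wins over the broader accept rule for the seed sections.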