Hi,
I tried to use Nutch to crawl Craigslist. The seeds I use are:
http://losangeles.craigslist.org/wst/ctd/
http://losangeles.craigslist.org/sfv/ctd/
http://losangeles.craigslist.org/lac/ctd/
http://losangeles.craigslist.org/sgv/ctd/
http://losangeles.craigslist.org/lgb/ctd/
I don't think this has anything to do with modifying the crawl src. It
doesn't in fact have anything to do with optimization either. Try using your
URLFilters, e.g. regex.
It is important to try and understand what type of pages we can filter out
from a Nutch crawl using the filters provided.
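To get a feel for how such rules behave before touching your config, here is a rough Python sketch that mimics the way I understand the regex URL filter to evaluate its rules: each rule is a '+' (accept) or '-' (reject) followed by a regex, rules are tried top to bottom, the first one that matches decides, and a URL that matches nothing is dropped. The three patterns below are my own guesses based on your seed list, not rules from any shipped Nutch config, and the `accept` helper is hypothetical.

```python
import re

# Hypothetical rules written in the style of Nutch's regex-urlfilter.txt.
# '+' means accept, '-' means reject; order matters (first match wins).
RULES = [
    ("-", r"\.(gif|jpg|png|css|js)$"),  # assumed: skip images and static assets
    ("+", r"^http://losangeles\.craigslist\.org/(wst|sfv|lac|sgv|lgb)/ctd/"),
    ("-", r"."),                        # reject everything else
]

def accept(url):
    """Return True if the first rule whose regex is found in url is a '+' rule."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: treat the URL as filtered out

if __name__ == "__main__":
    for u in [
        "http://losangeles.craigslist.org/wst/ctd/",
        "http://losangeles.craigslist.org/about/",
        "http://losangeles.craigslist.org/wst/ctd/pic.jpg",
    ]:
        print(u, "->", "keep" if accept(u) else "drop")
```

If the sketch keeps and drops the URLs you expect, the same three lines (sign immediately followed by the regex, one rule per line) should be a reasonable starting point for your own regex filter file, though Java and Python regex syntax differ in some corners, so test them in Nutch as well.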
HTH