I don't think this has anything to do with modifying the crawl source. In
fact, it doesn't have anything to do with optimization either. Try using
your URL filters, e.g. the regex URL filter (conf/regex-urlfilter.txt).

It is important to try to understand what types of pages can be filtered
out of a Nutch crawl using the filters provided.
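
For example, here is a minimal sketch of what conf/regex-urlfilter.txt
could look like for this case. It is untested, and it assumes the only
pages you want are the numbered ctd/cto listing pages; the category index
pages (your seeds) are kept so the crawler can still discover the
listings through them:

  # accept individual listing pages, e.g. /lac/ctd/2501038362.html
  +^http://losangeles\.craigslist\.org/[a-z]+/(ctd|cto)/[0-9]+\.html$

  # accept the ctd/cto category index pages (the seeds), so the listing
  # links on them can still be followed
  +^http://losangeles\.craigslist\.org/[a-z]+/(ctd|cto)/

  # reject everything else, e.g. /cta/
  -.

Rules are matched top to bottom and the first match wins, so the
catch-all reject must come last. Also make sure the urlfilter-regex
plugin is listed in plugin.includes in your nutch-site.xml. Bear in mind
the filter only controls which URLs get fetched: if the category index
pages themselves should not appear in your search results, that is an
index-time concern rather than a crawl-time one.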

HTH

On Wed, Jul 20, 2011 at 11:04 AM, Cheng Li <chen...@usc.edu> wrote:

> Hi,
>
>    I tried to use Nutch to crawl craigslist. The seeds I use are:
>
> http://losangeles.craigslist.org/wst/ctd/
> http://losangeles.craigslist.org/sfv/ctd/
> http://losangeles.craigslist.org/lac/ctd/
> http://losangeles.craigslist.org/sgv/ctd/
> http://losangeles.craigslist.org/lgb/ctd/
> http://losangeles.craigslist.org/ant/ctd/
>
> http://losangeles.craigslist.org/wst/cto/
> http://losangeles.craigslist.org/sfv/cto/
> http://losangeles.craigslist.org/lac/cto/
> http://losangeles.craigslist.org/sgv/cto/
> http://losangeles.craigslist.org/lgb/cto/
> http://losangeles.craigslist.org/ant/cto/
>
>
>  What I want to get is a result page like this one, for example,
> http://losangeles.craigslist.org/lac/ctd/2501038362.html, which is a
> specific car-selling page.
>  What I DON'T want to get is a result page like this one, for example,
> http://losangeles.craigslist.org/cta/.
>
>  However, in my query results, I always get results like
> http://losangeles.craigslist.org/cta/.
>
>  Actually, I can get this kind of page from craigslist, but only some of
> them, not all of them. I tried to adjust the crawl command-line
> parameters, but there was not much change.
>
>  So what I plan to do is modify the crawl code in the Nutch source. Where
> can I start? What kind of work can I do to optimize the crawl process in
> the source code?
>
> --
> Cheng Li
>



-- 
*Lewis*
