Hi Cheng,

Please see this wiki page for some references on optimization [1].

I can see your problem though. I think a possible solution may be to have two
seed directories, with a Nutch setup tailored specifically to each one. This
way we can get the best results by treating sites on a case-by-case basis.
Please feel free to add any further comments to this wiki page based upon
your personal experiences moving towards optimization.
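
As a rough sketch of the two-seed-directory idea, something like the script
below could work. It only reuses the crawl options from your own command. The
directory names and the depth/topN/threads values are placeholders I made up
for illustration, not recommendations, and by default the script only prints
the commands rather than running them.

```shell
#!/bin/sh
# Sketch: one crawl per site, each with its own seed directory and settings.
# All directory names and numeric values below are illustrative assumptions.

NUTCH="${NUTCH:-bin/nutch}"   # path to the nutch launcher script
DRY_RUN="${DRY_RUN:-1}"       # 1 = only print the commands, 0 = run them

crawl_site() {
    # $1 = seed dir, $2 = output dir, $3 = depth, $4 = topN, $5 = threads
    cmd="$NUTCH crawl $1 -dir $2 -depth $3 -topN $4 -threads $5"
    if [ "$DRY_RUN" = "1" ]; then
        echo "$cmd"
    else
        mkdir -p logs
        # Keep one log file per site so the runs can be compared later.
        $cmd > "logs/$2.log" 2>&1
    fi
}

# Hypothetical per-site settings; tune these from what each crawl returns.
crawl_site urls-craigslist crawl-craigslist 3 1000 4
crawl_site urls-kbb        crawl-kbb        5 5000 4
```

Run it with DRY_RUN=0 once the values look sensible for each site. Keeping the
two crawls in separate output directories means one site can be tuned without
affecting the other.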

Thanks

[1] http://wiki.apache.org/nutch/OptimizingCrawls

On Sat, Jul 16, 2011 at 2:23 AM, Cheng Li <chen...@usc.edu> wrote:

> Hi,
>
> I have some questions about optimization.
>
>
> 1)
> For the command
>
> bin/nutch crawl url -dir mydir -depth 2 -threads 4 -topN 50 >&logs/logs1.log
>
> I know the meaning of each parameter, for example:
>
> -depth 8: the maximum depth of links crawled is 8 (8 levels down from
> the seed urls)
>
> -topN 50000: the maximum number of links/pages that can be crawled at each
> depth
> -threads 16: run 16 threads simultaneously
>
>
> But how do I choose the proper value for each parameter? For example, on
> the Craigslist web site, the usual url for a certain car looks like this:
> http://losangeles.craigslist.org/sgv/cto/2496560420.html
>
>
> But on Kbb.com, the usual url for a certain car looks like this:
>
> http://www.kbb.com/volkswagen/jetta/2003-volkswagen-jetta/gls-sedan-4d/?vehicleid=348329&intent=buy-used&options=4098815|true|4098881|true&pricetype=private-party&condition=good
>
>
> How do I determine the parameter values for these 2 examples?
>
>
>
> 2) When I check the data in the overview panel in Luke, I find that on the
> left side (the "available fields and term counts per field" table) the
> anchor count is zero, while the content count is not, and on the right side
> (the "top ranking terms" table) all the rank values are the same. I would
> like to know why the information is displayed like this.
>
>
> Thanks,
>
> --
> Cheng Li
>



-- 
*Lewis*
