Hi Cheng,

Please see this wiki page for some references on optimization [1].
I can see your problem though. I think a possible solution may be to have two seed directories, with a specifically tailored Nutch implementation ready to crawl both. This way we guarantee top results if we take sites on a case-by-case basis.

Please feel free to add any further comments to this wiki page based upon your personal experiences moving towards optimization.

Thanks

[1] http://wiki.apache.org/nutch/OptimizingCrawls

On Sat, Jul 16, 2011 at 2:23 AM, Cheng Li <chen...@usc.edu> wrote:

> Hi,
>
> I have some questions about optimization.
>
> 1) For the command
>
> bin/nutch crawl url -dir mydir -depth 2 -threads 4 -topN 50 >&logs/logs1.log
>
> I know the meaning of the parameters, say:
>
> -depth 8      the maximum depth of links crawled is 8 (8 levels down from the seed urls)
> -topN 50000   the maximum number of links/pages that can be crawled at each depth
> -threads 16   issue 16 threads simultaneously
>
> But how do I choose the proper number for each parameter? For example, on
> the Craigslist web site, the usual url for a certain car goes like this:
>
> http://losangeles.craigslist.org/sgv/cto/2496560420.html
>
> But on Kbb.com, the usual url for a certain car goes like this:
>
> http://www.kbb.com/volkswagen/jetta/2003-volkswagen-jetta/gls-sedan-4d/?vehicleid=348329&intent=buy-used&options=4098815|true|4098881|true&pricetype=private-party&condition=good
>
> How do I determine the parameter values for these 2 examples?
>
> 2) When I check the data in Luke in the overview panel, I found that on the
> left side (available fields and term counts per field table) the anchor
> number value is zero, while the content value is not, and on the right side
> (top ranking terms table) all the rank values are also the same. I want to
> know the reason that it displays the information like this.
>
> Thanks,
>
> --
> Cheng Li

--
*Lewis*