That clears up some of my problems. BTW, I am merging all the segments together. Thanks.

Dennis Kubes-2 wrote:
>
> I don't believe that you can set it for a single site. Setting the
> variable would affect all sites, but it should increase relevancy if you
> are doing web crawls and not constrained crawls. This is because Nutch
> by default scores internal and external links equally.
>
> Are you merging segments or merging segment indexes? Either way I don't
> think scores are recalculated. Scores are calculated primarily during the
> parse process (distributing score to outlinks) and during the update db
> process (gathering inlink score and updating crawldatum).
>
> Dennis Kubes
>
> karthik085 wrote:
>> Thanks for the quick reply.
>>
>> Is there any way I can set this score for one specific site? As I said
>> earlier, I crawl a handful of sites. One site has a lot of search results
>> because its pages have high scores (many internal links and possibly
>> anchor pollution). The other sites' pages do not have many incoming
>> internal links, and their anchor text is either useless or empty.
>>
>> At the end, I merge all the crawled segments into one for faster
>> searching. Won't the scores be recalculated here again? Setting the
>> db.score.link.internal variable would then affect all sites, won't it?
>>
>> Please correct me if I am wrong.
>>
>>
>> Dennis Kubes-2 wrote:
>>> Well, the short answer is it doesn't. Even if you set internal links to
>>> be ignored they are still calculated in the OPIC scoring and this
>>> negatively affects search relevancy. The way to handle this is to set
>>> the db.score.link.internal variable to 0.0. This way only external
>>> links are counted in OPIC.
>>>
>>> I will post a wiki entry about this process soon.
>>>
>>> Dennis Kubes
>>>
>>> karthik085 wrote:
>>>> Hi,
>>>>
>>>> I was wondering how db.ignore.internal.links affects rankings under
>>>> the PageRank and OPIC algorithms. I searched the forum but couldn't
>>>> get a clear-cut answer.
>>>>
>>>> I am using Nutch 0.7.2 to crawl & index a handful of sites. One site
>>>> has a lot of pages and interlinks; around 1/3 of my total pages are
>>>> from this site. Hence, when I search for something and hit 'Show All
>>>> Hits', the first 5-10 pages are from this site before any results
>>>> from other sites are shown.
>>>> How will db.ignore.internal.links help in this case?
>>>>
>>>> Of course, I will have to recrawl with nutch-0.9 to use the OPIC
>>>> algorithm...:-(
>>>>
>>>> Thanks.
>>>
>>
>

--
View this message in context: http://www.nabble.com/db.ignore.internal.links-and-ranking-algorithms-tf4767180.html#a13641033
Sent from the Nutch - Dev mailing list archive at Nabble.com.
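
To make the effect of db.score.link.internal described in the thread more concrete, here is a rough toy model in Python of the "distribute score to outlinks, then gather inlink score" idea. It is only a sketch under stated assumptions, not Nutch's actual code: the function `distribute_scores`, the `PAGES` data, the `external_factor` parameter, and the rule that "internal" means "same host" are all made up for illustration. Only the idea of an internal-link weight that can be set to 0.0 comes from the discussion.

```python
from urllib.parse import urlparse

# Hypothetical crawl: url -> (current score, list of outlink urls).
# big.example links heavily to itself; small.example has one page.
PAGES = {
    "http://big.example/a": (1.0, ["http://big.example/b", "http://small.example/x"]),
    "http://big.example/b": (1.0, ["http://big.example/a"]),
    "http://small.example/x": (1.0, ["http://big.example/a"]),
}


def distribute_scores(pages, internal_factor=1.0, external_factor=1.0):
    """Return url -> total score received from inlinks (toy model).

    internal_factor plays the role of db.score.link.internal from the
    thread; external_factor is an assumed counterpart for links that
    cross hosts.
    """
    received = {}
    for url, (score, outlinks) in pages.items():
        if not outlinks:
            continue
        # Each page splits its score evenly across its outlinks.
        share = score / len(outlinks)
        source_host = urlparse(url).netloc
        for target in outlinks:
            # Assumption: a link is "internal" when both ends share a host.
            is_internal = urlparse(target).netloc == source_host
            factor = internal_factor if is_internal else external_factor
            received[target] = received.get(target, 0.0) + share * factor
    return received


if __name__ == "__main__":
    print(distribute_scores(PAGES))                       # internal links count
    print(distribute_scores(PAGES, internal_factor=0.0))  # only external links count
```

In this toy data, the heavily interlinked site's page `a` receives 2.0 with the default weights but only 1.0 once the internal weight is 0.0, while the small site's page keeps its 0.5. As noted in the thread, the weight applies to every site at once, so there is no per-site variant in this sketch either.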
