That clears up some of my problems. BTW, I am merging all the segments together.
Thanks.


Dennis Kubes-2 wrote:
> 
> I don't believe that you can set it for a single site. Setting the 
> variable would affect all sites, but it should increase relevancy if you 
> are doing whole-web crawls rather than constrained crawls.  This is 
> because, by default, Nutch scores internal and external links equally.
> 
> Are you merging segments or merging segment indexes?  Either way I don't 
> think scores are recalculated.  Scores are calculated primarily during the 
> parse step (distributing score to outlinks) and during the update db 
> step (gathering inlink scores and updating the CrawlDatum).
> 
> Dennis Kubes
> 
> karthik085 wrote:
>> Thanks for the quick reply. 
>> 
>> Is there any way I can set this score for one specific site? As I said
>> earlier, I crawl a handful of sites. One site dominates the search
>> results because its pages have high scores (many internal links and
>> possibly anchor pollution). The other sites' pages do not have many
>> incoming internal links, and their anchor text is either useless or empty.
>> 
>> At the end, I merge all the crawled segments into one for faster
>> searching. Won't the scores be recalculated then? And setting the
>> db.score.link.internal variable would affect all sites, wouldn't it?
>> 
>> Please correct me if I am wrong.
>> 
>> 
>> Dennis Kubes-2 wrote:
>>> Well, the short answer is it doesn't.  Even if you set internal links to 
>>> be ignored they are still calculated in the OPIC scoring and this 
>>> negatively affects search relevancy.  The way to handle this is to set 
>>> the db.score.link.internal variable to 0.0.  This way only external 
>>> links are counted in OPIC.
>>>
>>> I will post a wiki entry about this process soon.
>>>
>>> Dennis Kubes
>>>
>>> karthik085 wrote:
>>>> Hi,
>>>>
>>>> I was wondering how db.ignore.internal.links affects rankings under the
>>>> PageRank and OPIC algorithms. I searched the forum but couldn't find a
>>>> clear-cut answer.
>>>>
>>>> I am using Nutch 0.7.2 to crawl and index a handful of sites. One site
>>>> has a lot of pages and interlinks; around 1/3 of my total pages are from
>>>> this site. Hence, when I search for something and hit 'Show All Hits',
>>>> the first 5-10 pages are from this site before any results from other
>>>> sites are shown. How will db.ignore.internal.links help in this case?
>>>>
>>>> Of course, I will have to recrawl with nutch-0.9 to use the OPIC
>>>> algorithm... :-(
>>>>
>>>> Thanks.
>>>
>> 
> 
> 
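
To make Dennis's description above concrete, here is a toy Python sketch of
one scoring round: each page splits its score over its outlinks (the parse
step) and each target adds up what it receives (the update db step). This is
not Nutch code. The function names, the example hosts, and the starting score
of 1.0 are all made up for illustration, and the two weights only stand in
for what db.score.link.internal and its external counterpart are described
as doing in this thread.

```python
from collections import defaultdict
from urllib.parse import urlparse


def host(url):
    """Return the host part of a URL, used to tell internal from external links."""
    return urlparse(url).netloc


def distribute(scores, outlinks, internal_weight=1.0, external_weight=1.0):
    """Run one toy scoring round (illustrative only, not Nutch's implementation).

    scores:   {url: current score}
    outlinks: {url: [target urls]}
    """
    received = defaultdict(float)
    for source, targets in outlinks.items():
        if not targets:
            continue
        # "Parse" step: split the page's score evenly over its outlinks.
        share = scores.get(source, 0.0) / len(targets)
        for target in targets:
            same_host = host(source) == host(target)
            weight = internal_weight if same_host else external_weight
            received[target] += share * weight
    # "Update db" step: new score = old score + score gathered from inlinks.
    urls = set(scores) | set(received)
    return {u: scores.get(u, 0.0) + received.get(u, 0.0) for u in urls}


# Hypothetical graph: big.example links mostly to itself, once to small.example.
links = {
    "http://big.example/a": ["http://big.example/b", "http://big.example/c"],
    "http://big.example/b": ["http://big.example/a", "http://big.example/c"],
    "http://big.example/c": ["http://big.example/a", "http://small.example/x"],
    "http://small.example/x": [],
}
start = {url: 1.0 for url in links}

equal = distribute(start, links)                               # both weights 1.0
external_only = distribute(start, links, internal_weight=0.0)  # internal links ignored

for url in sorted(links):
    print(url, equal[url], external_only[url])
```

With equal weights the heavily interlinked site boosts its own pages; with
the internal weight at 0.0 only the cross-site link passes any score. In this
toy model the setting applies to every site alike, which matches the point
that it cannot be set for a single site.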

-- 
View this message in context: 
http://www.nabble.com/db.ignore.internal.links-and-ranking-algorithms-tf4767180.html#a13641033
Sent from the Nutch - Dev mailing list archive at Nabble.com.
