[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541508 ]

Dennis Kubes commented on NUTCH-574:
------------------------------------

So I think what we are really saying is this: it would be good to make this a 
plugin.  We don't yet know the best way to score inlink anchor text, but it 
would be good to experiment and find out.  So I am going to make a generic 
plugin that turns anchor text indexing on and off.  I am also going to create 
a new extension point that allows pluggable scoring algorithms for indexed 
anchor text.  That way we can experiment.
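
As a rough illustration of the idea, here is a minimal, self-contained sketch 
of an on/off switch combined with a pluggable anchor scorer.  Nothing here is 
an existing Nutch API: AnchorScorer, AnchorIndexingSketch, the indexAnchors 
flag, and the TERM_OVERLAP scorer are hypothetical names chosen for this 
example, and the term-overlap rule is only one possible experiment, not a 
proposed final scoring method.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical extension point (not part of Nutch): assigns a weight to one
// inbound anchor for a page. A weight of 0 means "do not index this anchor".
interface AnchorScorer {
    float score(String pageText, String anchor);
}

// Hypothetical sketch of the filtering step (not the actual indexing filter).
final class AnchorIndexingSketch {

    // Example scorer: fraction of anchor terms that also occur in the page text.
    static final AnchorScorer TERM_OVERLAP = (pageText, anchor) -> {
        String trimmed = anchor.trim().toLowerCase();
        if (trimmed.isEmpty()) {
            return 0f;
        }
        Set<String> pageTerms =
            new HashSet<>(Arrays.asList(pageText.toLowerCase().split("\\s+")));
        String[] anchorTerms = trimmed.split("\\s+");
        int hits = 0;
        for (String term : anchorTerms) {
            if (pageTerms.contains(term)) {
                hits++;
            }
        }
        return (float) hits / anchorTerms.length;
    };

    private final boolean indexAnchors; // assumed on/off option
    private final AnchorScorer scorer;

    AnchorIndexingSketch(boolean indexAnchors, AnchorScorer scorer) {
        this.indexAnchors = indexAnchors;
        this.scorer = scorer;
    }

    // Returns anchor -> weight for the anchors that would be indexed.
    // Repeated anchors collapse into a single entry.
    Map<String, Float> anchorsToIndex(String pageText, List<String> inlinkAnchors) {
        Map<String, Float> selected = new LinkedHashMap<>();
        if (!indexAnchors) {
            return selected; // option off: no inbound anchor text is indexed
        }
        for (String anchor : inlinkAnchors) {
            float weight = scorer.score(pageText, anchor);
            if (weight > 0f) {
                selected.put(anchor, weight);
            }
        }
        return selected;
    }
}

class AnchorIndexingDemo {
    public static void main(String[] args) {
        String pageText = "Google Search I'm Feeling Lucky";
        List<String> anchors = Arrays.asList("dallas hotels", "google search");

        AnchorIndexingSketch off =
            new AnchorIndexingSketch(false, AnchorIndexingSketch.TERM_OVERLAP);
        AnchorIndexingSketch on =
            new AnchorIndexingSketch(true, AnchorIndexingSketch.TERM_OVERLAP);

        System.out.println("off: " + off.anchorsToIndex(pageText, anchors));
        System.out.println("on:  " + on.anchorsToIndex(pageText, anchors));
    }
}
```

With the option off, no anchors are selected.  With it on and this particular 
scorer, the "dallas hotels" anchor is dropped because neither term occurs in 
the page text, while "google search" is kept.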

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given 
> URL in the index.  This sometimes allows pages to show up in search results 
> where they may not be relevant.  An example of this is a search for "dallas 
> hotels" in our production index (www.visvo.com).  Google would show up first 
> in this example, although no text on the Google home page matches either 
> "dallas" or "hotels".  What is happening is that there are inlinks to 
> google.com whose anchor text contains the words "dallas" and "hotels", and 
> that anchor text gets included in the index for google.com.  Because 
> google.com has a very high boost due to its inlinks, it shows up first for 
> these search terms.  I propose we add an option to allow or prevent inlink 
> anchor text from being included in the index, and set the default for this 
> option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
