[ 
https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540877
 ] 

Chris A. Mattmann commented on NUTCH-574:
-----------------------------------------

IMHO what Dennis suggests is fine so long as it's a configurable option that 
doesn't change the default behavior of the system. That is, Dennis could make 
it something you can turn on or off in the nutch-default.xml file, commit the 
default as off (i.e., the way Nutch behaves now), and then set it to "on" in 
his own local environment and maintain that conf file locally. If so, it's 
probably something we should think about, since it supports a use case that 
Dennis has, and we don't want to shut anyone's use case out if it can be 
supported with a configurable option.

My +1 for the patch so long as the default doesn't change Nutch's existing 
behavior.
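
To make the "default off" idea concrete, here is a rough sketch of how such 
a switch could behave. It is not the attached patch and not Nutch's real 
indexing filter API: the class, the property name 
"indexer.exclude.inlink.anchors", and the plain Map standing in for the 
loaded configuration are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: NOT Nutch's actual indexing filter API.
class AnchorTextOption {
    // Hypothetical property name, invented for this sketch.
    static final String EXCLUDE_ANCHORS_KEY = "indexer.exclude.inlink.anchors";

    // Stands in for the value committed to nutch-default.xml: "off",
    // so an untouched install keeps indexing inlink anchor text.
    static final boolean DEFAULT_EXCLUDE_ANCHORS = false;

    // Reads the switch from a simple key/value map (an assumed stand-in
    // for the real configuration object); falls back to the default.
    static boolean excludeAnchors(Map<String, String> conf) {
        String value = conf.get(EXCLUDE_ANCHORS_KEY);
        if (value == null) {
            return DEFAULT_EXCLUDE_ANCHORS;
        }
        return Boolean.parseBoolean(value.trim());
    }

    // Returns the text that would be indexed for a page: always the page's
    // own text, plus inlink anchors unless the switch is turned on.
    static List<String> indexedText(Map<String, String> conf,
                                    String pageText,
                                    List<String> inlinkAnchors) {
        List<String> fields = new ArrayList<>();
        fields.add(pageText);
        if (!excludeAnchors(conf)) {
            fields.addAll(inlinkAnchors);
        }
        return fields;
    }
}
```

With an empty configuration the anchors are still indexed, which is the 
existing behavior. Only a local override that sets the property to "true" 
drops them.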

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given 
> URL in the index.  This sometimes allows pages to show up in search results 
> where they may not be relevant.  An example of this is a search for "dallas 
> hotels" in our production index (www.visvo.com).  Google shows up first in 
> this example, although no text on the Google home page matches either 
> "dallas" or "hotels".  What is happening is that inlinks into google.com 
> contain the words "dallas" and "hotels", and that anchor text gets included 
> in the index for google.com.  Because google.com has a very high boost due 
> to inlinks, it shows up first for these search terms.  I propose we add an 
> option to allow/prevent inlink anchor text from being included in the index 
> and set the default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
