[ 
https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541653
 ] 

Andrzej Bialecki  commented on NUTCH-574:
-----------------------------------------

I don't rule it out - I support the patch as is, i.e. separating the anchor 
indexing from index-basic. My point was that anchor text is a complicated 
issue, and how you use anchor text depends on your requirements - in other 
words, I think it may be difficult to find a more advanced solution that would 
satisfy most users.

Some comments to the latest patch:

* I think it would be good to put a NOTE: in CHANGES.txt that reminds users who 
wish to keep the current behavior that they should make sure that 
plugin.includes in their nutch-default.xml / nutch-site.xml contains this 
plugin.

* There are literal tab characters in plugin/build.xml - they should be 
converted to spaces.
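
To illustrate the first point, the override in nutch-site.xml could look 
roughly like the sketch below. The plugin id "index-anchor" is an assumption 
about how the patch names the new plugin, and the rest of the value is only an 
illustrative plugin list, not the shipped default - the one part that matters 
is that the anchor plugin is matched by the plugin.includes regex.

```xml
<!-- nutch-site.xml: sketch only; "index-anchor" is an assumed plugin id -->
<property>
  <name>plugin.includes</name>
  <!-- illustrative list; index-(basic|anchor) keeps anchor text in the index -->
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|query-(basic|site|url)</value>
</property>
```

Users who want the new behavior would presumably just leave the anchor plugin 
out of this value.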

Other than that I think the patch can be applied as is, and we should continue 
the discussion :)

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch, NUTCH-574-2.patch, NUTCH-574-3.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given 
> URL in the index.  This sometimes allows pages to show up in search results 
> where they may not be relevant.  An example of this is a search for "dallas 
> hotels" in our production index (www.visvo.com).  Google would show up first 
> in this example although there is no text matching either dallas or hotels on 
> the google home page.  What is happening here is there are inlinks into 
> google with the words dallas and hotels which get included in the index for 
> google.com and because google would have a very high boost due to inlinks, 
> google shows up first for these search terms.  I propose we add an option to 
> allow/prevent inlink anchor text from being included in the index and set the 
> default for this option to NOT include inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
