[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541653 ]
Andrzej Bialecki commented on NUTCH-574:
-----------------------------------------

I don't rule it out - I support the patch as is, i.e. separating the anchor indexing from index-basic. My point was that anchor text is a complicated issue, and how you use anchor text depends on your requirements - in other words, I think it may be difficult to find a more advanced solution that would satisfy most users.

Some comments to the latest patch:

* I think it would be good to put a NOTE: in CHANGES.txt that reminds users who wish to keep the current behavior that they should make sure that their nutch-default.xml / nutch-site.xml contains this plugin in plugin.includes.

* There are literal Tab characters in plugin/build.xml - they should be converted to spaces.

Other than that I think the patch can be applied as is, and we should continue the discussion :)

> Including inlink anchor text in index can create irrelevant search results.
> ---------------------------------------------------------------------------
>
>                 Key: NUTCH-574
>                 URL: https://issues.apache.org/jira/browse/NUTCH-574
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: All, basic indexing filter
>            Reporter: Dennis Kubes
>            Assignee: Dennis Kubes
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-574-1.patch, NUTCH-574-2.patch, NUTCH-574-3.patch
>
>
> Currently the basic indexing filter includes inbound anchor text for a given
> URL in the index. This sometimes allows pages to show up in search results
> where they may not be relevant. An example of this is a search for "dallas
> hotels" in our production index (www.visvo.com). Google would show up first
> in this example although there is no text matching either dallas or hotels on
> the google home page. What is happening here is there are inlinks into
> google with the words dallas and hotels which get included in the index for
> google.com and because google would have a very high boost due to inlinks,
> google shows up first for these search terms.
> I propose we add an option to allow/prevent inlink anchor text from being
> included in the index and set the default for this option to NOT include
> inbound link anchor text.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.