+1 on making it into a plugin. Echoing Chris & Andrzej's points -- if Dennis wants to try a novel treatment of inlink text, why not give him a way to do so, so long as the current strategy remains the default?

With luck, experimentation will lead to a better default strategy over time.

--matt

On Nov 9, 2007, at 3:25 PM, Andrzej Bialecki (JIRA) wrote:


[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541428 ]

Andrzej Bialecki commented on NUTCH-574:
-----------------------------------------

+1 on making it into a plugin (e.g. index-anchors). -1 on implementing any sort of filtering - as Enis pointed out, the issue is complicated in itself, and additionally depends on the user requirements. I propose the following: let's implement a basic version (which is implemented now in the form of LinkDb.getAnchors()), and leave users the freedom to complicate away if they wish to do so.

Re: scoring - this is again tricky, because the top-N most frequent words happen to be stopwords, and if that's the case you need to know the language of the corpus in order to properly detect them and remove them from the top-N ... very messy.
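
To illustrate the stopword concern, here is a minimal Python sketch (not Nutch code). The function name, the sample anchors, and the tiny English stopword set are all made up for illustration. It counts terms across inlink anchors and shows that the most frequent ones can be stopwords, which can only be removed if a stopword list for the right language is available.

```python
from collections import Counter

# Hypothetical, deliberately tiny English stopword list (an assumption;
# a real list would be language-specific and much longer).
ENGLISH_STOPWORDS = {"the", "in", "of", "to", "a", "for"}


def top_n_anchor_terms(anchors, n):
    """Return the n most frequent lowercase terms across all anchor strings."""
    counts = Counter()
    for anchor in anchors:
        counts.update(anchor.lower().split())
    return [term for term, _ in counts.most_common(n)]


# Made-up inlink anchors for a single page.
anchors = [
    "the best hotels in dallas",
    "hotels in the dallas area",
    "search in the web",
    "the search engine",
]

# The two most frequent terms are both stopwords.
top_two = top_n_anchor_terms(anchors, 2)

# Filtering only works because we assumed the corpus is English.
top_five_filtered = [
    term for term in top_n_anchor_terms(anchors, 5)
    if term not in ENGLISH_STOPWORDS
]
```

With a corpus in an unknown or mixed language there is no single stopword set to filter against, which is the messy part.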

Including inlink anchor text in index can create irrelevant search results.
---------------------------------------------------------------------------

                Key: NUTCH-574
                URL: https://issues.apache.org/jira/browse/NUTCH-574
            Project: Nutch
         Issue Type: Bug
         Components: indexer
        Environment: All, basic indexing filter
           Reporter: Dennis Kubes
           Assignee: Dennis Kubes
            Fix For: 1.0.0

        Attachments: NUTCH-574-1.patch


Currently the basic indexing filter includes inbound anchor text for a given URL in the index. This sometimes allows pages to show up in search results where they may not be relevant. An example of this is a search for "dallas hotels" in our production index (www.visvo.com). Google shows up first in this example, although no text on the Google home page matches either "dallas" or "hotels". What is happening is that there are inlinks into Google containing the words "dallas" and "hotels", which get included in the index for google.com, and because Google has a very high boost due to inlinks, it shows up first for these search terms. I propose we add an option to allow/prevent inlink anchor text from being included in the index, and set the default for this option to NOT include inbound link anchor text.
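
The effect described above can be reproduced with a small Python sketch (not Nutch code). The function names, the `include_anchors` flag, and the sample page text are hypothetical. The flag defaults to including anchors, which mirrors the current behavior rather than the default proposed in this issue.

```python
def build_index_fields(page_text, inlink_anchors, include_anchors=True):
    """Build the indexed fields for one page.

    include_anchors is a hypothetical option name; True mirrors the
    current behavior of indexing inbound anchor text.
    """
    fields = {"content": [page_text]}
    if include_anchors:
        fields["anchor"] = list(inlink_anchors)
    return fields


def matches(fields, query):
    """Return True if every query term occurs in some indexed field."""
    indexed_terms = set()
    for texts in fields.values():
        for text in texts:
            indexed_terms.update(text.lower().split())
    return all(term in indexed_terms for term in query.lower().split())


# Made-up home page text that mentions neither "dallas" nor "hotels".
home_page = "Google Search I'm Feeling Lucky"
# Made-up anchor text from pages that link to it.
inlinks = ["dallas hotels", "search engine"]

with_anchors = build_index_fields(home_page, inlinks, include_anchors=True)
without_anchors = build_index_fields(home_page, inlinks, include_anchors=False)

hit_with_anchors = matches(with_anchors, "dallas hotels")
hit_without_anchors = matches(without_anchors, "dallas hotels")
```

With anchors indexed, the page matches "dallas hotels" purely through inlink text, and a high link-based boost would then push it up the ranking. With the option off, it no longer matches at all.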

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


--
Matt Kangas / [EMAIL PROTECTED]

