+1 on making it into a plugin. Echoing Chris & Andrzej's points -- if
Dennis wants to try a novel treatment of inlink text, why not give
him a way to do so, so long as the current strategy remains the default?
With luck, experimentation will lead to a better default strategy
over time.
--matt
On Nov 9, 2007, at 3:25 PM, Andrzej Bialecki (JIRA) wrote:
[ https://issues.apache.org/jira/browse/NUTCH-574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12541428 ]
Andrzej Bialecki commented on NUTCH-574:
-----------------------------------------
+1 on making it into a plugin (e.g. index-anchors). -1 on
implementing any sort of filtering - as Enis pointed out, the issue
is complicated in itself, and additionally depends on the user
requirements. I propose the following: let's implement a basic
version (which is implemented now in the form of
LinkDb.getAnchors()), and leave users the freedom to complicate
away if they wish to do so.
Re: scoring - this is again tricky, because the top-N most frequent
words happen to be stopwords, and if that's the case you need to
know the language of the corpus in order to properly detect them
and remove them from the top-N ... very messy.
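
To make the stopword problem concrete, here is a minimal sketch in plain Java. It is not Nutch code: the class name AnchorTopTerms, the method topTerms, and the tiny English stopword list are all assumptions made for illustration. It counts terms across a set of anchor texts and returns the N most frequent, optionally skipping a caller-supplied stopword set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Nutch.
public class AnchorTopTerms {
    // Illustrative English stopwords. A usable list depends on the corpus language.
    static final Set<String> ENGLISH_STOPWORDS =
        Set.of("the", "a", "in", "of", "to", "for", "and");

    // Returns up to n terms ordered by descending frequency, ties broken alphabetically.
    static List<String> topTerms(List<String> anchors, int n, Set<String> stopwords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String anchor : anchors) {
            for (String token : anchor.toLowerCase(Locale.ROOT).split("\\W+")) {
                if (token.isEmpty() || stopwords.contains(token)) {
                    continue; // skip empty tokens and anything in the stopword set
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        List<String> terms = new ArrayList<>(counts.keySet());
        terms.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        return new ArrayList<>(terms.subList(0, Math.min(n, terms.size())));
    }
}
```

With the anchors "the best hotels in dallas", "hotels in the city", "in the news" and "the hotels in dallas", calling topTerms with an empty stopword set and n = 2 returns [in, the], while passing ENGLISH_STOPWORDS returns [hotels, dallas]. The second result is only as good as the stopword list, which is exactly the language dependency described above.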
Including inlink anchor text in index can create irrelevant search
results.
---------------------------------------------------------------------------
Key: NUTCH-574
URL: https://issues.apache.org/jira/browse/NUTCH-574
Project: Nutch
Issue Type: Bug
Components: indexer
Environment: All, basic indexing filter
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0
Attachments: NUTCH-574-1.patch
Currently the basic indexing filter includes inbound anchor text
for a given URL in the index. This sometimes allows pages to show
up in search results where they may not be relevant. An example
of this is a search for "dallas hotels" in our production index
(www.visvo.com). Google would show up first in this example
although there is no text matching either dallas or hotels on the
google home page. What is happening here is there are inlinks
into google with the words dallas and hotels which get included in
the index for google.com and because google would have a very high
boost due to inlinks, google shows up first for these search
terms. I propose we add an option to allow/prevent inlink anchor
text from being included in the index and set the default for this
option to NOT include inbound link anchor text.
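
As a rough sketch of what such an option could look like, the plain-Java example below gates the anchor field on a boolean setting. It does not use the Nutch API: the class AnchorIndexingOption, the method buildFields, the field names, and the property key "example.index.anchors.include" are all hypothetical. The default of false follows the proposal in this description; a plugin that keeps the current behavior as the default would flip it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch; not the actual Nutch indexing filter.
public class AnchorIndexingOption {
    // Assumed property name for illustration, not a real Nutch setting.
    static final String INCLUDE_ANCHORS_KEY = "example.index.anchors.include";

    // Builds the fields to index for one page; anchors are added only when enabled.
    static Map<String, List<String>> buildFields(
            String pageText, List<String> inlinkAnchors, Properties conf) {
        // Default "false" mirrors the default proposed in the issue description.
        boolean includeAnchors =
            Boolean.parseBoolean(conf.getProperty(INCLUDE_ANCHORS_KEY, "false"));
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("content", List.of(pageText));
        if (includeAnchors) {
            fields.put("anchor", new ArrayList<>(inlinkAnchors));
        }
        return fields;
    }
}
```

With an empty Properties object the returned map holds only the "content" field, so inlink words such as "dallas" and "hotels" never reach the index entry for a page that does not contain them. Setting the property to "true" restores the "anchor" field.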
--
Matt Kangas / [EMAIL PROTECTED]