It seems to me that there are two separate problems:

1) content parsing to avoid site structure -> influences the index and rankings
2) content parsing for KWIC snippet generation -> influences the user perception of the engine.

I'd agree that (2) is quite important for the end user; Richard's continuous text heuristic may actually work for that. I'd extend the meaning of "continuous block" to ignore inline tags such as SPAN, I, B, TT, etc., so that only certain tags would actually break the content into chunks. Snippets would then be generated from these chunks alone, ignoring the rest of the content. If this heuristic is applied only at snippet-generation time, then Andrzej's concern about missing content is no longer relevant. Of course, I realize this is tricky in the current architecture, because different filters would be used for KWICs and for indexing...
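
To make the chunking idea concrete, here is a rough sketch in Python (standard library only, not Nutch code). The names INLINE_TAGS, ChunkExtractor, continuous_chunks and the min_words threshold are all made up for illustration; which tags count as inline and how long a chunk must be are assumptions that would need tuning.

```python
from html.parser import HTMLParser

# Assumption: tags that do NOT break a continuous block of text.
# Every other tag (P, DIV, LI, TD, A, ...) ends the current chunk.
INLINE_TAGS = {"span", "i", "b", "tt", "em", "strong"}


class ChunkExtractor(HTMLParser):
    """Collects text into chunks, splitting at every non-inline tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._buf = []

    def _flush(self):
        # Join buffered text, normalise whitespace, keep non-empty chunks.
        text = " ".join("".join(self._buf).split())
        if text:
            self.chunks.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buf.append(data)

    def close(self):
        super().close()
        self._flush()


def continuous_chunks(html, min_words=5):
    """Return chunks with at least min_words words (hypothetical threshold)."""
    parser = ChunkExtractor()
    parser.feed(html)
    parser.close()
    return [c for c in parser.chunks if len(c.split()) >= min_words]
```

A snippet generator would then search only the returned chunks for query terms, while the indexer could keep using the full page text. The sketch does not skip SCRIPT or STYLE content, which a real filter would have to do.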

D.



Jérôme Charron wrote:
I think algorithm #1 is what Google uses.
Google ignores content that does not change from page to page, as well
as content that isn't part of a block of text.

Are you sure?
Take a look at these search results:
http://www.google.com/search?hl=en&hs=otT&lr=&c2coff=1&safe=off&client=firefox-a&rls=org.mozilla:en-US:official&pwst=1&q=+site:gamingalmanac.com+global+gaming+almanac
... and you will notice that menus are indexed by google and displayed in
summaries.

But if you can contribute an HtmlParseFilter with the ability to remove menus and
navigation, it will be a real improvement.
A first step, which I developed in a previous project many years ago, is
to remove pages that contain textual content only in links: it avoids
indexing frames or iframes that only contain some navigation text...
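
A minimal sketch of that first step, in Python with the standard library only (this is not the code from the earlier project, and not a Nutch HtmlParseFilter). The names LinkTextCounter, is_navigation_only and the max_other_chars tolerance are hypothetical.

```python
from html.parser import HTMLParser


class LinkTextCounter(HTMLParser):
    """Counts non-whitespace characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._anchor_depth = 0
        self.link_chars = 0
        self.other_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._anchor_depth > 0:
            self._anchor_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace so that spacing between links is not counted.
        n = len("".join(data.split()))
        if self._anchor_depth > 0:
            self.link_chars += n
        else:
            self.other_chars += n


def is_navigation_only(html, max_other_chars=0):
    """True if the page has link text and (almost) no text outside links.

    max_other_chars is an assumed tolerance for separators such as "|".
    """
    counter = LinkTextCounter()
    counter.feed(html)
    counter.close()
    return counter.link_chars > 0 and counter.other_chars <= max_other_chars
```

Pages for which is_navigation_only returns True would be dropped before indexing. Note that text in TITLE counts as non-link text here, so a real filter would probably look at the BODY only.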

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/



_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
