Hi!
[Geoff Hutchison]
> At 7:52 PM +0100 3/23/01, Jochen Eisinger wrote:
> >The "base" approach is the following:
> > - all words of a document are taken
> > - words in a "stop list" of general words are ignored
> > - the roots of the words are determined (similar to htfuzzy word2root)
> >
> >Now, I'm looking for ways to improve this (i.e. to reduce the size of the
> >list without losing much information).
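
(Aside, for reference: the base pass might look roughly like the sketch
below. Names like stem() and base_pass() are made up here, this is not
the actual htdig code.)

#include <set>
#include <string>
#include <vector>

// Placeholder for a real stemmer (htfuzzy's word2root or similar).
std::string stem(const std::string &word) { return word; }

// Tokenized document in, list of roots out: drop stop words, stem the rest.
std::vector<std::string> base_pass(const std::vector<std::string> &words,
                                   const std::set<std::string> &stopList)
{
    std::vector<std::string> roots;
    for (const auto &w : words) {
        if (stopList.count(w))        // general word, ignore
            continue;
        roots.push_back(stem(w));     // keep only the root
    }
    return roots;
}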
>
> I would generally take a look at word frequencies from the resulting
> list and toss out very frequent ones since they give very little
> information. For example, on the htdig.org site (counting mailing
> list archives), the words "thread," "subject," "message," etc. are
> all too common to be at all useful. But they may not be in the stop
> list based on a priori knowledge.
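That makes sense. As a rough sketch of such a frequency cut (the 1%
threshold is made up, not measured on anything):

#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Drop words whose relative frequency exceeds maxRelFreq; such words
// ("thread", "subject", "message", ...) carry almost no information.
std::vector<std::string> prune_frequent(const std::vector<std::string> &words,
                                        double maxRelFreq = 0.01)
{
    std::map<std::string, std::size_t> counts;
    for (const auto &w : words)
        ++counts[w];

    std::vector<std::string> kept;
    for (const auto &w : words)
        if (counts[w] < maxRelFreq * words.size())
            kept.push_back(w);
    return kept;
}
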
Would it be useful to sort out words that are contained in others? I.e. if you get
words like "gov", "govern", "government", one would just store "government", and maybe
the information that parts of this word are also contained in the document.
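Something like this brute-force sketch, just to illustrate (keep the
longest words and drop anything contained in one already kept):

#include <algorithm>
#include <string>
#include <vector>

// Keep only words not contained in a longer word already kept, so
// "gov"/"govern"/"government" collapse to just "government".
std::vector<std::string> drop_contained(std::vector<std::string> words)
{
    // Longest first, so a container is always seen before its parts.
    std::sort(words.begin(), words.end(),
              [](const std::string &a, const std::string &b) {
                  return a.size() > b.size();
              });
    std::vector<std::string> kept;
    for (const auto &w : words) {
        bool contained = std::any_of(kept.begin(), kept.end(),
                                     [&](const std::string &k) {
                                         return k.find(w) != std::string::npos;
                                     });
        if (!contained)
            kept.push_back(w);
    }
    return kept;
}

O(n^2), but the word lists should be small enough after the other passes.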
regards
-- jochen
--
If time heals all wounds, how come the belly button stays the same?
[This is a signature virus, please copy me into your signature file!]