Chris, Ahmet - thanks for the responses.

Ahmet - yes, I want to see "run" as a top term, plus the original words that
formed that term.
The reason is that, due to mis-stemming, the terms can become non-English.
ex:  "permanent" would stem to "perman", "archive" would become "archiv".

I need to extract a set of keywords from the indexed content - I'd like
these to be correct, full English words.
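
To make the idea concrete, here is a rough sketch of what I have in mind,
outside of Solr: count terms by stem, but keep track of the original words
behind each stem so the displayed keyword is always a real word. Everything
here is an assumption for illustration. `toy_stem` and its lookup table are a
hypothetical stand-in for whatever stemmer the text field actually uses, and
`top_terms` is just a name I made up.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the index-time stemmer. A real setup would call
# the same stemmer that the stemmed "text" field uses.
TOY_STEMS = {
    "running": "run", "runner": "run", "runners": "run",
    "permanent": "perman", "archive": "archiv",
}


def toy_stem(word):
    """Return the stem from the toy table, or the word itself if unknown."""
    return TOY_STEMS.get(word, word)


def top_terms(tokens, stem=toy_stem, n=10):
    """Rank stems by total count, labelling each with an original word.

    The label is the stem itself if it occurred as an original word
    (e.g. "run"); otherwise it is the most frequent original word that
    produced the stem (e.g. "permanent" for "perman").
    """
    stem_counts = Counter()
    originals = defaultdict(Counter)  # stem -> Counter of original words
    for token in tokens:
        word = token.lower()
        s = stem(word)
        stem_counts[s] += 1
        originals[s][word] += 1

    ranked = []
    for s, total in stem_counts.most_common(n):
        forms = originals[s]
        label = s if s in forms else forms.most_common(1)[0][0]
        ranked.append((label, total, dict(forms)))
    return ranked


# The example from this thread, plus "weather" and one mis-stemmed word.
tokens = (["run"] * 10 + ["running"] * 20 + ["runner"] * 2 + ["runners"] * 8
          + ["weather"] * 21 + ["permanent"] * 3)

for label, total, forms in top_terms(tokens):
    print(label, total, forms)
```

With this, "run" ranks above "weather" because the counts are aggregated by
stem, and "perman" is never shown because the label falls back to an original
word. Doing the same against the index would still need a way to get the
stem-to-original mapping, which is the part I am not sure how to do.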

thanks,
thushara

On Fri, Jan 23, 2009 at 2:12 PM, AHMET ARSLAN <iori...@yahoo.com> wrote:

> I didn't understand what exactly you want.
>
> if a document has run(10), running(20), runner(2), runners(8):
> (assuming stemmer reduces all those words to run)
> with non-stemmed you will see:
> running(20)
> run(10)
> runners(8)
> runner(2)
>
> with stemmed you will see:
> run(40)
>
> You want to see run as a top term but also you want to see the original
> words that formed that term?
> run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner
>
> Or do you want to see most frequent terms that passed through stem filter
> verbatim? (terms that stemmer didn't change/modify)
>
> What do you mean by saying "badly stemmed" word?
>
>
> > hi Ahmet,
> >
> > thanks. when i look at the non_stemmed_text field to get the top terms, i
> > will not be getting the useful feature of aggregating many related words
> > into one (which is done by stemming).
> >
> > for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> > would like the "top term" to be "run" here. i think with the
> > non-stemmed solution, i will see run, running, runner, runners as separate
> > top terms, so if the term "weather" happens to occur 21 times in the
> > document, it will replace any version of "run" as the top term.
> >
> > of course i could go back to the text field for top terms, where i will see
> > "run", but some of the terms in the text field will be non-english (stemmed
> > beyond english, ex: archiv, perman). so how can i tell if a term i see in
> > the text field is a "badly stemmed" word or not?
> >
> > maybe at this point i could use a dictionary? if a term in the text field
> > is not in the dictionary, i would try to find a prefix match from the
> > non-stemmed field? or maybe there's a better way?
> >
> > thanks,
> > thushara
