Chris, Ahmet - thanks for the responses.

Ahmet - yes, i want to see "run" as a top term + the original words that formed that term. The reason is that due to mis-stemming, the terms could become non-English. ex: "permanent" would stem to "perman", "archive" would become "archiv".
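To make the request concrete, here is a minimal sketch of that breakdown in Python, using the run/running/runner/runners counts from Ahmet's example. It uses plain collections only and makes no Solr or Lucene calls. `toy_stem` and `top_terms_with_origins` are hypothetical names, and the stemmer is just a lookup table for this example, not the stemmer the index actually uses.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the real stemmer: a fixed lookup table that only
# covers the words in this example. A real setup would have to apply the same
# stemmer that the index uses.
TOY_STEMS = {
    "run": "run", "running": "run", "runner": "run", "runners": "run",
    "weather": "weather",
}


def toy_stem(word):
    """Return the stem for a word, or the word itself if it is unknown."""
    return TOY_STEMS.get(word, word)


def top_terms_with_origins(term_counts, stem=toy_stem):
    """Group non-stemmed term counts by stem, keeping the per-word breakdown.

    Returns a list of (stem, total_count, [(original_word, count), ...])
    sorted by total_count, highest first.
    """
    origins = defaultdict(Counter)
    for word, count in term_counts.items():
        origins[stem(word)][word] += count
    ranked = sorted(
        origins.items(), key=lambda kv: sum(kv[1].values()), reverse=True
    )
    return [(s, sum(c.values()), c.most_common()) for s, c in ranked]


# Counts as they would come from the non-stemmed field.
counts = {"run": 10, "running": 20, "runner": 2, "runners": 8, "weather": 21}
for stem_, total, breakdown in top_terms_with_origins(counts):
    print(f"{stem_}({total}) => {breakdown}")
```

With these counts "run" (40) ranks above "weather" (21), and the breakdown still carries the full English words that fed into it.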
I need to extract a set of keywords from the indexed content - I'd like these to be correct full English words.

thanks,
thushara

On Fri, Jan 23, 2009 at 2:12 PM, AHMET ARSLAN <iori...@yahoo.com> wrote:

> I didn't understand what exactly you want.
>
> if a document has run(10), running(20), runner(2), runners(8):
> (assuming stemmer reduces all those words to run)
> with non-stemmed you will see:
> running(20)
> run(10)
> runners(8)
> runner(2)
>
> with stemmed you will see:
> run(40)
>
> You want to see run as a top term but also you want to see the original
> words that formed that term?
> run(40) => 20 from running, 10 from run, 8 from runners, 2 from runner
>
> Or do you want to see most frequent terms that passed through stem filter
> verbatim? (terms that stemmer didn't change/modify)
>
> What do you mean by saying "badly stemmed" word?
>
> > hi Ahmet,
> >
> > thanks. when i look at the non_stemmed_text field to get the top terms, i
> > will not be getting the useful feature of aggregating many related words
> > into one (which is done by stemming).
> >
> > for ex: if a document has run(10), running(20), runner(2), runners(8) - i
> > would like to see a "top term" to be "run" here. i think with the
> > non-stemmed solution, i will see run, running, runner, runners as separate
> > top terms so if the term "weather" happens to occur 21 times in the
> > document, it will replace any version of "run" as the top term.
> >
> > of course i could go back to the text field for top terms where i will see
> > "run", but some of the terms in the text field will be non-english (stemmed
> > beyond english, ex: archiv, perman). so how can i tell if a term i see in
> > the text field is a "badly stemmed" word or not?
> >
> > maybe at this point i could use a dictionary? if a term in the text field is
> > not in the dictionary, i would try to find a prefix match from the
> > non-stemmed field? or maybe there's a better way?
> >
> > thanks,
> > thushara
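
The dictionary idea at the end of the quoted message could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `readable_keyword` is a hypothetical helper, the word list and counts are made up, and the stems are assumed to come from a Porter-style stemmer.

```python
def readable_keyword(stem, dictionary, non_stemmed_counts):
    """Pick an English word to display for a stemmed term.

    If the stem itself is in the dictionary, keep it. Otherwise fall back to
    the most frequent non-stemmed term that starts with the stem. Returns
    None when nothing matches.
    """
    if stem in dictionary:
        return stem
    candidates = [
        (count, word)
        for word, count in non_stemmed_counts.items()
        if word.startswith(stem)
    ]
    if not candidates:
        return None
    # Highest count wins; ties go to the shorter word, then alphabetical order.
    _, word = min(candidates, key=lambda cw: (-cw[0], len(cw[1]), cw[1]))
    return word


# Hypothetical data: a tiny word list and made-up non-stemmed field counts.
dictionary = {"run", "weather", "archive", "permanent"}
non_stemmed = {
    "archive": 5, "archives": 3, "archived": 1, "permanent": 4, "run": 10,
}

print(readable_keyword("run", dictionary, non_stemmed))     # run
print(readable_keyword("archiv", dictionary, non_stemmed))  # archive
print(readable_keyword("perman", dictionary, non_stemmed))  # permanent
```

One caveat with the prefix match: Porter-style stemmers sometimes rewrite the end of a word instead of only cutting it (for example "happy" becomes "happi"), so the stem is not always a prefix of the original word. Recording the stem-to-original mapping at analysis time avoids that guesswork.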