Hi David,
you might wish to discard the 1000 most frequently used words from your
list:
English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
German: http://german.about.com/library/blwfreq01.htm
Another approach is statistical: take the whole text and sort the words by
their frequency (count) of appearance. If you put them on a graph you will
notice the characteristic 'power law' distribution. Set an absolute or
relative frequency value at which to cut off the head of the list; the tail
is what holds all the rare or interesting words of the text. For example,
if the text is large you might discard the first 500-1000 words of the
list sorted by word count. All the words that remain should be the more or
less interesting ones.
The easy way to produce such a frequency list is by using an array. The
principle is like this:

local arrayWords
-- count every word: each key becomes a word,
-- each value that word's count
repeat for each word myWord in theText
   add 1 to arrayWords[myWord]
end repeat

Afterwards the keys of arrayWords are the words and the values are the
word counts.
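To get from that array to a cut-down list, one possibility is to flatten
the array with combine and sort the result. This is only a sketch - the
cut-off of 500 and the variable names are just examples, and theText is
assumed to hold your source text:

local tCounts
-- count each word: keys are words, values are counts
repeat for each word tWord in theText
   add 1 to tCounts[tWord]
end repeat
-- flatten the array into lines of "word <tab> count"
combine tCounts using return and tab
-- sort so the most frequent words come first
set the itemDelimiter to tab
sort lines of tCounts descending numeric by item 2 of each
-- drop the head (the common words); the tail that remains
-- is the list of rarer, more interesting words
delete line 1 to 500 of tCounts

You would then read item 1 of each remaining line to get the words
themselves.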
Best wishes
Viktoras
David Bovill wrote:
Is there a resource/index that anyone knows of for plain, uninteresting,
dull words? I want to take arbitrary chunks of text and search for
"interesting" words - that is, domain-specific words that might be useful
as links to create dictionary entries. This would mean creating a list of
words and stripping "the", "it", etc. I am imagining it working like a
spelling dictionary with the ability to manually edit entries - but I'd
like a good starting list. Not sure what to search for :)
_______________________________________________
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution