On Jul 31, 2008, at 2:12 AM, viktoras didziulis wrote:

Hi David,

you might wish to discard the 1000 most frequently used words from your
list:
English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
German: http://german.about.com/library/blwfreq01.htm

Another approach is statistical - take the whole text and sort its words by
their frequency (count) of appearance. If you put them on a graph you will notice the characteristic 'power law' distribution. Set the absolute or relative frequency (or count) at which to cut off the tail.
This tail is what holds all the rare or interesting words of the text.
For example, if the text is large you might discard the first 500-1000
words in the list sorted by word count. All the words that remain should be
the more or less interesting ones.
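
In Revolution terms the cut-off might look roughly like this, assuming tFreqList holds one line per word as "word <tab> count", already sorted with the most frequent words first (the 500 is just an illustrative cutoff, and the variable name is mine):

delete line 1 to 500 of tFreqList -- drop the frequent head of the power-law curve
-- what remains in tFreqList are the rarer, more interesting words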

The easy way to produce such a frequency list is with arrays. The
principle is like this:

local arrayWords
repeat for each word myWord in theText
  add 1 to arrayWords[myWord] -- count one more occurrence of this word
end repeat

Now the keys of arrayWords are the words and the values are their counts.
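
From there, one rough way to build the sorted "word <tab> count" list used in the cut-off sketch above (the combine step and the name tFreqList are assumptions of mine, not part of the snippet above):

put arrayWords into tFreqList -- work on a copy so arrayWords stays an array
combine tFreqList using return and tab -- one line per word: word <tab> count
sort lines of tFreqList descending numeric by word 2 of each -- most frequent words first

After that, the cut-off from the earlier sketch can be applied to tFreqList directly.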

Slick, and so simple. This is going into my script library. Thanks, Viktoras!

Regards,

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University
