Re: Semi-automatic Index generation?
On Jul 31, 2008, at 2:12 AM, viktoras didziulis wrote:

> Hi David,
>
> you might wish to discard the 1000 most frequently used words from your
> list:
> English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
> German: http://german.about.com/library/blwfreq01.htm
>
> Another approach is statistical: take the whole text and sort its words by
> their frequency (count) of appearance in the text. If you plot them on a
> graph you will notice a characteristic 'power law' distribution. Choose the
> absolute or relative frequency (count) at which to cut off the tail. This
> tail is what holds all the rare or interesting words of the text. For
> example, if the text is large you may discard the first 500-1000 words in
> the list sorted by descending word count. All the words that remain should
> be the ones that are more or less interesting.
>
> The easy way to produce such a frequency list is by using arrays. The
> principle is like this:
>
> local arrayWords
> repeat for each word myWord in theText
>   add 1 to arrayWords[myWord]
> end repeat
>
> Now the keys of arrayWords are the words and the values are the word
> counts.

Slick, and so simple. This is going into my script library. Thanks, Viktoras!

Regards,

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University

___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution
Re: Semi-automatic Index generation?
Thanks for the tips!

2008/7/31 viktoras didziulis <[EMAIL PROTECTED]>

> Hi David,
>
> you might wish to discard the 1000 most frequently used words from your
> list:
> English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
> German: http://german.about.com/library/blwfreq01.htm
>
> Another approach is statistical: take the whole text and sort its words by
> their frequency (count) of appearance in the text. If you plot them on a
> graph you will notice a characteristic 'power law' distribution. Choose the
> absolute or relative frequency (count) at which to cut off the tail. This
> tail is what holds all the rare or interesting words of the text. For
> example, if the text is large you may discard the first 500-1000 words in
> the list sorted by descending word count. All the words that remain should
> be the ones that are more or less interesting.
>
> The easy way to produce such a frequency list is by using arrays. The
> principle is like this:
>
> local arrayWords
> repeat for each word myWord in theText
>   add 1 to arrayWords[myWord]
> end repeat
>
> Now the keys of arrayWords are the words and the values are the word
> counts.
>
> Best wishes
> Viktoras
>
> David Bovill wrote:
>
>> Is there a resource or index that anyone knows of for plain,
>> uninteresting, dull words? I want to take arbitrary chunks of text and
>> search for "interesting" words, that is, domain-specific words that might
>> be useful as links for creating dictionary entries. This would mean
>> creating a list of words and stripping "the", "it", etc. I am imagining it
>> working like a spelling dictionary with the ability to manually edit
>> entries, but I'd like a good starting list. Not sure what to search for :)
Re: Semi-automatic Index generation?
Hi David,

you might wish to discard the 1000 most frequently used words from your list:
English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
German: http://german.about.com/library/blwfreq01.htm

Another approach is statistical: take the whole text and sort its words by their frequency (count) of appearance in the text. If you plot them on a graph you will notice a characteristic 'power law' distribution. Choose the absolute or relative frequency (count) at which to cut off the tail. This tail is what holds all the rare or interesting words of the text. For example, if the text is large you may discard the first 500-1000 words in the list sorted by descending word count. All the words that remain should be the ones that are more or less interesting.

The easy way to produce such a frequency list is by using arrays. The principle is like this:

    local arrayWords
    repeat for each word myWord in theText
      add 1 to arrayWords[myWord]
    end repeat

Now the keys of arrayWords are the words and the values are the word counts.

Best wishes
Viktoras

David Bovill wrote:

> Is there a resource or index that anyone knows of for plain,
> uninteresting, dull words? I want to take arbitrary chunks of text and
> search for "interesting" words, that is, domain-specific words that might
> be useful as links for creating dictionary entries. This would mean
> creating a list of words and stripping "the", "it", etc. I am imagining it
> working like a spelling dictionary with the ability to manually edit
> entries, but I'd like a good starting list. Not sure what to search for :)
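For anyone who wants to try the frequency cut-off outside Revolution, here is a minimal sketch in Python rather than Revolution script. It counts words the same way as the array loop above and then drops the most frequent ones. The function name interesting_words, the skip_top parameter and the regex tokenizer are assumptions made for this illustration; they are not from Viktoras's post.

```python
import re
from collections import Counter


def interesting_words(text, skip_top=500):
    """Return words ordered by descending count, minus the skip_top most frequent."""
    # Lowercase and keep only runs of letters/apostrophes, so "The" and "the,"
    # count as the same word. This tokenizer is an assumption of the sketch;
    # the loop in the post splits on whitespace instead.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    # Discard the head of the list and keep the tail of rarer words.
    return ranked[skip_top:]


sample = "the cat and the dog and the stack of cards"
print(interesting_words(sample, skip_top=2))
```

With the tiny sample text only the two most frequent words ("the" and "and") are skipped; on a large text the 500-1000 range suggested above would be the starting point, tuned by inspecting the result.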
Re: Semi-automatic Index generation?
Thanks Eric!

2008/7/30 Eric Chatonet <[EMAIL PROTECTED]>

> Hello David,
>
> On 30 Jul 08 at 16:08, David Bovill wrote:
>
>> Is there a resource or index that anyone knows of for plain,
>> uninteresting, dull words? I want to take arbitrary chunks of text and
>> search for "interesting" words, that is, domain-specific words that might
>> be useful as links for creating dictionary entries. This would mean
>> creating a list of words and stripping "the", "it", etc. I am imagining it
>> working like a spelling dictionary with the ability to manually edit
>> entries, but I'd like a good starting list. Not sure what to search for :)
>
> 1. You might search for what are called 'stopwords' (uninteresting words)
>    using any Internet search engine.
> 2. Have a look also at what is called 'stemming', which reduces different
>    forms of a word to the same stem:
>    http://www.comp.lancs.ac.uk/computing/research/stemming/general/
> 3. I have put on RevOnline an English, French, Italian, Spanish, German and
>    Portuguese stemmer library (username: sosmartsoftware) that could help
>    you too.
>
> Best regards from Paris,
> Eric Chatonet.
>
> Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
> Email: [EMAIL PROTECTED]/
Re: Semi-automatic Index generation?
Hello David,

On 30 Jul 08 at 16:08, David Bovill wrote:

> Is there a resource or index that anyone knows of for plain,
> uninteresting, dull words? I want to take arbitrary chunks of text and
> search for "interesting" words, that is, domain-specific words that might
> be useful as links for creating dictionary entries. This would mean
> creating a list of words and stripping "the", "it", etc. I am imagining it
> working like a spelling dictionary with the ability to manually edit
> entries, but I'd like a good starting list. Not sure what to search for :)

1. You might search for what are called 'stopwords' (uninteresting words) using any Internet search engine.
2. Have a look also at what is called 'stemming', which reduces different forms of a word to the same stem: http://www.comp.lancs.ac.uk/computing/research/stemming/general/
3. I have put on RevOnline an English, French, Italian, Spanish, German and Portuguese stemmer library (username: sosmartsoftware) that could help you too.

Best regards from Paris,
Eric Chatonet.

Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: [EMAIL PROTECTED]/
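To make the stopword and stemming suggestions concrete, here is a small sketch in Python rather than Revolution script. It shows how the two steps might be combined. The STOPWORDS set is a hypothetical ten-word starter list, and crude_stem is a deliberately naive suffix stripper written for this illustration. It is not the Porter or Lancaster algorithm, and it is not the RevOnline stemmer library mentioned above.

```python
# Hypothetical starter stopword list; a real one would hold a few hundred entries
# and, as David suggests, be editable by hand.
STOPWORDS = {"the", "it", "a", "an", "and", "of", "to", "in", "is", "for"}

# Crude suffix stripping, longest suffix first. Illustrative only: this will
# mis-stem many words that a real stemmer handles correctly.
SUFFIXES = ("ing", "es", "ed", "s")


def crude_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_candidates(text):
    """Return the sorted set of stemmed words that are not stopwords."""
    # Split on whitespace, then trim surrounding punctuation and lowercase.
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return sorted({crude_stem(w) for w in words if w and w not in STOPWORDS})


print(index_candidates("The stacks and the cards: stacking a card in the stack."))
```

In this sample, "stacks", "stacking" and "stack" collapse to one candidate and "cards" and "card" to another, which is the effect stemming is meant to have on an index.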