Re: Semi-automatic Index generation?

2008-07-31 Thread Devin Asay


On Jul 31, 2008, at 2:12 AM, viktoras didziulis wrote:


Hi David,

you might wish to discard the 1000 most frequently used words from your
list:
English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
German: http://german.about.com/library/blwfreq01.htm

Another approach is statistical: take the whole text and sort the words by
their frequency (count) of appearance in the text. If you put them on a
graph you would notice a characteristic 'power law' distribution. Set the
absolute or relative frequency (or count) value at which to cut the tail.
This tail is what holds all the rare or interesting words of the text.
For example, if the text is large you may discard the first 500-1000
words in the list sorted by descending word count. All words that remain
should be the ones that are more or less interesting.

The easy way to produce such a frequency list is by using arrays. The
principle is like this:

local arrayWords
repeat for each word myWord in theText
add 1 to arrayWords[myWord]
end repeat

Now the keys are words and the values are word counts in arrayWords.


Slick, and so simple. This is going into my script library. Thanks,
Viktoras!


Regards,

Devin

Devin Asay
Humanities Technology and Research Support Center
Brigham Young University

___
use-revolution mailing list
use-revolution@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-revolution


Re: Semi-automatic Index generation?

2008-07-31 Thread David Bovill
Thanks for the tips!

2008/7/31 viktoras didziulis <[EMAIL PROTECTED]>

> Hi David,
>
> you might wish to discard the 1000 most frequently used words from your
> list:
> English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
> German: http://german.about.com/library/blwfreq01.htm
>
> Another approach is statistical: take the whole text and sort the words by
> their frequency (count) of appearance in the text. If you put them on a
> graph you would notice a characteristic 'power law' distribution. Set the
> absolute or relative frequency (or count) value at which to cut the tail.
> This tail is what holds all the rare or interesting words of the text.
> For example, if the text is large you may discard the first 500-1000
> words in the list sorted by descending word count. All words that remain
> should be the ones that are more or less interesting.
>
> The easy way to produce such a frequency list is by using arrays. The
> principle is like this:
>
> local arrayWords
> repeat for each word myWord in theText
> add 1 to arrayWords[myWord]
> end repeat
>
> Now the keys are words and the values are word counts in arrayWords.
>
> Best wishes
> Viktoras
>
>
> David Bovill wrote:
>
>> Is there a resource or index that anyone knows of for plain, uninteresting,
>> dull words? I want to take arbitrary chunks of text and search for
>> "interesting" words, that is, domain-specific words that might be useful
>> as links for creating dictionary entries. This would mean creating a list
>> of words and stripping "the", "it", etc. I am imagining it working like a
>> spelling dictionary with the ability to manually edit entries, but I'd
>> like a good starting list. Not sure what to search for :)


Re: Semi-automatic Index generation?

2008-07-31 Thread viktoras didziulis

Hi David,

you might wish to discard the 1000 most frequently used words from your 
list:

English: http://web1.d25.k12.id.us/home/curriculum/fuw.pdf
German: http://german.about.com/library/blwfreq01.htm

Another approach is statistical: take the whole text and sort the words by
their frequency (count) of appearance in the text. If you put them on a
graph you would notice a characteristic 'power law' distribution. Set the
absolute or relative frequency (or count) value at which to cut the tail.
This tail is what holds all the rare or interesting words of the text.
For example, if the text is large you may discard the first 500-1000
words in the list sorted by descending word count. All words that remain
should be the ones that are more or less interesting.

The easy way to produce such a frequency list is by using arrays. The
principle is like this:

local arrayWords
repeat for each word myWord in theText
add 1 to arrayWords[myWord]
end repeat

Now the keys are words and the values are word counts in arrayWords.
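
For anyone who wants to try the cut-off idea outside Revolution, here is a minimal sketch of the same principle in Python (not Revolution script). The function names, the sample text and the skip_top parameter are made up for illustration; the cut-off of 500 is just the lower figure suggested above and would need tuning for a real text.

```python
import re
from collections import Counter

def word_counts(text):
    """Count how often each word occurs, ignoring case and punctuation."""
    # Runs of letters only (Unicode-aware), so "Text," and "text" match.
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

def rare_words(text, skip_top=500):
    """Return the words left after dropping the skip_top most frequent ones.

    skip_top is a hypothetical parameter name; 500 follows the rough
    500-1000 figure mentioned in the message, not a tested value.
    """
    ranked = [word for word, _ in word_counts(text).most_common()]
    return ranked[skip_top:]

# Tiny made-up sample, so only the two most frequent words are skipped.
sample = "the cat and the dog and the heron"
counts = word_counts(sample)
tail = rare_words(sample, skip_top=2)
print(counts["the"], tail)
```

On a short sample like this the result is not very meaningful; the approach only starts to separate dull words from interesting ones once the text is large enough for the frequency distribution to show its long tail.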

Best wishes
Viktoras


David Bovill wrote:

Is there a resource or index that anyone knows of for plain, uninteresting,
dull words? I want to take arbitrary chunks of text and search for
"interesting" words, that is, domain-specific words that might be useful as
links for creating dictionary entries. This would mean creating a list of
words and stripping "the", "it", etc. I am imagining it working like a
spelling dictionary with the ability to manually edit entries, but I'd like
a good starting list. Not sure what to search for :)


Re: Semi-automatic Index generation?

2008-07-30 Thread David Bovill
Thanks Eric!

2008/7/30 Eric Chatonet <[EMAIL PROTECTED]>

> Hello David,
>
> On 30 Jul 2008, at 16:08, David Bovill wrote:
>
>
>> Is there a resource or index that anyone knows of for plain, uninteresting,
>> dull words? I want to take arbitrary chunks of text and search for
>> "interesting" words, that is, domain-specific words that might be useful
>> as links for creating dictionary entries. This would mean creating a list
>> of words and stripping "the", "it", etc. I am imagining it working like a
>> spelling dictionary with the ability to manually edit entries, but I'd
>> like a good starting list. Not sure what to search for :)
>>
>
> 1. You might search for what are called 'stopwords' (uninteresting words)
> using any Internet search engine.
> 2. Have a look also at what is called 'stemming':
> http://www.comp.lancs.ac.uk/computing/research/stemming/general/
> which lets you reduce different forms of a word to a common stem.
> 3. I have put on RevOnline an English, French, Italian, Spanish, German and
> Portuguese stemmer library (username: sosmartsoftware) that could help you
> too.
>
> Best regards from Paris,
> Eric Chatonet.
> 
> Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
> Email: [EMAIL PROTECTED]/
> 
>
>


Re: Semi-automatic Index generation?

2008-07-30 Thread Eric Chatonet

Hello David,

On 30 Jul 2008, at 16:08, David Bovill wrote:

Is there a resource or index that anyone knows of for plain, uninteresting,
dull words? I want to take arbitrary chunks of text and search for
"interesting" words, that is, domain-specific words that might be useful as
links for creating dictionary entries. This would mean creating a list of
words and stripping "the", "it", etc. I am imagining it working like a
spelling dictionary with the ability to manually edit entries, but I'd like
a good starting list. Not sure what to search for :)


1. You might search for what are called 'stopwords' (uninteresting
words) using any Internet search engine.
2. Have a look also at what is called 'stemming':
http://www.comp.lancs.ac.uk/computing/research/stemming/general/
which lets you reduce different forms of a word to a common stem.
3. I have put on RevOnline an English, French, Italian, Spanish,
German and Portuguese stemmer library (username: sosmartsoftware)
that could help you too.
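
As a rough illustration of point 1, stopword filtering can be sketched in a few lines. This is Python rather than Revolution script, and it does not use the RevOnline stemmer library; the STOPWORDS set is a tiny made-up sample, and interesting_words is a hypothetical name, so treat it as a starting point under those assumptions.

```python
import re

# Tiny illustrative stopword set (an assumption); a real list found via a
# search for 'stopwords' would be far longer.
STOPWORDS = {"the", "it", "a", "an", "and", "of", "to", "is", "in"}

def interesting_words(text, stopwords=STOPWORDS):
    """Return the distinct non-stopwords of text, in order of first appearance."""
    kept = []
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        if word not in stopwords and word not in kept:
            kept.append(word)
    return kept

sentence = "The stemmer reduces words to the stem of the word."
found = interesting_words(sentence)

# The list can be edited by hand, much like a spelling dictionary:
# here "words" is added as an extra dull word.
custom = STOPWORDS | {"words"}
found_custom = interesting_words(sentence, custom)
print(found, found_custom)
```

Keeping the stopwords in a plain set makes manual edits cheap: adding or removing an entry changes the result on the next run without touching the filter itself.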


Best regards from Paris,
Eric Chatonet.

Plugins and tutorials for Revolution: http://www.sosmartsoftware.com/
Email: [EMAIL PROTECTED]/





Semi-automatic Index generation?

2008-07-30 Thread David Bovill
Is there a resource or index that anyone knows of for plain, uninteresting,
dull words? I want to take arbitrary chunks of text and search for
"interesting" words, that is, domain-specific words that might be useful as
links for creating dictionary entries. This would mean creating a list of
words and stripping "the", "it", etc. I am imagining it working like a
spelling dictionary with the ability to manually edit entries, but I'd like
a good starting list. Not sure what to search for :)