The following question appeared on the corpora mailing list...

>
> I'm very interested in finding a list of n-letter English words: I'm mostly
> after 4 and 5 letter words
>
> I'm also interested in two-letter character groupings; that are normally
> found in English, and that are pronounceable -
>

The Ngram Statistics Package might be helpful, if you are interested in
compiling this information for specific corpora.

You can get NSP from http://ngram.sourceforge.net. To count the
four-character words and the two-character sequences in a corpus, you
just need to do something like this...

count.pl --ngram 1 --token mytokenfile.txt outputfilename inputfilename

Where mytokenfile.txt contains a regular expression of the form

/\b\w\w\w\w\b/

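which will give you all 4 character words, and count.pl will report
their frequency counts. Note that \w also matches digits and the
underscore, so use [a-zA-Z] in place of each \w if you want letters only.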
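
If you want to check what that regular expression selects before running
it on a whole corpus, here is a rough Python sketch. It is not part of
NSP and does not reproduce count.pl's output format; the function name
and the sample sentence are made up for illustration. It only applies
the same pattern and tallies the matches.

```python
import re
from collections import Counter

# Same pattern as in mytokenfile.txt: exactly four word characters
# with a word boundary on each side.
FOUR_CHAR_WORD = re.compile(r"\b\w\w\w\w\b")

def count_four_char_words(text):
    """Return a Counter mapping each 4-character word to its frequency."""
    return Counter(FOUR_CHAR_WORD.findall(text))

if __name__ == "__main__":
    # Hypothetical sample text, just for illustration.
    sample = "that word and that tree stand near words"
    for word, freq in count_four_char_words(sample).most_common():
        print(word, freq)
```

Like the count.pl example, this is case sensitive, so "That" and "that"
are counted separately unless you lowercase the text first.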

You can find the 2 character sequences (and their frequencies) by
changing mytokenfile.txt to

/\w\w/

That will chop your corpus up into little two-character pieces, taken
left to right without overlap. It won't tell you which are pronounceable,
but it will give you a frequency list for all the 2 character sequences
that occur.
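
To see what the two-character chopping looks like, here is a rough
Python sketch. Again, this is not NSP itself, and the function names are
made up for illustration. The first function takes pairs left to right
without overlap, which is my understanding of how the token definition
above behaves. The second is a variation (an assumption about what you
might want, not something count.pl does with this token file) that counts
every adjacent pair inside a word.

```python
import re
from collections import Counter

def count_pairs(text):
    """Count non-overlapping two-character pieces, scanning left to right.

    A single leftover character at the end of an odd-length word is
    simply skipped, since it cannot match two word characters.
    """
    return Counter(re.findall(r"\w\w", text))

def count_overlapping_pairs(text):
    """Variation: count every adjacent pair of word characters.

    The lookahead lets the matches overlap, so "hello" yields
    he, el, ll, lo.
    """
    return Counter(re.findall(r"(?=(\w\w))", text))

if __name__ == "__main__":
    sample = "hello hello"  # hypothetical sample text
    print(count_pairs(sample))
    print(count_overlapping_pairs(sample))
```

The overlapping version is probably closer to what you want if you are
after all the two-letter groupings that occur in English words.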

I hope this helps. Please let us know if you have any questions.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

