The following question appeared on the corpora mailing list...

> I'm very interested in finding a list of n-letter English words: I'm mostly
> after 4 and 5 letter words
>
> I'm also interested in two-letter character groupings; that are normally
> found in English, and that are pronounceable -
The Ngram Statistics Package (NSP) might be helpful if you are interested
in compiling this information for specific corpora. You can get NSP from
http://ngram.sourceforge.net. To find the four-letter words and the
two-character sequences in a corpus, you just need to do something like
this:

    count.pl --ngram 1 --token mytokenfile.txt outputfilename inputfilename

where mytokenfile.txt contains a regular expression of the form

    /\b\w\w\w\w\b/

which will give you all 4 character words. In fact, count.pl will report
their frequency counts.

You can find the 2 character sequences (and their frequencies) by changing
mytokenfile.txt to

    /\w\w/

That will chop your corpora up into little two character pieces. It won't
tell you which are pronounceable, but it will give you a frequency list for
all the 2 character sequences that occur.

I hope this helps. Please let us know if you have any questions.

Cordially,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
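
If you just want to see what those two token patterns pick out before running NSP on a real corpus, here is a minimal Python sketch. It is not NSP and only roughly imitates its tokenization. The function names, the sample sentence, and the lowercasing step are my own assumptions for illustration. The overlapping variant is an extra that the count.pl recipe above does not do, in case you would rather count every adjacent pair of characters within a word.

```python
import re
from collections import Counter


def count_tokens(text, pattern):
    """Count non-overlapping matches of `pattern`, scanning left to right.

    Lowercasing is an assumption made here so that 'That' and 'that'
    are counted together.
    """
    return Counter(m.group(0).lower() for m in re.finditer(pattern, text))


def count_overlapping_pairs(text):
    """Count every adjacent pair of word characters (overlapping).

    The lookahead matches at each position without consuming text,
    so 'that' yields 'th', 'ha', 'at'.
    """
    return Counter(p.lower() for p in re.findall(r"(?=(\w\w))", text))


# Hypothetical sample text standing in for a corpus file.
sample = "the cat sat on that mat with that hat"

# Same idea as the token file /\b\w\w\w\w\b/ : whole 4 character words.
four_letter = count_tokens(sample, r"\b\w\w\w\w\b")

# Same idea as the token file /\w\w/ : non-overlapping 2 character pieces.
two_char = count_tokens(sample, r"\w\w")

# Extra: overlapping 2 character sequences within words.
overlapping = count_overlapping_pairs(sample)

if __name__ == "__main__":
    for name, counts in [("4 letter words", four_letter),
                         ("2 char pieces", two_char),
                         ("overlapping pairs", overlapping)]:
        print(name)
        for token, freq in counts.most_common():
            print(f"  {token}\t{freq}")
```

Note that \w also matches digits and the underscore, so use [a-zA-Z] in place of \w if you only want letters. Also, with the non-overlapping pattern the last character of an odd-length word is dropped: "the" gives only "th".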