Following some discussion in the #dspace channel, here are word frequency 
lists for the Māori language, intended for configuring Māori language 
support in search.

The first set of words is derived ultimately from the Māori Niupepa 
Collection at http://www.nzdl.org/niupepa. These are mainly 19th-century 
newspapers in traditional orthography (= no macrons). The command line 
used to generate them is:

cat [0-9]*/doc.xml |
  sed 's/<Metadata [^>]*>[^>]*>//' |
  sed 's/<[^>]*>//g' |
  sed 's/&lt;[^;]*;//g' |
  grep -vi '[qysdflbvxzc]' |
  tr -cs '[^a-zA-Z]' '\012' |
  tr '[A-Z]' '[a-z]' |
  sort | uniq -c | sort -n

    9534 ata
    9673 wha
    9734 kino
    9975 motu
   10025 kahore
   10064 katahi
   10083 tahi
   10086 marama
   10120 whai
   10492 ingoa
   10532 wahine
   10548 wa
   10610 kau
   10685 muri
   10740 heoi
   10748 mau
   10864 pa
   10941 kawanatanga
   11182 kaha
   11299 rangatira
   11308 whaka
   11518 ahua
   11560 taha
   11592 tamariki
   11619 rongo
   11717 hui
   12014 mana
   12016 mohio
   12244 ora
   12353 rua
   12697 take
   13280 puta
   13977 engari
   14079 taku
   14103 wahi
   14201 ona
   14417 ahau
   14789 raua
   15247 ture
   15407 kotahi
   15688 utu
   15813 reira
   16081 kite
   16249 noho
   16332 moni
   16586 tera
   16993 whakaaro
   17184 tino
   17299 iho
   17416 tika
   17721 enei
   18284 ara
   18680 tena
   19048 ranei
   19086 tana
   19758 hoa
   20478 koe
   21323 tikanga
   21592 tatou
   21719 aua
   21987 noa
   22317 tae
   22356 whare
   22374 tu
   22400 etahi
   22415 matou
   22916 kaore
   23346 ake
   23887 rawa
   25629 au
   25816 mate
   26329 pakeha
   27063 tau
   27326 ta
   27380 kore
   27961 koutou
   27977 tonu
   28943 kupu
   31213 tona
   31711 pai
   31881 runga
   31903 korero
   32158 roto
   34556 whenua
   34991 tetahi
   35540 katoa
   36485 no
   36609 nui
   36674 kai
   37706 haere
   38645 iwi
   38904 to
   39048 kei
   39090 ma
   42886 ra
   45527 mahi
   48027 hei
   48219 taua
   50332 na
   50620 ratou
   55736 maori
   56889 kua
   58158 hoki
   61591 ano
   65470 ia
   69170 tenei
   72397 tangata
   72615 mea
   72836 ai
   75123 nei
   77790 atu
   80224 mo
   82101 mai
   90419 me
   98716 kia
  117543 ana
  141715 ka
  147326 ko
  156732 he
  193002 a
  283013 nga
  302250 o
  306391 e
  310758 ki
  474233 i
  833949 te
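
If a list like this is used as a source of stopwords (an assumption; how 
the list gets consumed depends on the search configuration), the counts 
need to be stripped and the most frequent words picked out. A minimal 
sketch, where top_words is a hypothetical helper name and the input is 
the "count word" format printed by uniq -c:

```shell
# Hypothetical helper: read "count word" lines (as printed by uniq -c)
# on stdin and print the N most frequent words, one per line,
# most frequent first.
top_words() {
    sort -rn | head -n "$1" | awk '{ print $2 }'
}

# Example using the last three lines of the list above.
printf '%s\n' '  310758 ki' '  474233 i' '  833949 te' | top_words 2
```

How many words to keep is a judgment call; the cut-off is not something 
the counts themselves decide.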


The second set of words is derived from a private corpus (not 
distributable for copyright reasons). This is modern text (20th and 21st 
century), primarily in modern orthography (= macrons are used) and 
primarily from government and official channels. The command line used 
to generate them is:

cat *.xml |
  sed 's/<[^ ]* xml:lang="en">[^>]*>//' |
  sed 's/<[^>]*>//g' |
  tr ' \(\)\{\}\[\];:",.0-9-' '\012' |
  grep -vi '[qysdflbvxzc]' |
  tr '[A-Z]' '[a-z]' |
  sort | uniq -c | sort -n

    3043 ota
    3096 mō
    3136 ara
    3204 kaunihera
    3206 kore
    3303 £
    3346 taha
    3387 tu
    3403 rohe
    3406 iho
    3432 noho
    3462 the
    3588 riihi
    3728 tae
    3787 whakahaere
    3841 nui
    3860 koe
    3887 aua
    3892 etahi
    4055 mau
    4057 tona
    4063 iwi
    4078 tika
    4168 utu
    4209 pukapuka
    4258 poraka
    4278 take
    4391 reira
    4424 wahi
    4533 tekau
    4623 tekiona
    4634 whai
    4636 tonu
    4702 haere
    4721 ā
    4770 tuku
    4935 no
    5099 takiwa
    5102 tono
    5331 ano
    5359 nama
    5383 ingoa
    5560 na
    5586 kupu
    5917 to
    6164 mana
    6295 ake
    6348 mea
    6712 katoa
    7111 mahi
    7113 moni
    7161 kooti
    7233 ratou
    7288 tau
    7704 tikanga
    7845 raro
    7945 kei
    7956 ma
    8562 ranei
    8733 kai
    9005 hoki
    9856 ra
   10239 hei
   10271 tetahi
   10695 ai
   11492 roto
   11612 tenei
   11654 tangata
   11664 runga
   11855 ture
   12186 mai
   12527 ia
   13468 kua
   14336 taua
   15843 nei
   16143 maori
   18231 mo
   18411 ngā
   18833 kia
   21203 atu
   22526 he
   22921 ana
   25268 whenua
   26545 me
   28052 ka
   32916 ko
   44994 a
   52515 nga
   60488 e
   69607 ki
   92490 o
  128901 i
  208905 te
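
Note that this list counts macronised and unmacronised spellings 
separately (nga and ngā, mo and mō, a and ā). If the search index folds 
macrons (an assumption; that depends on how the analyzer is set up), the 
counts can be merged. A rough sketch, where fold_macrons is a 
hypothetical helper name and only lowercase macronised vowels are 
handled:

```shell
# Hypothetical helper: read "count word" lines on stdin, replace
# lowercase macronised vowels with plain vowels, sum the counts of
# words that become identical, and print the result sorted by count.
fold_macrons() {
    sed -e 's/ā/a/g' -e 's/ē/e/g' -e 's/ī/i/g' -e 's/ō/o/g' -e 's/ū/u/g' |
    awk '{ total[$2] += $1 }
         END { for (w in total) printf "%8d %s\n", total[w], w }' |
    sort -n
}

# Example using four lines from the list above:
# mo + mō = 21327 and nga + ngā = 70926.
printf '%s\n' '    3096 mō' '   18231 mo' '   18411 ngā' '   52515 nga' |
    fold_macrons
```

Folding loses the vowel-length distinction, so it only makes sense if 
the indexed text and the queries are folded the same way.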


-- 
Stuart Yeates
http://www.nzetc.org/       New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/     Institutional Repository

