On Wed, 19 Nov 2008 23:25:22 +0100 Pander <[EMAIL PROTECTED]> babbled:
> Hi all,
>
> Together with http://opentaal.org , I'm working on a special Illume
> dictionary for Dutch word completion. It will be available in the near
> future.
>
> Of course this particular word list is very long and contains about
> 250,000 words and has a typical loooong tail. Many words or compositions
> occur seldom in average day use.
>
> What would be a good cut off point in number of words, also in terms of
> performance?
>
> The Portuguese list contains 56,609 words. Is this workable? How many
> does the English contain?

english is about 98,000, but remember english has very few changes in words
for conjugation. i need to change the dict format to account for this and
compress better i think. i do need to make a different entered text -> visible
word mapping tho. this covers blind qwerty entry for accented words. e.g.:

(german)
fass -> Faß
brotchen -> Brötchen

(french)
cafe -> café
etage -> étage
francais -> Français

(japanese)
sakana -> さかな | 魚 | 肴 | 坂な | 茶菓な | 阪な | 差かな | 左かな | 差かな | 査かな | 鎖かな | サカナ | sakana

note that some languages can have 1 romanised input match multiple (different)
displays of that word (japanese is king at this. chinese, if using pinyin,
could likely be similar). right now the dict format doesn't allow for this.
sure, i can extend it with a list of displayed words. currently the non-freq
format is:

cafe
etage

with freq:

cafe 126
etage 98

i can add a display list:

cafe 126 cafe café
etage 98 étage

but the file will get bigger and bigger and get harder to auto-generate from
input data. right now i am unsure of the exact strategy to take... but i'd
like to cover as many languages as i can with 1 format and have minimal dict
size overhead etc.

--
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    [EMAIL PROTECTED]
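
For illustration only, here is a minimal sketch of how the display-list format floated above might be read. It is not Illume's actual parser. It assumes fields are separated by whitespace, that a second field made only of digits is the frequency, and that any remaining fields are display forms, with the typed key itself shown when none are listed. The names Entry, parse_dict_line and load_dict are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Entry(NamedTuple):
    freq: Optional[int]   # None when the line carries no frequency
    displays: List[str]   # words to show; falls back to the typed key


def parse_dict_line(line: str) -> Optional[Tuple[str, Entry]]:
    """Parse one line of the assumed 'key [freq] [display ...]' layout."""
    fields = line.split()
    if not fields:
        return None  # blank line
    key, rest = fields[0], fields[1:]
    freq = None
    # Assumption: a purely numeric second field is the frequency.
    if rest and rest[0].isdecimal():
        freq = int(rest[0])
        rest = rest[1:]
    # With no explicit display list, the key is displayed as typed.
    return key, Entry(freq, rest or [key])


def load_dict(text: str) -> Dict[str, Entry]:
    """Build a typed-text -> Entry mapping from the whole dictionary text."""
    entries: Dict[str, Entry] = {}
    for line in text.splitlines():
        parsed = parse_dict_line(line)
        if parsed is not None:
            key, entry = parsed
            entries[key] = entry
    return entries


SAMPLE = """\
cafe 126 cafe café
etage 98 étage
fass
"""

words = load_dict(SAMPLE)
```

One limitation of this layout is that a display form consisting only of digits, placed directly after the key, would be misread as a frequency. A real format would need an explicit separator or a fixed field order to avoid that.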