On Wed, 27 Aug 2008 23:12:59 +0200 "Marco Trevisan (Treviño)" <[EMAIL PROTECTED]> babbled:
> Carsten Haitzler (The Rasterman) wrote:
> > On Wed, 27 Aug 2008 16:00:50 +0200 "Marco Trevisan (Treviño)"
> > <[EMAIL PROTECTED]> babbled:
> >> Well, generally for small words there's a correction list, but it's not
> >> always complete and often there are words very different from the one
> >> I'd like to write, but not that one. So maybe it doesn't search in all
> >> the dictionary. Could I try that?
> >> However my fingers are not so great...
> >> If you want I can send you my dictionaries, so you'll be able to test
> >> them in a better way.
> >
> > hmm. is this english? i am wondering if non-ascii chars are messing it up or
> > not. your dictionary may be useful - i have just been going off my 98,000
> > or so entry dict from /usr/share/dict/words which seems to be big enough
> > for me and has pretty much everything in it... for english anyway.
> > as it's used for spellchecking i kind of assumed it'd be good enough for
> > typing up sms's and emails :) at least in my tests it is listing all the
> > completions i'd expect it to. did you sort -f the illume dict?
> > (non-case-sensitive sort)?
>
> Yes it's sorted and it's an Italian dictionary (so few non-ascii chars);
> that's why it has so many words. Consider that an Italian dictionary has
> about 120000 words to be inflected.
> So from a verb in the infinitive form I can extract about 50 different
> words, from nouns and from adjectives about 3 for each.
> But here (like in the more common Western languages), in most cases,
> only the suffix differs.
>
> Imho, a way to reduce the size would be allowing a rule to set suffix
> and prefix (for composed words) that would reduce the dictionary size.
> So, for example, in my dictionary instead of using 50 lines for each
> verb I would use only one line per verb; i.e.:
>
> Italian verb "parlare" (to talk) would be (not complete)
>  parl{o,i,a,iamo,ate,ano,avo,avi,ava,avamo,avate,avano,ai,asti,ò,ammo, \
>       aste,arono,erò,erai,erà,eremo,erete,eranno,erei,eresti,erebbe, \
>       eremmo,ereste,erebbero,ii,iamo,iate,ino,assi,asse,assimo, \
>       assero,ino,ando,ante,ato,ata,ati}
>
> Italian noun "casa" (house) would be
>  cas{a,e}
>
> Italian adjective "libero" (free [as freedom]) would be
>  liber{a,i,o}

yup yup. don't worry - i understand why :) i speak several languages myself
(not italian - but i did study latin, and speak french, german, english,
japanese, some usable level of portuguese). i definitely get the language
issues - for both european and asian languages :)

yes. the above would reduce dictionary size. it would make parsing it much
harder. right now it's nice and simple and should work with pretty much every
language i can think of that doesn't use input methods and composition (ie
japanese/chinese where you use romaji or pinyin as phonetic representations
of words). the good bit is:

1. i can mmap() the file trivially.
2. i can build a quick lookup table by scanning through lines and the first 2
   chars of each line - use this 2 char "hash" lookup to jump quickly to my
   mmaped point - then do a (hopefully short) linear search. i keep the
   search results iteratively, so it will start where it left off last time
   to save more walking.

> BTW I don't know if this would improve the keyboard typo-fixing work
> (maybe yes if also the suffixes/[prefixes?] are sorted)

hmm - no. as long as it is sorted (case-insensitive) at all, then it should
work as the algorithm is simple.

> Anyway, let me know if I should send you the dict I have.

it's italian - right?

> > illume's dict is 6mb? hmm i guess the raw text there has a lot of
> > redundancy :)
> Yes and this happens because of the things shown above.
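The lookup scheme described earlier in this reply (a table keyed on the first two characters of each line, followed by a short linear scan) could be sketched roughly as below. This is an illustration only, not illume's actual code: the function names are hypothetical, `data` is a plain bytes object standing in for the mmap()ed file, the word list is assumed to be sorted case-insensitively with one word per line, and the "start where it left off" state is left out.

```python
# Sketch only: a two-character jump table over a case-insensitively sorted
# word list. `data` stands in for the mmap()ed dictionary file; every name
# here is hypothetical and not taken from illume.

def build_index(data: bytes) -> dict:
    """Map the lowercased first two bytes of each line to the offset of the
    first line that starts with them."""
    index = {}
    pos = 0
    while pos < len(data):
        end = data.find(b"\n", pos)
        if end == -1:
            end = len(data)
        key = data[pos:min(pos + 2, end)].lower()
        index.setdefault(key, pos)  # keep only the first offset per key
        pos = end + 1
    return index


def completions(data: bytes, index: dict, prefix: str, limit: int = 10) -> list:
    """Jump to the bucket for the first two bytes, then scan linearly."""
    want = prefix.lower().encode("utf-8")
    key = want[:2]
    if len(key) == 2:
        pos = index.get(key)
        if pos is None:
            return []
    else:
        pos = 0  # prefix too short for the table: fall back to a full scan
    found = []
    while pos < len(data) and len(found) < limit:
        end = data.find(b"\n", pos)
        if end == -1:
            end = len(data)
        line = data[pos:end]
        low = line.lower()  # ASCII-only folding, like a plain tolower()
        if len(key) == 2 and low[:2] != key:
            break  # left the bucket; sorted input means no further matches
        if low.startswith(want):
            found.append(line.decode("utf-8"))
        pos = end + 1
    return found
```

Because the key is taken from bytes rather than characters and the folding is ASCII-only, words that begin with accented letters end up in buckets keyed on raw UTF-8 bytes, which is one place where non-ascii input could plausibly misbehave in a scheme like this.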
> And I've made only a part of the work; I guess that the final dictionary
> will double this size. And it won't contain any proper names (city names,
> acronyms...).

hmm. ok. well apart from efficiency of dict size and search lengths a simple
dict-format dictionary should be able to work fine. maybe some utf8 handling
etc. is busted and words with accents get dumped or stopped at. what i do
need is a small set of examples to work from. i can create my own :) i never
tested anything with anything other than ascii chars (no accents/umlauts
etc.) so that's why i suspect them.

> Italian standard linux dictionary (/usr/share/dict/italian) "weighs"
> 1.2mb but it's quite incomplete.

aaah. ok. i guess that's not great quality then :)

> > i tried to keep the dictionary simple in illume but am always willing to
> > look at other ways to improve it. though the keyboard is not really a focus
> > of mine - it's something along the way so there may come a time when i go
> > "well - you want it better.. please.. send a patch!"... but it's fresh on
> > my plate now, so it's active :)
>
> And this is a great thing, since this phone without a great virtual
> keyboard (like the one you're doing) won't be as usable/cool as it should
> be. Imho this is the killer tool of illume.

thanks :) though really.. there is much more to illume :)

> >> Another thing I'd like to suggest is that imho the backspace/space
> >> right-left/left-right dragging is too long. If you try writing using
> >> your thumbs you'll notice that it's hard to delete a word... Imho they
> >> should be more sensitive.
> >
> > from illume's TODO file (in svn):
> >
> > * kbd needs drag for backspace/next word etc. to be shorter
> >
> > :) already there. :) well - as with accent normalising - there is a marker
> > that i realise something needs to be done.
>
> Nice! :P

hehehe - i just haven't done it. that's all. accent char normalising is easy:

ñ -> n
é -> e
ö -> o

etc. - just strip any accent (and convert to lower case).
what i was wondering was:

æ -> ?
ß -> ? (maybe s?)

and some others where i don't have a simple 1 : 1 normalisation mapping. so i
kept it a simple tolower() and put the FIXME in. :)

-- 
Carsten Haitzler (The Rasterman) <[EMAIL PROTECTED]>

_______________________________________________
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community
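One possible way to do the accent stripping discussed in this message is sketched below, using Unicode decomposition so that letters such as ñ, é and ö lose their accents without a hand-written table. The function name and the `SPECIAL` table are hypothetical, and the mappings chosen for æ and ß ("ae" and "ss") are only common transliteration conventions assumed for illustration; the thread leaves that question open.

```python
import unicodedata

# Hypothetical fallbacks for letters that have no accent to strip.
# These choices are assumptions, not something decided in this thread.
SPECIAL = {"æ": "ae", "ß": "ss"}


def normalise(word: str) -> str:
    """Lowercase a word and strip accents from its letters."""
    out = []
    for ch in word.lower():
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        # NFD splits e.g. "é" into "e" plus a combining accent mark;
        # the combining marks are then dropped.
        for part in unicodedata.normalize("NFD", ch):
            if not unicodedata.combining(part):
                out.append(part)
    return "".join(out)
```

Note that the special cases are not 1 : 1 (one letter becomes two), so a matcher that compares typed keys against normalised dictionary words position by position would need to allow for that.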