Carsten Haitzler (The Rasterman) wrote:
> On Thu, 28 Aug 2008 04:37:23 +0200 "Marco Trevisan (Treviño)" <[EMAIL PROTECTED]>
> babbled:
>> Carsten Haitzler (The Rasterman) wrote:
>>> yup yup. don't worry - i understand why :) i speak several languages myself
>>> (not italian - but i did study latin, and speak french, german, english,
>>> japanese, some usable level of portuguese). i definitely get the language
>>> issues - for both european and asian languages :) yes. the above would
>>> reduce dictionary size. it would make parsing it much harder.
>>
>> I suspected this :/. I had hoped to be wrong...
>
> i am thinking about this... i have some ideas that may improve this... this
> is my thought train:
>
> right now the format is either:
>
>   word\n
>   word2\n
>   etc.
>
> or:
>
>   word 123\n
>   word 23\n
>   etc.
>
> (sorted case-insensitively).
>
> the numbers are "frequency of use", so words used more often get a more
> prominent position in the match list.
>
> 1. add a line-skip byte at the start of each line - skipping to the next
>    line becomes much faster (just jump N bytes as per the byte; if the line
>    is > 255 bytes then the byte-jump == 0 and you skip the slow way until
>    the newline - that shouldn't be very common).
> 2. extend the line to be:
>
>    word NNN match1 match2 match3 ~suffix1 ~suffix2\n
Ok, but on the other hand this kind of format wouldn't account for the
frequency of use of subwords (words composed of the given suffix plus a
prefix); that is actually one of the good points of the current scheme.

> so now we have the ability to match and "append" a suffix. a suffix is ~XXX
> and full replacement words are just listed. this should remain fast as i
> only "lookup" on the first word on the line that is the initial match - so
> it builds a list of candidates. the problem is that once you exceed the
> "base" it needs to dynamically build matches for all combinations of base +
> extension. also for full replacements (as in the last 2 lines) it needs to
> be able to match these as well, so they end up being full entries too. the
> real problem is generating such a dictionary - i tried to keep the dict
> format so simple that it was trivial to generate. but it'd solve your
> problem.

Well, before testing my heavy dictionary with illume I hoped it would work
well, but I knew that all this redundancy could cause problems in parsing
(both in terms of performance and of memory usage). I figure this kind of
implementation could help in these situations (which I don't think are so
uncommon: I guess that, at least for other Latin-derived languages, the
dictionary file would be far larger than the ones in /usr/share/dict).

> anyway. if i am going to go expand the dictionary format, i really need to
> be careful. i kept it simple because i didn't want to solve the world's
> dictionary problems - i did want to keep it basic but working. as best i
> can tell the OM userbase is still mainly western-speaking (yes - i know we
> have people here from asia! :) not forgetting! just looking at dealing with
> the majority first!)
>
> anyway... i am mulling this over. the byte-skip may solve some performance
> issues, but this means i now need a special dict generator tool.
> i was trying to avoid that :(

Yes, I figured this. Maybe you could support multiple formats (the current
scheme and the improved one): the majority wouldn't need any tool to generate
the dict except "sort -f".

> as per above - your idea of having a list of suffixes led me down the above
> path. i have a feeling it still isn't perfect, but it's an improvement. it
> means the dict now knows about prefixes and suffixes, so when you type the
> "root" of a word that is conjugated, the dict can even offer the conjugated
> forms as matches. that's good for western languages

Yes, it is, and then prediction (as opposed to typo fixing) would be easier
too.

-- 
Treviño's World - Life and Linux
http://www.3v1n0.net/

_______________________________________________
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community