Re: Illume dictionary for Dutch (Nederlands)

2008-11-27 Thread Pander
Is it possible to put comments in the .dic file? If so, in what format?
E.g. only in the first couple of lines, each starting with a #?

___
Openmoko community mailing list
community@lists.openmoko.org
http://lists.openmoko.org/mailman/listinfo/community


Re: Illume dictionary for Dutch (Nederlands)

2008-11-27 Thread The Rasterman
On Fri, 28 Nov 2008 00:20:38 +0100 Pander [EMAIL PROTECTED]
babbled:

 Is it possible to put comments in the .dic file? If so, in what format?
 E.g. only the first couple of lines which start with a #.

no. it doesn't support comments.
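
Since the .dic format has no comment syntax, one possible workaround is to keep an annotated master list and strip the annotations before installing the file. A minimal sketch, assuming `#` as the comment marker in the master list (a convention of this sketch only, not something Illume understands); the sample words and frequencies are made up:

```python
def strip_comments(lines):
    """Yield dictionary entries, dropping blank lines and '#' comment lines."""
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            yield entry


# Hypothetical annotated master list; only the entries should reach the .dic.
annotated = [
    "# Dutch word list (hypothetical header comment)",
    "# second comment line",
    "fiets 120",
    "",
    "gracht 45",
]

clean = list(strip_comments(annotated))
print("\n".join(clean))
```

The output of such a step, not the annotated master list, would then be the file that is installed as the .dic.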

-- 
- Codito, ergo sum 

Re: Illume dictionary for Dutch (Nederlands)

2008-11-27 Thread Guillaume Chereau
On Thu, 2008-11-20 at 10:14 +1100, Carsten Haitzler wrote:
 (japanese)
 sakana - さかな | 魚 | 肴 | 坂な | 茶菓な | 阪な | 差かな | 左かな | 差かな |
 査かな | 鎖かな | サカナ | sakana
 

Hi raster, I am curious how we can pass Unicode characters like those to
applications via X. I thought the keyboard could only send keycodes.
How does it work with the Illume keyboard?

-charlie




Re: Illume dictionary for Dutch (Nederlands)

2008-11-20 Thread Pander
Hi all,

I intend to generate the following:
- a full list utf-8 (for 8 bit SMS and regular use, default)
- b full list utf-8 GSM 03.38[1] (for 7 bit SMS)
- c truncated list utf-8 (for 8 bit SMS and regular use)
- d truncated list utf-8 GSM 03.38[1] (for 7 bit SMS, default)

[1] These utf-8 characters in this list are within the 7-bit range of GSM
03.38, see http://en.wikipedia.org/wiki/Short_message_service#GSM Note
that more characters

a and b will both have 250,000 words
b will be conversion, remapping and normalisation of a
c and d are truncations and normalisations of a and b, respectively

For utf-16, a simple conversion of the utf-8 files can be used, but I'll
leave this for now. This could result in two extra files.

Note that neither extended nor non-extended ASCII is available. Is this
desirable? It could result in four extra files.

So, I can come up with 10 different files. Which of them are, in your
opinion, the most useful?

Regards,

Pander



Re: Illume dictionary for Dutch (Nederlands)

2008-11-20 Thread Rui Miguel Silva Seabra
I have no idea... I might only make a new version with utf-8 encoded
characters. :)


-- 
You are what you see.
Today is Prickle-Prickle, the 32nd day of The Aftermath in the YOLD 3174
+ No matter how much you do, you never do enough -- unknown
+ Whatever you do will be insignificant,
| but it is very important that you do it -- Gandhi
+ So let's do it...?



Re: Illume dictionary for Dutch (Nederlands)

2008-11-20 Thread Pander
Small correction to my text:

"Note that more characters" must be "Note that certain special characters
are in GSM 03.38 which are not in extended ASCII"
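
The corrected statement can be checked with a short sketch. It assumes that "extended ASCII" means ISO-8859-1 (Latin-1), which is only one of several possible readings, and it spells out the GSM 03.38 default alphabet and its extension table by hand (the escape code itself is left out):

```python
# GSM 03.38 default alphabet, basic table (escape code omitted).
GSM_BASIC = (
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table, reached through the escape code.
GSM_EXTENDED = "^{}\\[~]|€\f"


def not_in_latin1(chars):
    """Return the characters that cannot be encoded as ISO-8859-1."""
    missing = []
    for ch in chars:
        try:
            ch.encode("latin-1")
        except UnicodeEncodeError:
            missing.append(ch)
    return "".join(missing)


print(not_in_latin1(GSM_BASIC + GSM_EXTENDED))  # ΔΦΓΛΩΠΨΣΘΞ€
```

Under that assumption, the Greek capitals and the euro sign are the GSM 03.38 characters that have no Latin-1 equivalent.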


Nevertheless, one complete utf-8 dictionary could be used by most
applications, including SMS. The conversion I do for GSM 03.38 could also
be done later, just before sending the SMS.



Re: Illume dictionary for Dutch (Nederlands)

2008-11-20 Thread The Rasterman
On Thu, 20 Nov 2008 10:55:02 +0100 (CET) Pander
[EMAIL PROTECTED] babbled:

any dictionary should not care about gsm encodings. it should be just a utf8
dictionary file. it is the job of the sms app to convert normal utf8 unicode to
whatever encoding is used by the network, and back. :)
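
A rough sketch of what that send-time step could look like, not taken from any actual SMS application: the dictionary stays plain UTF-8, and the (hypothetical) function `sms_encoding` decides per message whether the text fits the GSM 03.38 7-bit alphabet or has to fall back to UCS-2. Characters from the extension table cost two septets because they are sent behind an escape code.

```python
# GSM 03.38 default alphabet, basic table (escape code omitted).
GSM_BASIC = set(
    "@£$¥èéùìòÇ\nØø\rÅåΔ_ΦΓΛΩΠΨΣΘΞÆæßÉ !\"#¤%&'()*+,-./0123456789:;<=>?"
    "¡ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÑÜ§¿abcdefghijklmnopqrstuvwxyzäöñüà"
)
# Extension table: each character is sent as escape + code, i.e. two septets.
GSM_EXTENDED = set("^{}\\[~]|€\f")


def sms_encoding(text):
    """Return (encoding name, length in septets or UCS-2 code units)."""
    septets = 0
    for ch in text:
        if ch in GSM_BASIC:
            septets += 1
        elif ch in GSM_EXTENDED:
            septets += 2
        else:
            # At least one character is outside GSM 03.38: use UCS-2 instead.
            return "UCS-2", len(text.encode("utf-16-be")) // 2
    return "GSM 7-bit", septets


for message in ("café", "reëel", "5 €"):
    print(message, sms_encoding(message))
```

With this split, a Dutch word such as "reëel" can stay in the dictionary unchanged; it only forces the message containing it into UCS-2.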


-- 
- Codito, ergo sum - I code, therefore I am --
The Rasterman (Carsten Haitzler)[EMAIL PROTECTED]




Illume dictionary for Dutch (Nederlands)

2008-11-19 Thread Pander
Hi all,

Together with http://opentaal.org , I'm working on a special Illume
dictionary for Dutch word completion. It will be available in the near
future.

Of course this particular word list is very long: it contains about
250,000 words and has a typical long tail. Many words and compounds
occur only seldom in everyday use.

What would be a good cutoff point for the number of words, also in terms of
performance?

The Portuguese list contains 56,609 words. Is this workable? How many
does the English contain?

Thanks,

Pander



Re: Illume dictionary for Dutch (Nederlands)

2008-11-19 Thread Rui Miguel Silva Seabra
On Wed, Nov 19, 2008 at 11:25:22PM +0100, Pander wrote:
 Hi all,
 
 Together with http://opentaal.org , I'm working on a special Illume
 dictionary for Dutch word completion. It will be available in the near
 future.
 
 Of course this particular word list is very long: it contains about
 250,000 words and has a typical long tail. Many words and compounds
 occur only seldom in everyday use.
 
 What would be a good cutoff point for the number of words, also in terms of
 performance?
 
 The Portuguese list contains 56,609 words. Is this workable? How many
 does the English contain?

[EMAIL PROTECTED]:/usr/lib/enlightenment/modules/illume/dicts# wc -w *.dic
   196684 English_(US).dic
    10002 English_(US)_Small.dic
   113218 Portuguese (ASCII).dic
   319904 total

So it's a little under double the number of words.
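
One possible reading of these numbers, offered as an assumption rather than something stated in the thread: `wc -w` counts whitespace-separated tokens, not entries, so a dictionary in the "word frequency" format contributes two tokens per line. That would fit 113218 = 2 x 56,609 for the Portuguese list and roughly 98,000 entries for the US English one. A small sketch of the difference between the two counts, using the sample entries from the format description elsewhere in this thread:

```python
def count_tokens_and_entries(text):
    """Return (whitespace-separated tokens, non-empty lines) of a dictionary."""
    lines = [line for line in text.splitlines() if line.strip()]
    tokens = sum(len(line.split()) for line in lines)
    return tokens, len(lines)


# Two entries in the "word frequency" format.
sample = "cafe 126\netage 98\n"

tokens, entries = count_tokens_and_entries(sample)
print(tokens, entries)  # 4 tokens (what wc -w would report), 2 entries
```

If that reading is right, `wc -l` would be the more direct way to count entries.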

Rui

-- 
Frink!
Today is Pungenday, the 31st day of The Aftermath in the YOLD 3174
+ No matter how much you do, you never do enough -- unknown
+ Whatever you do will be insignificant,
| but it is very important that you do it -- Gandhi
+ So let's do it...?



Re: Illume dictionary for Dutch (Nederlands)

2008-11-19 Thread The Rasterman
On Wed, 19 Nov 2008 23:25:22 +0100 Pander [EMAIL PROTECTED]
babbled:

 Hi all,
 
 Together with http://opentaal.org , I'm working on a special Illume
 dictionary for Dutch word completion. It will be available in the near
 future.
 
 Of course this particular word list is very long: it contains about
 250,000 words and has a typical long tail. Many words and compounds
 occur only seldom in everyday use.
 
 What would be a good cutoff point for the number of words, also in terms of
 performance?
 
 The Portuguese list contains 56,609 words. Is this workable? How many
 does the English contain?

english is about 98,000, but remember english has very few changes in words for
conjugation. i need to change the dict format to account for this and compress
better i think. i do need to make a different entered text - visible word
mapping tho. this covers blind qwerty entry for accented words. i.e.:

(german)
fass - Faß
brotchen - Brötchen

(french)
cafe - café
etage - étage
francais - Français

(japanese)
sakana - さかな | 魚 | 肴 | 坂な | 茶菓な | 阪な | 差かな | 左かな | 差かな  |
査かな | 鎖かな | サカナ | sakana
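
For the German and French examples, the typed form can be derived from the displayed word mechanically. A minimal sketch using Python's standard `unicodedata` module; the helper name `typed_key` is hypothetical, and this is not how Illume builds its dictionaries. It does not help for the Japanese case, where the romanised reading cannot be computed from the characters and would have to come from a separate reading table.

```python
import unicodedata


def typed_key(display_word):
    """Derive the plain-qwerty form a user would type for a displayed word."""
    # Split accented letters into base letter + combining mark (ö -> o + ¨).
    decomposed = unicodedata.normalize("NFKD", display_word)
    # Drop the combining marks, keeping only the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() lowercases and also expands ß to "ss".
    return base.casefold()


for word in ("Faß", "Brötchen", "café", "étage", "Français"):
    print(typed_key(word), "-", word)
```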

note that some languages can have 1 romanised input match multiple
(different) displays of that word (japanese is king at this. chinese, if
using pinyin, could likely be similar). right now the dict format doesn't allow
for this. sure, i can extend it with a list of displayed words. currently the
non-freq format is:

cafe
etage

with freq:

cafe 126
etage 98

i can add a display list:

cafe 126 cafe café
etage 98 étage

but the file will get bigger and bigger and get harder to auto-generate from
input data. right now i am unsure of the exact strategy to take... but i'd like
to cover as many languages as i can with 1 format and have minimal dict size
overhead etc.
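
To make the three line shapes concrete, here is a sketch of a reader that accepts all of them. The display list is only a proposal in this message, so the third shape is hypothetical, as are the function name and the defaults chosen here (frequency 0 when absent, the typed word itself when no display list is given). Fields are split on whitespace, so displayed words containing spaces are not covered.

```python
def parse_entry(line):
    """Parse 'word', 'word freq' or 'word freq display...' into a tuple."""
    fields = line.split()
    word = fields[0]
    # Assumption of this sketch: a missing frequency is treated as 0.
    freq = int(fields[1]) if len(fields) > 1 else 0
    # Assumption of this sketch: without a display list, show the word itself.
    displays = fields[2:] or [word]
    return word, freq, displays


for line in ("cafe", "cafe 126", "cafe 126 cafe café", "etage 98 étage"):
    print(parse_entry(line))
```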

-- 
- Codito, ergo sum - I code, therefore I am --
The Rasterman (Carsten Haitzler)[EMAIL PROTECTED]




Re: Illume dictionary for Dutch (Nederlands)

2008-11-19 Thread Marco Trevisan (Treviño)
Pander wrote:
 Of course this particular word list is very long: it contains about
 250,000 words and has a typical long tail. Many words and compounds
 occur only seldom in everyday use.
 
 What would be a good cutoff point for the number of words, also in terms of
 performance?
 
 The Portuguese list contains 56,609 words. Is this workable? How many
 does the English contain?

The Italian one could also count 500'000 words (to keep it short), but I
only get a well-working dictionary with a smaller one (about 150'000 words,
which I selected by their Google popularity).
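
Cutting a list down by popularity could look roughly like the sketch below. It assumes each word already has some popularity score (search hits, corpus counts or similar); the function name, the sample words and the scores are all made up for illustration.

```python
def truncate_by_popularity(entries, limit):
    """Keep the `limit` most popular (word, score) pairs, most popular first."""
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return ranked[:limit]


# Hypothetical (word, popularity) pairs.
entries = [("cafe", 126), ("etage", 98), ("zeldzaam", 1), ("fiets", 300)]

for word, score in truncate_by_popularity(entries, 2):
    print(word, score)
```

The right value for `limit` is exactly the open question in this thread; the sketch only shows the mechanics.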

Btw I've written more complete posts about this on the list...

-- 
Treviño's World - Life and Linux
http://www.3v1n0.net/




Re: Illume dictionary for Dutch (Nederlands)

2008-11-19 Thread Rui Miguel Silva Seabra
On Thu, Nov 20, 2008 at 03:02:41AM +0100, Marco Trevisan (Treviño) wrote:
 Pander wrote:
  Of course this particular word list is very long: it contains about
  250,000 words and has a typical long tail. Many words and compounds
  occur only seldom in everyday use.
  
  What would be a good cutoff point for the number of words, also in terms of
  performance?
  
  The Portuguese list contains 56,609 words. Is this workable? How many
  does the English contain?
 
 The Italian one could also count 500'000 words (to keep it short), but I
 only get a well-working dictionary with a smaller one (about 150'000 words,
 which I selected by their Google popularity).
 
 Btw I've written more complete posts about this on the list...

Well, since my basis was a million words taken from the most-printed
daily newspaper in Portugal (I didn't count, but I still removed a lot
of non-words like numbers, etc.), already with frequency data, my
job was so much easier... :)
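
For a corpus that does not come with frequency data, the counting step might be sketched as follows. This is an illustration only, not the procedure used for the Portuguese list: the regular expression keeps runs of letters and thereby drops numbers, and the sample sentence is made up.

```python
import re
from collections import Counter

# Runs of letters only: digits, underscores and punctuation are excluded.
WORD = re.compile(r"[^\W\d_]+")


def word_frequencies(text):
    """Count lowercased words in a text, ignoring numbers and punctuation."""
    return Counter(match.group().lower() for match in WORD.finditer(text))


# Hypothetical two-sentence corpus.
corpus = "O café abriu em 2008. O café fechou."
freqs = word_frequencies(corpus)

# Emit entries in the "word frequency" shape, most frequent first.
for word, count in freqs.most_common():
    print(word, count)
```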

As for writing SMS/text messages... I haven't yet found a word that
wasn't there (in fact my problem is that it so often is the first of
several matches so I have to use the menu on the left) but I must
confess to not being one of those whose primary use of the phone is
SMS/text!

Rui

-- 
Frink!
Today is Prickle-Prickle, the 32nd day of The Aftermath in the YOLD 3174
+ No matter how much you do, you never do enough -- unknown
+ Whatever you do will be insignificant,
| but it is very important that you do it -- Gandhi
+ So let's do it...?
