Hi all,
As many of you have probably noticed, the language suggestions in the
context menu of a misspelled word have been cut down to about four at
most, and in particular the installed dictionaries are no longer used
for that purpose.
The old algorithm iterated over the list of available dictionaries
(i.e. languages for the spellchecker), and if the word was found to
be correct in one of them, that dictionary's language was suggested.
Because of the memory consumption and the amount of time this took on
some machines, that algorithm was dropped.
What we would like to have is a solution that does not just change
the list of languages the spellchecker checks for this purpose:
adding languages to that list would bring back the above problem, and
changing the list would only shift the problem, not solve it.
Thus I think the kind of thing we really need here is a language
guessing component! Especially one that is small in memory usage and
fast as well. (At least compared to using a number of spellcheckers
and/or dictionaries that would otherwise not be instantiated at all!)
So, in the hope that someone feels attracted to this task, I'll list
some requirements and ideas I think might be useful. Of course, all
of them are only suggestions from my side:
- That component must not use the spellchecker or thesaurus to do
this. (There is too much overhead involved when using those.)
- Since we would like to use this component for all languages, not
only Western languages, the implementation should use Unicode instead
of any specific character set such as ISO-8859-1, ISO-8859-5,
MS 1252, etc. The API should of course also use Unicode.
- The return value is likely to be a sequence of
com.sun.star.lang.Locale objects, where each entry specifies a
possible language, with languages of higher probability listed first.
Of course a single language (if correct) would be best.
- It will likely never be necessary to distinguish between variants of
the same language. For example, it is not a high priority to see
whether a text is in US English, UK English, or Canadian English. For
the time being it is perfectly sufficient to check for only one of
those.
- A special return value should be "no language", for cases where it
is definitely certain that the text does not belong to any natural
language (e.g. ASCII control characters, mathematical symbols, or
graphical symbols).
Also something like "don't know" may be a good idea. This is likely
the reasonable return value when the result list would otherwise be
large without any real high scorer.
Maybe a list of constant values for those types would be nice, since
it could be extended later if the need arises. For example, we could
introduce types like "mathematical symbol" or "punctuation", though
that will likely never happen.
- As for the interface, I think there should be two or three
functions:
a) One that takes only a single character and tries to make a guess
for it.
b) Since for most characters this is likely to be impossible, there
should also be a function that takes a larger text portion (most
likely a sentence) and tries to identify its language.
c) I'm not sure whether a single word is too small a unit for
language guessing. Probably it won't work that well, but for words
whose characters could already be handled by a) it will definitely
work.
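To make the proposed interface a bit more concrete, here is a minimal
Python sketch of the two or three functions described above. All
names and the sentinel values are my own invention for illustration;
the real component would presumably return a sequence of
com.sun.star.lang.Locale objects via UNO. (The sentinels reuse the
real ISO 639 codes "zxx" for "no linguistic content" and "und" for
"undetermined".)

```python
# Hypothetical sketch of the proposed guesser interface; all names
# here are invented for illustration, not part of any existing API.

NO_LANGUAGE = "zxx"   # definitely not any natural language
DONT_KNOW = "und"     # no candidate scored clearly above the rest

class LanguageGuesser:
    def guess_by_character(self, ch: str) -> list[str]:
        """a) Guess from a single character (works only for
        script-specific characters like Hebrew or Greek letters)."""
        raise NotImplementedError

    def guess_by_text(self, text: str) -> list[str]:
        """b) Guess from a larger portion, e.g. a sentence.
        Returns candidate language tags, most probable first."""
        raise NotImplementedError

    def guess_by_word(self, word: str) -> list[str]:
        """c) Guess from a single word; prefix/suffix n-grams and
        script-specific characters would do most of the work here."""
        raise NotImplementedError
```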
- As for the languages to be checked: in the far, far away future it
would be nice to have checks for all languages where spellchecker
dictionaries are available, plus the major ones where we are still
missing spellcheckers, for example the Asian languages.
I will be most happy if at some point the list of supported languages
will be something like this:
English
Portuguese
Catalan
Czech
Danish
Dutch
Finnish
French
Greek
German
Hungarian
Italian
Polish
Russian
Slovak
Spanish
Swedish
Turkish
Arabic
Hebrew
Thai
Chinese (simplified/traditional)
Korean
Japanese
Well for starters I would suggest something like
English
French
German
Spanish
Italian
Portuguese
Russian
Greek
Hebrew
Arabic
Especially for the latter four I hope the implementation will be
rather easy, since all of them use characters that are specific to
their language (at least AFAIK), which is why I readily added them.
For the same reason some(!!) Korean or Japanese text portions should
be easy as well, and maybe the same applies to Thai.
As for how to implement this component, I think a possible approach
might be to combine identifying the language from single characters
with statistical means for measuring the occurrence (probability) of
character sequences in a given text. Of course, for the latter, the
longer the text, the better the chances of a good result.
Even scanning for small but significant and very common words might
be a good idea. For example, in English words like "the", "is" and
"you" are fairly common and do not occur in French or German. Thus
having a key set of such words might be useful as well.
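As a rough illustration of the common-word idea, here is a minimal
Python sketch. The tiny marker-word sets are illustrative samples
only, not complete lists, and the scoring is deliberately naive:

```python
# Sketch of the "significant common words" idea. The word sets below
# are tiny illustrative samples, not real stop-word lists.
COMMON_WORDS = {
    "en": {"the", "is", "you", "and", "of"},
    "de": {"der", "die", "das", "und", "ist"},
    "fr": {"le", "la", "les", "et", "est"},
}

def guess_by_common_words(text: str) -> list[str]:
    """Score each language by how many of its marker words occur in
    the text; return candidates, most probable first."""
    words = set(text.lower().split())
    scores = {lang: len(words & markers)
              for lang, markers in COMMON_WORDS.items()}
    return sorted((lang for lang, s in scores.items() if s > 0),
                  key=lambda lang: -scores[lang])
```

For example, `guess_by_common_words("der Hund und die Katze")` scores
three hits for German and none for the others.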
I'll go into somewhat more detail now, but since I'm not really
familiar with this topic, please don't pin me down on any detail I
write here. Actually I might be fairly far off from the 'real thing'.
As for guessing the language from a single character: there are a
number of characters that can be associated with a single language,
or a small set of languages, by their code point in Unicode.
For example, Hebrew has 22 letters (plus a few final forms) and no
uppercase or lowercase distinction at all, and those characters all
sit in a single, rather short Unicode range. Thus a single 'if'
statement (or a small set of statements) like
    if (range_start <= character and character <= range_end) then
        language = Hebrew
might already work.
The same should be true for Arabic and maybe Thai, and for Greek and
Russian the same simple approach may work as well.
It is more complicated for Korean and Japanese, since both languages
have native characters while also using Chinese characters. But for
Hangul (the native Korean characters) and Katakana/Hiragana (the two
native types of Japanese characters) it should be similar.
The problem will be with the Chinese characters. I'm quite unsure if
there is a solution for this on a single character basis.
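The single-character idea above can be sketched with a few Unicode
block ranges (taken from the Unicode charts). Note that a script is
not the same as a language, so e.g. Cyrillic only narrows the guess
down to Russian, Bulgarian, Serbian, etc.; the function name and the
exact set of ranges here are my own choices for illustration:

```python
# Single-character script detection via Unicode block ranges.
# Ranges are from the Unicode code charts; this is an illustrative
# subset, not a complete table.
SCRIPT_RANGES = [
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),   # Japanese
    (0x30A0, 0x30FF, "Katakana"),   # Japanese
    (0x4E00, 0x9FFF, "CJK"),        # shared by Chinese/Japanese/Korean
    (0xAC00, 0xD7AF, "Hangul"),     # Korean
]

def script_of(ch: str):
    """Return the script name for a character, or None if the
    character's code point falls outside the ranges above."""
    cp = ord(ch)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    return None
```

As discussed above, a Hebrew or Thai letter pins the language down
immediately, while a CJK character leaves the guess ambiguous.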
For other Western characters it will be possible to include or
exclude some sets of languages based on the character. For example,
the "ä" (in German called a-Umlaut) may occur in German, Finnish and
Swedish but not in French, English or Italian. And for many other
characters, like the slashed o or the accented e, it will be possible
to make similar include/exclude lists.
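The include/exclude idea could be sketched as an intersection over
per-character candidate sets. The mapping below is a deliberately
tiny, illustrative sample; a real table would cover far more
characters and languages:

```python
# Sketch of include/exclude lists for Latin characters with
# diacritics. The candidate sets are small illustrative samples.
CHAR_CANDIDATES = {
    "\u00e4": {"de", "fi", "sv"},   # ä: German, Finnish, Swedish
    "\u00f8": {"da", "no"},         # ø: Danish, Norwegian
    "\u00df": {"de"},               # ß: German only
    "\u00f1": {"es"},               # ñ: Spanish
}

def narrow_candidates(text: str, candidates: set[str]) -> set[str]:
    """Intersect the candidate languages with the set allowed by
    each marker character that occurs in the text."""
    for ch in text:
        if ch in CHAR_CANDIDATES:
            candidates = candidates & CHAR_CANDIDATES[ch]
    return candidates
```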
Starting references to read for this approach might be:
www.unicode.org/charts (thanks to HDU for this link!)
and
www.threeweb.ad.jp/logos/#toc
The latter often lists the character sets used for those languages in
the "xyz by Computer" section, where "xyz" specifies a language.
Usually, though, one has to look up manually which Unicode code
points those characters map to.
Also
www.microsoft.com/typography/unicode/cscp.htm
already lists a larger number of character sets in the "Codepage
reference" section and the links usually already list the Unicode code
points as well.
Generally, searching the net for keywords like "language guessing",
"language unicode ranges", "language guessing by character" or
"language guessing by unicode" will provide some useful links.
As for detecting the language of a word or sentence, one thing one
might do is break the text down into single words, build n-grams (for
example tri-grams) for all of them, count them, and assign them
probabilities of occurrence in the text.
If one then has a reference table of tri-gram probabilities in
different languages, it should be possible to identify the languages
that best match the probabilities we calculated.
I probably should give an example for tri-grams now.
AFAIR the list of tri-grams for a given word is obtained by taking the
first three characters and then always shifting the 'focus' one
character further.
An example:
The trigrams for the single word "algorithm" would be:
alg, lgo, gor, ori, rit, ith and thm
and since there are no duplicates they all have the same probability.
Since prefixes and suffixes are likely to be quite language-specific
(at least for Western languages), having a look at the n-grams at the
start and end of a word should be especially useful. It is n-grams
here, not tri-grams, because prefixes and suffixes are not always
three characters long. ^_-
Looking at the prefix and suffix is likely to be most useful for
single words, since such a text is so short that we can't get good
results just from the tri-gram probabilities: the number of tri-grams
obtained from a single word is quite small.
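The tri-gram extraction and profile matching described above might be
sketched like this. The similarity measure is a deliberately crude
overlap score of my own choosing; a real implementation might use
cosine similarity or rank-order statistics instead:

```python
from collections import Counter

def trigrams(word: str) -> list[str]:
    """All three-character windows of a word, obtained by shifting
    the 'focus' one character at a time:
    "algorithm" -> alg, lgo, gor, ori, rit, ith, thm."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def profile(text: str) -> Counter:
    """Relative tri-gram frequencies of a whole text."""
    counts = Counter(t for w in text.lower().split()
                     for t in trigrams(w))
    total = sum(counts.values()) or 1
    return Counter({t: n / total for t, n in counts.items()})

def similarity(p: Counter, q: Counter) -> float:
    """Crude overlap score between two tri-gram profiles: sum of the
    shared probability mass. 1.0 means identical distributions."""
    return sum(min(p[t], q[t]) for t in p)
```

Given precomputed reference profiles for each language, the guesser
would return the languages whose profiles score highest against the
profile of the input text.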
As for breaking the text up into words, the BreakIterator from i18n
can be used. It may not be equally useful for all languages, but it
can handle all of them and already exists. So unless one is unhappy
with its results, there is no need to implement this functionality
oneself.
A good reading for this kind of approach might be
www2.iicm.edu/cguetl/education/projects/mrinn/seminar/index.html
Unfortunately it is in German.
But I think if you search the web you'll find something similar in
English eventually.
Useful keywords should be "language guessing by word" or
"language guessing by sentence".
Well, that's it. I hope I got someone interested in the matter.
I'm not in the office again before Monday, thus I cannot answer any
questions before then.
Regards,
Thomas