Hi all,
As many of you have probably noticed, the language suggestions in the
context menu of a misspelled word have been cut down to about four at
most, and in particular the installed dictionaries are no longer used
for that purpose.
The old algorithm iterated over the list of available dictionaries
(i.e. languages for the spellchecker), and if the word was found to
be correct in one of them, that dictionary's language was suggested.
Because of the memory consumption and the amount of time this took on
some machines, that algorithm was dropped.
What we would like to have is a solution that does not just change
the list of languages the spellchecker checks for this purpose:
adding languages to that list would bring back the above problem, and
changing the list would only shift the problem, not solve it.
Thus I think the kind of thing we really need here is a language
guessing component! Especially one that is small in memory usage and
fast as well. (At least compared to using a number of spellcheckers
and/or dictionaries that would otherwise not be instantiated at all!)
So, in the hope that someone feels attracted to this task, I'll list
some requirements and ideas I think might be useful. Of course, all
of them are only suggestions from my side:
- That component must not use the spellchecker or thesaurus to do
this. (There is too much overhead involved when using those.)
- Since we would like to use this component for all languages, not
only Western languages, the implementation should use Unicode instead
of any specific character set such as ISO-8859-1, ISO-8859-5,
MS 1252, etc. The API should of course also use Unicode.
- The return value is likely to be a sequence of
com.sun.star.lang.Locale objects, where each entry specifies a
possible language, with languages of higher probability listed first.
Of course a single language (if correct) would be best.
- It will likely never be necessary to distinguish between variants of
the same language. For example, it is not a high priority to see
whether a text is in US English, UK English, or Canadian English. For
the time being it is perfectly sufficient to check for only one of
those.
- A special return value should be "no language", for cases where it
is definitely certain that the text does not belong to any natural
language (e.g. ASCII control characters, mathematical symbols, or
graphical symbols).
Also something like "don't know" may be a good idea. This is likely
the reasonable return value when the result list would otherwise be
large without any real high scorer.
Maybe a list of constant values for those types would be nice, since
it could be extended later if the need arises. For example, we could
introduce types like "mathematical symbol" or "punctuation", though
that will likely never happen.
- As for the interface, I think there should be two or three
functions:
a) One that takes only a single character and tries to make a guess
for it.
b) Since for most characters this is likely to be impossible, there
should also be a function that takes a larger text portion (most
likely a sentence) and tries to identify its language.
c) I'm not sure whether a single word is too small a unit for
language guessing. Probably it won't work that well, but for words
whose characters could already be handled by a) it will definitely
work.
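To make the proposed interface a bit more concrete, here is a minimal
Python sketch of the two or three functions described above. All
names and the sentinel values are my own invention for illustration;
the real component would presumably return a sequence of
com.sun.star.lang.Locale objects via UNO. (The sentinels reuse the
real ISO 639 codes "zxx" for "no linguistic content" and "und" for
"undetermined".)

```python
# Hypothetical sketch of the proposed guesser interface; all names
# here are invented for illustration, not part of any existing API.

NO_LANGUAGE = "zxx"   # definitely not any natural language
DONT_KNOW = "und"     # no candidate scored clearly above the rest

class LanguageGuesser:
    def guess_by_character(self, ch: str) -> list[str]:
        """a) Guess from a single character (works only for
        script-specific characters like Hebrew or Greek letters)."""
        raise NotImplementedError

    def guess_by_text(self, text: str) -> list[str]:
        """b) Guess from a larger portion, e.g. a sentence.
        Returns candidate language tags, most probable first."""
        raise NotImplementedError

    def guess_by_word(self, word: str) -> list[str]:
        """c) Guess from a single word; prefix/suffix n-grams and
        script-specific characters would do most of the work here."""
        raise NotImplementedError
```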
- As for the languages to be checked: in the far, far away future it
would be nice to have checks for all languages where spellchecker
dictionaries are available, plus the major ones where we are still
missing spellcheckers, for example the Asian languages.
I will be most happy if at some point the list of supported languages
will be something like this:
English
Portuguese
Catalan
Czech
Danish
Dutch
Finnish
French
Greek
German
Hungarian
Italian
Polish
Russian
Slovak
Spanish
Swedish
Turkish
Arabic
Hebrew
Thai
Chinese (simplified/traditional)
Korean
Japanese
Well for starters I would suggest something like
English
French
German
Spanish
Italian
Portuguese
Russian
Greek
Hebrew
Arabic
Especially for the latter four I hope the implementation will be
rather easy, since all of them use characters that are specific to
their language (at least AFAIK), which is why I readily added them.
For the same reason some(!!) Korean or Japanese text portions should
be easy as well, and maybe the same applies to Thai.
As for how to implement this component, I think a possible approach
might be to combine identifying the language from single characters
with statistical means for measuring the occurrence (probability) of
character sequences in a given text. Of course, for the latter, the
longer the text, the better the chances of a good result.
Even scanning for small but significant and very common words might
be a good idea. For example, in English words like "the", "is" and
"you" are fairly common and do not occur in French or German. Thus
having a key set of such words might be useful as well.
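As a rough illustration of the common-word idea, here is a minimal
Python sketch. The tiny marker-word sets are illustrative samples
only, not complete lists, and the scoring is deliberately naive:

```python
# Sketch of the "significant common words" idea. The word sets below
# are tiny illustrative samples, not real stop-word lists.
COMMON_WORDS = {
    "en": {"the", "is", "you", "and", "of"},
    "de": {"der", "die", "das", "und", "ist"},
    "fr": {"le", "la", "les", "et", "est"},
}

def guess_by_common_words(text: str) -> list[str]:
    """Score each language by how many of its marker words occur in
    the text; return candidates, most probable first."""
    words = set(text.lower().split())
    scores = {lang: len(words & markers)
              for lang, markers in COMMON_WORDS.items()}
    return sorted((lang for lang, s in scores.items() if s > 0),
                  key=lambda lang: -scores[lang])
```

For example, `guess_by_common_words("der Hund und die Katze")` scores
three hits for German and none for the others.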
I'll go into somewhat more detail now, but since I'm not really
familiar with this topic, please don't pin me down on any detail I
write here. Actually I might be fairly far off from the 'real thing'.
As for guessing the language from a single character: there are a
number of characters that can be associated with a single language,
or a small set of languages, by their code point in Unicode.
For example, Hebrew has 22 letters (plus a few final forms) and no
uppercase or lowercase distinction at all, and those characters all
sit in a single, rather short Unicode range. Thus a single 'if'
statement (or a small set of statements) like
    if (range_start <= character and character <= range_end) then
        language = Hebrew
might already work.
The same should be true for Arabic and maybe Thai, and for Greek and
Russian the same simple approach may work as well.
It is more complicated for Korean and Japanese, since both languages
have native characters while also using Chinese characters. But for
Hangul (the native Korean characters) and Katakana/Hiragana (the two
native types of Japanese characters) it should be similar.
The problem will be with the Chinese characters. I'm quite unsure if
there is a solution for this on a single character basis.
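The single-character idea above can be sketched with a few Unicode
block ranges (taken from the Unicode charts). Note that a script is
not the same as a language, so e.g. Cyrillic only narrows the guess
down to Russian, Bulgarian, Serbian, etc.; the function name and the
exact set of ranges here are my own choices for illustration:

```python
# Single-character script detection via Unicode block ranges.
# Ranges are from the Unicode code charts; this is an illustrative
# subset, not a complete table.
SCRIPT_RANGES = [
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),   # Japanese
    (0x30A0, 0x30FF, "Katakana"),   # Japanese
    (0x4E00, 0x9FFF, "CJK"),        # shared by Chinese/Japanese/Korean
    (0xAC00, 0xD7AF, "Hangul"),     # Korean
]

def script_of(ch: str):
    """Return the script name for a character, or None if the
    character's code point falls outside the ranges above."""
    cp = ord(ch)
    for start, end, name in SCRIPT_RANGES:
        if start <= cp <= end:
            return name
    return None
```

As discussed above, a Hebrew or Thai letter pins the language down
immediately, while a CJK character leaves the guess ambiguous.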
For other Western characters it will be possible to include or
exclude some sets of languages based on the character. For example,
the "ä" (in German called a-Umlaut) may occur in German, Finnish and
Swedish but not in French, English or Italian. And for many other
characters, like the slashed o or the accented e, it will be possible
to make similar include/exclude lists.
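The include/exclude idea could be sketched as an intersection over
per-character candidate sets. The mapping below is a deliberately
tiny, illustrative sample; a real table would cover far more
characters and languages:

```python
# Sketch of include/exclude lists for Latin characters with
# diacritics. The candidate sets are small illustrative samples.
CHAR_CANDIDATES = {
    "\u00e4": {"de", "fi", "sv"},   # ä: German, Finnish, Swedish
    "\u00f8": {"da", "no"},         # ø: Danish, Norwegian
    "\u00df": {"de"},               # ß: German only
    "\u00f1": {"es"},               # ñ: Spanish
}

def narrow_candidates(text: str, candidates: set[str]) -> set[str]:
    """Intersect the candidate languages with the set allowed by
    each marker character that occurs in the text."""
    for ch in text:
        if ch in CHAR_CANDIDATES:
            candidates = candidates & CHAR_CANDIDATES[ch]
    return candidates
```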
Starting references to read for this approach might be:
www.unicode.org/charts (thanks to HDU for this link!)
and
www.threeweb.ad.jp/logos/#toc
The latter often lists the character sets used for those languages in
the "xyz by Computer" section, where "xyz" specifies a language.
Usually, though, one has to look up manually which Unicode code
points those characters map to.
Also
www.microsoft.com/typography/unicode/cscp.htm
already lists a larger number of character sets in the "Codepage
reference" section and the links usually already list the Unicode code
points as well.
Generally, searching the net for keywords like "language guessing",
"language unicode ranges", "language guessing by character" or
"language guessing by unicode" will provide some useful links.
As for detecting the language of a word or sentence, one thing one
might do is break the text down into single words, build n-grams (for
example tri-grams) for all of them, count them, and assign them
probabilities of occurrence in the text.
If one then has a reference table of tri-gram probabilities in
different languages, it should be possible to identify the languages
that best match the probabilities we calculated.
I probably should give an example for tri-grams now.
AFAIR the list of tri-grams for a given word is obtained by taking the
first three characters and then always shifting the 'focus' one
character further.
An example:
The trigrams for the single word "algorithm" would be:
alg, lgo, gor, ori, rit, ith and thm
and since there are no duplicates they all have the same probability.
Since prefixes and suffixes are likely to be quite language-specific
(at least for Western languages), having a look at the n-grams at the
start and end of a word should be especially useful. It is n-grams
here, not tri-grams, because prefixes and suffixes are not always
three characters long. ^_-
Looking at the prefix and suffix is likely to be most useful for
single words, since such a text is so short that we can't get good
results just from the tri-gram probabilities: the number of tri-grams
obtained from a single word is quite small.
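The tri-gram extraction and profile matching described above might be
sketched like this. The similarity measure is a deliberately crude
overlap score of my own choosing; a real implementation might use
cosine similarity or rank-order statistics instead:

```python
from collections import Counter

def trigrams(word: str) -> list[str]:
    """All three-character windows of a word, obtained by shifting
    the 'focus' one character at a time:
    "algorithm" -> alg, lgo, gor, ori, rit, ith, thm."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

def profile(text: str) -> Counter:
    """Relative tri-gram frequencies of a whole text."""
    counts = Counter(t for w in text.lower().split()
                     for t in trigrams(w))
    total = sum(counts.values()) or 1
    return Counter({t: n / total for t, n in counts.items()})

def similarity(p: Counter, q: Counter) -> float:
    """Crude overlap score between two tri-gram profiles: sum of the
    shared probability mass. 1.0 means identical distributions."""
    return sum(min(p[t], q[t]) for t in p)
```

Given precomputed reference profiles for each language, the guesser
would return the languages whose profiles score highest against the
profile of the input text.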
As for breaking the text up into words, the BreakIterator from i18n
can be used. It may not be equally useful for all languages, but it
can handle all of them and already exists. So unless one is unhappy
with its results, there is no need to implement this functionality
oneself.
A good reading for this kind of approach might be
www2.iicm.edu/cguetl/education/projects/mrinn/seminar/index.html
Unfortunately it is in German.
But I think if you search the web you'll find something similar in
English eventually.
Useful keywords should be "language guessing by word" or
"language guessing by sentence".
Well, that's it. I hope I got someone interested in the matter.
I'm not in the office again before Monday, thus I cannot answer any
questions before then.
Regards,
Thomas