Thomas wrote:

> Thus I think the kind of thing we really need here is a language guessing 
> component!

Is there any reason why it can not simply use the languages defined in
the numbering/character/paragraph/cell styles?

At most, that would be nine languages.  Using Unicode sub-ranges could
reduce that to a maximum of three.  I realize that this would
eliminate spellchecking when the language is set to either "none",
"user-1", or "user-2".  [The "check all languages" option in OOo
currently manages to select the wrong dictionary when those settings
are used.]
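The narrowing step could look something like the sketch below: walk the
paragraph, map each letter's code point to a coarse script, and only
consider dictionaries for the scripts actually present.  The range table
is a small illustrative subset of the Unicode blocks, not a complete one.

```python
# Minimal sketch: narrow the spell-check candidates by Unicode script
# range.  The ranges below are an illustrative subset, not a full table.
SCRIPT_RANGES = [
    (0x0000, 0x024F, "Latin"),   # Basic Latin + Latin-1 + extensions
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0590, 0x05FF, "Hebrew"),
    (0x0600, 0x06FF, "Arabic"),
    (0x3040, 0x30FF, "Kana"),    # Hiragana + Katakana
    (0x4E00, 0x9FFF, "Han"),
]

def scripts_in(text):
    """Return the set of scripts whose ranges cover letters in `text`."""
    found = set()
    for ch in text:
        if not ch.isalpha():        # skip spaces, digits, punctuation
            continue
        cp = ord(ch)
        for lo, hi, name in SCRIPT_RANGES:
            if lo <= cp <= hi:
                found.add(name)
                break
    return found
```

This only narrows the field, of course: knowing a run is Cyrillic still
leaves a dozen or more candidate languages.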

[On a semi-related note, am I the only person who finds it amusing
that I can have a user interface in a language for which I can set
neither the locale data nor the language, nor include my spellchecker
for that language?  Especially since locale data has been submitted to
issuezilla.]

>    would be nice to have checks for all the ones where spellchecker 
> dictionaries available plus the major ones where we are still missing 
> spellcheckers 

Having a spell checker decide that I am writing in English or
Esperanto, because I use "Tito" ten times in a paragraph, is not going
to be acceptable behaviour when the passage is in Spanish.

>      Russian

Cyrillic Writing System.  Used for at least a dozen, and probably
twice as many languages.

>      Arabian

Used for at least a dozen, and probably twice as many languages ---
including one that Sun more or less implied will never be supported by
OOo.

>      Hebrew

Used for four languages.

>      Japanese

Four (or five, depending upon how you count) different writing
systems.  I think that eliminating the "writing system" used by
tokonono would be acceptable.

> Even scanning for small but significant and most common words might be a good 
> idea. 

"die" is a definite article in Afrikaans, a synonym for deceased in
English, and "tie" in a third language.

"is" is a fairly common spelling error made by people learning German
and Dutch as a second language.

> Thus having a key set of such words might be useful as well.

A list of the 50 most common words, for the 100 most used languages,
when combined with the letter frequency counts of those languages
_might_ be accurate.
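A rough sketch of that combination follows.  The word lists and letter
frequencies here are made-up toy values, not real corpus data; note how
"is" appears in both profiles, which is exactly the overlap problem
described above.

```python
# Toy sketch: guess a language from short stop-word lists plus letter
# frequencies.  The profiles below are invented illustrative numbers.
PROFILES = {
    "en": {"words": {"the", "and", "is", "of", "to"},
           "letters": {"e": 12.7, "t": 9.1, "a": 8.2}},
    "af": {"words": {"die", "en", "is", "van", "het"},
           "letters": {"e": 17.0, "n": 10.0, "a": 7.5}},
}

def score(text, profile):
    """Stop-word hits, minus a penalty for letter-frequency mismatch."""
    tokens = text.lower().split()
    word_hits = sum(t in profile["words"] for t in tokens)
    letters = [c for c in text.lower() if c.isalpha()]
    penalty = 0.0
    if letters:
        for letter, expected in profile["letters"].items():
            observed = 100.0 * letters.count(letter) / len(letters)
            penalty += abs(observed - expected)
    return word_hits - penalty / 100.0

def guess(text):
    """Return the profile key with the highest combined score."""
    return max(PROFILES, key=lambda lang: score(text, PROFILES[lang]))
```

Extending the profiles per language during l10n work, as suggested
below, would be the obvious way to avoid privileging the big languages.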

But it is also likely to alienate the users of "minority languages",
unless those languages can also be added to that list as part of the
l10n work in preparing OOo for them.

Getting word and letter frequency counts of the languages, as they are
written/used is a fairly trivial function to add to Kevin's language
crawler.

> As for guessing the language by a single character there are a number of 
> character that could be associated to a single or smalle set of language by 
> their code point in the Unicode set.

That simply serves to reduce the number of candidate languages: from
7,000+ down to a thousand or so [Latin-1 sub-range], or down to two
[Indus Valley script sub-ranges].

But this only works if the text doesn't contain material in a second language.

I've got a couple of Bibles that intermix Hebrew and Greek in the
English text.  [For Biblical studies, and several other fields,
intermixing two or three languages is a common practice.]
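One way to cope with intermixed text is to split it into same-script
runs before guessing, so a Hebrew or Greek quotation inside English
prose gets its own dictionary lookup.  A sketch, again with an
illustrative subset of the Unicode ranges:

```python
# Sketch: split mixed text into same-script runs.  Range table is a
# small illustrative subset of the Unicode blocks.
import itertools

RANGES = [
    (0x0041, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0590, 0x05FF, "Hebrew"),
]

def script_of(ch):
    cp = ord(ch)
    for lo, hi, name in RANGES:
        if lo <= cp <= hi:
            return name
    return None  # spaces, digits, punctuation, unlisted scripts

def script_runs(text):
    """Return (script, substring) pairs for each same-script run;
    spaces and punctuation break runs, so runs are roughly words."""
    out = []
    for script, group in itertools.groupby(text, key=script_of):
        if script is not None:
            out.append((script, "".join(group)))
    return out
```

Each run could then be handed to the guesser separately, instead of
forcing one language onto the whole paragraph.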

xan

jonathon
-- 
Does your Office Suite conform to ISO Standards?
