Thomas wrote:

> Thus I think the kind of thing we really need here is a language guessing
> component!

Is there any reason why it cannot simply use the languages defined in the numbering/character/paragraph/cell styles? At most, that would be nine languages. By using Unicode sub-ranges, that could be reduced to a maximum of three. I realize that this would eliminate spellchecking when the language is set to "none", "user-1", or "user-2". [The "check all languages" option in OOo currently manages to select the wrong dictionary when those settings are used.]

[On a semi-related note, am I the only person who finds it amusing that I can have a user interface in a language for which I can set neither the locale data nor the language, nor include my spellchecker for that language? Especially since locale data has been submitted to issuezilla.]

> would be nice to have checks for all the ones where spellchecker
> dictionaries available plus the major ones where we are still missing
> spellcheckers

Having a spell checker decide that I am writing in English, or Esperanto, because I use "Tito" ten times in a paragraph is not going to be acceptable behaviour when the passage is in Spanish.

> Russian

Cyrillic writing system. Used for at least a dozen, and probably twice as many, languages.

> Arabian

Used for at least a dozen, and probably twice as many, languages, including one that Sun more or less implied will never be supported by OOo.

> Hebrew

Used for four languages.

> Japanese

Four (or five, depending upon how you count) different writing systems. [I think that eliminating the "writing system" used by tokonono would be acceptable.]

> Even scanning for small but significant and most common words might be a good
> idea.

"die" is a definite article in Afrikaans, a synonym for deceased in English, and "tie" in a third language. "is" is a fairly common spelling error made by people learning German and Dutch as a second language.

> Thus having a key set of such words might be useful as well.

A list of the 50 most common words for the 100 most used languages, when combined with the letter frequency counts of those languages, _might_ be accurate. But it is also likely to alienate the users of "minority languages", unless those languages can also be added to that list as part of the l10n work in preparing OOo for the language. Getting word and letter frequency counts of the languages, as they are written/used, is a fairly trivial function to add to Kevin's language crawler.

> As for guessing the language by a single character there are a number of
> character that could be associated to a single or smalle set of language by
> their code point in the Unicode set.

That simply serves to reduce the number of languages from 7 000+ to anywhere from a thousand or so [Latin-1 sub-range] down to two [Indus Valley language sub-ranges]. But this only works if the text doesn't contain material in a second language. I've got a couple of Bibles that intermix Hebrew and Greek in the English text. [For Biblical studies, and several other fields, intermixing two or three languages is a common practice.]

xan

jonathon
-- 
Does your Office Suite conform to ISO Standards?
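
P.S. For what it is worth, here is a rough sketch of the word and letter frequency counting I have in mind. It is only an illustration: the names `frequency_profile` and `score` are made up for this mail and are not part of Kevin's crawler or of OOo, and a real component would need proper tokenisation per writing system.

```python
import re
from collections import Counter

# Hypothetical helper names; nothing here is an existing OOo or crawler API.
WORD_RE = re.compile(r"[^\W\d_]+")  # runs of Unicode letters


def frequency_profile(text, top_words=50):
    """Return (letter frequencies, most common words) for a text sample."""
    lowered = text.lower()
    letters = [ch for ch in lowered if ch.isalpha()]
    total = len(letters)
    letter_freq = {ch: n / total for ch, n in Counter(letters).items()} if total else {}
    words = WORD_RE.findall(lowered)
    common = [w for w, _ in Counter(words).most_common(top_words)]
    return letter_freq, common


def score(text, common_words):
    """Fraction of words in the text that appear in a language's common-word set."""
    words = WORD_RE.findall(text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in common_words) / len(words)
```

Note that `score` on its own has exactly the problem described above: a word such as "die" sits in the common-word list of more than one language, so a short passage can score equally well against several of them. That is why the counts would have to be combined with the letter frequencies and with whatever language the styles already declare.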
