ge wrote:
Javier Sola wrote:
The only good solution that I see is to use dictionary-based line
breaking, and also a spellchecker, but this takes some work with ICU and
with OpenOffice, as well as very good word lists.

For dictionary-based breaking, Tsheng must be reclassified as non-boundary.
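
To make the idea concrete, here is a minimal sketch of dictionary-based word grouping, assuming the text has already been cut into syllables that each keep their trailing end-of-syllable mark. It uses a greedy longest-match strategy, which is only one possible approach and is not a description of what ICU actually does. The function name, the `max_len` limit and the toy lexicon are all hypothetical.

```python
def segment_words(syllables, lexicon, max_len=4):
    """Group syllables into words by greedy longest match against a word list.

    Illustrative sketch only: `lexicon` is a set of known words, each written
    as the concatenation of its syllables. A syllable that starts no known
    word is emitted on its own.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Toy data: "." stands in for the end-of-syllable mark.
toy_lexicon = {"a.b.", "c."}
print(segment_words(["a.", "b.", "c.", "d."], toy_lexicon))
# ['a.b.', 'c.', 'd.']
```

Because the mark stays inside "a.b.", no break is offered in the middle of that word, which is the effect of treating the mark as non-boundary. The quality of the result depends entirely on the word list.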

We thought about this for a long time in the past in the case of Thai, and we could not find any solution.
For Thai it actually works quite well in ICU, but the list of words is too short. It is nevertheless overridden by code in OpenOffice that makes Thai line breaking syllable-based :-(

Could you please give a concrete example of what you mean?
You probably mean line breaking = word breaking, right?
But that does not clarify for me what you mean either...
Line breaking and word boundaries are different. For example, you do not put a line break before a space (otherwise the space would be the first character of the next line), but you do put a word boundary before and after the space. For example, in "is the mouse red?" the line breaks are "is |the |mouse |red?" but the word boundaries are "is| |the| |mouse| |red|?", so that spaces are not sent to the spellchecker attached to the words. Each character in Unicode has line-breaking properties and word-boundary properties.
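
The difference can be shown with a small sketch. It uses two simplified regular expressions that only approximate the behaviour for space-separated text like the example sentence; the real Unicode line-breaking and word-boundary rules are far more detailed. The function names are hypothetical.

```python
import re


def line_segments(text):
    # A line break opportunity comes after a run of spaces, never before it,
    # so trailing spaces stay attached to the preceding word.
    return re.findall(r"\S+\s*|\s+", text)


def word_segments(text):
    # A word boundary sits on both sides of every space and punctuation mark,
    # so each word is isolated from the characters around it.
    return re.findall(r"\w+|\W", text)


text = "is the mouse red?"
print("|".join(line_segments(text)))  # is |the |mouse |red?
print("|".join(word_segments(text)))  # is| |the| |mouse| |red|?
```

Only the purely alphabetic pieces from the second segmentation ("is", "the", "mouse", "red") would be handed to the spellchecker.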
A very good word list is a requirement for ANY language
for quality spell checking, exactly like a very good
affix file.
Spell-checking and line-breaking lists do not need to be identical. There are words that you might not want to break but that you spellcheck separately, or vice versa.
How does ICU come in here?
It does the tokenization (puts the word boundaries in) for OpenOffice, as well as the rendering of complex scripts. Some of the library's code has been integrated into OOo and modified for specific uses or languages.
I think that when word breaks are the same as syllable
breaks, there is NO solution at all. Unfortunately.
They are different: one word can have one or several syllables. Breaking by syllables is easy, especially in Dzongkha, where there is a character that is the end-of-syllable character.
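
As a rough illustration of why syllable breaking is easy when such a character exists, here is a minimal sketch. It assumes the end-of-syllable character is U+0F0B (TIBETAN MARK INTERSYLLABIC TSHEG); that choice, the function name and the placeholder input are assumptions made for the example.

```python
TSHEG = "\u0F0B"  # assumed end-of-syllable mark for this sketch


def split_syllables(text, mark=TSHEG):
    """Cut text after every end-of-syllable mark, keeping the mark attached."""
    syllables = []
    current = ""
    for ch in text:
        current += ch
        if ch == mark:
            syllables.append(current)
            current = ""
    if current:
        # Trailing text with no closing mark is kept as a final syllable.
        syllables.append(current)
    return syllables


# Latin placeholders stand in for real syllables; only the mark matters here.
sample = "ka" + TSHEG + "kha" + TSHEG + "ga"
print(len(split_syllables(sample)))  # 3
```

This finds syllables only. Deciding which runs of syllables form a word still needs a word list.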
He cannot change the original text, and cannot modify
either the syllable breaks or the word breaks.

A machine cannot find out from syllables which combination
is valid and which is not, using just a syllable list.
Actually, yes, because the scripts that have this problem are abugidas (scripts that originate in Brahmi) and they have orthographic syllables in recognizable clusters (Thai is the most complex one)... but breaking into syllables is always a bad solution; words are better.

Cheers,

Javier
-eleonora


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lingucomponent.openoffice.org
For additional commands, e-mail: dev-h...@lingucomponent.openoffice.org



