Re: [lingu-dev] syllable and word.....

Javier SOLA Mon, 15 Jun 2009 01:49:44 -0700

ge wrote:

Javier Sola wrote:

The only good solution that I see is to use dictionary-based line
breaking, and also spellchecking, but this takes some work with ICU and
with OpenOffice, as well as very good word lists.


For dictionary-based breaking, Tsheng must be reclassified as non-boundary.

We thought about this for a long time in the past in the case of Thai, and we could not find any solution.

For Thai it actually works quite well in ICU, but the list of words is too short. It is nevertheless overridden by code in OpenOffice that makes Thai line-breaking syllable-based :-(

Could you please give a concrete example of what you mean?
You probably mean line breaking = word breaking, right?
But that does not clarify for me what you mean either....

Line-breaking and word boundaries are different. For example, you do not put a line break before a space (otherwise the space would be the first character of the next line), but you put a word boundary before and after the space. In "is the mouse red?" the line breaks are "is |the |mouse |red?" but the word boundaries are "is| |the| |mouse| |red|?", so that spaces are not sent attached to the words to the spellchecker. Each character in Unicode has line-breaking properties and word-boundary properties.
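
To make the difference concrete, here is a minimal sketch for space-separated text only. The function names are made up for illustration, and the regular expressions are a rough stand-in for the per-character Unicode properties (the real rules in UAX #14 and UAX #29 are considerably more involved, and this is not how ICU implements them).

```python
import re

def word_boundary_segments(text):
    # Word boundaries: runs of word characters, runs of whitespace and
    # single punctuation characters each become their own segment.
    return re.findall(r"\w+|\s+|[^\w\s]", text)

def line_break_segments(text):
    # Line-break opportunities: a break is allowed only after a run of
    # spaces, so every segment keeps its trailing spaces.
    return re.findall(r"\S+\s*|\s+", text)

text = "is the mouse red?"
print("|".join(word_boundary_segments(text)))  # is| |the| |mouse| |red|?
print("|".join(line_break_segments(text)))     # is |the |mouse |red?
```

Only the first segmentation gives tokens that can be passed to a spellchecker without attached spaces.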

A very good word list is a requirement for ANY language
for quality spell checking, exactly like a very good
affix file.

Spell-checking and line-breaking lists do not need to be identical. There are words that you might not want to break, but that you spellcheck separately, or vice versa.

How does ICU come in here?

It does the tokenization (puts the word boundaries in) for OpenOffice, as well as the rendering of complex scripts. Some of the library's code has been integrated in OOo and modified, for specific uses or languages.

I think that when word breaks are the same as syllable
breaks, there is NO solution at all. Unfortunately.

They are different: one word can have one or several syllables. Breaking by syllables is easy, especially in Dzongkha, where there is a character that is the end-of-syllable character.
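
A minimal sketch of such syllable breaking, assuming the end-of-syllable character is the Tibetan-script tsheg (U+0F0B). The function name and the sample string (intended to spell "Dzongkha" in two syllables) are illustrative only.

```python
TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG (assumed end-of-syllable mark)

def split_syllables(text):
    # Cut after every tsheg, keeping the tsheg attached to the
    # syllable it terminates.
    syllables = []
    current = ""
    for ch in text:
        current += ch
        if ch == TSHEG:
            syllables.append(current)
            current = ""
    if current:
        # Trailing material without a closing tsheg.
        syllables.append(current)
    return syllables

sample = "\u0f62\u0fab\u0f7c\u0f44\u0f0b\u0f41\u0f0b"
for syllable in split_syllables(sample):
    print(syllable)
```

This only finds syllables; it says nothing about which syllables belong together in a word.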

He cannot change the original text, and cannot modify
either the syllable breaks or the word breaks.

A machine cannot find out from syllables which combinations
are valid and which are not, using just a syllable list.

Actually, yes, because the scripts that have this problem are abugidas (scripts that originate in Brahmi) and they have orthographic syllables in recognizable clusters (Thai is the most complex one)... but breaking in syllables is always a bad solution; words are better.
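
As a rough illustration of why a word list helps, here is a sketch that groups already-split syllables into words by greedy longest match. This is a deliberately simplified stand-in, not the algorithm ICU uses; the function name, the `max_len` parameter and the Latin placeholder syllables are all hypothetical.

```python
def segment_words(syllables, word_list, max_len=4):
    # Greedy longest match: at each position take the longest run of
    # syllables (up to max_len) that is in the word list. A single
    # syllable that matches nothing is emitted as a word of its own.
    words = []
    i = 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if n == 1 or candidate in word_list:
                words.append(candidate)
                i += n
                break
    return words

syllables = ["ka", "ta", "mi", "so"]  # placeholder syllables, not real words
print(segment_words(syllables, {"kata", "miso"}))  # ['kata', 'miso']
print(segment_words(syllables, {"kata"}))          # ['kata', 'mi', 'so']
```

The second call shows what a too-short word list does: anything missing from it falls back to syllable-by-syllable breaks.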


Cheers,

Javier

-eleonora


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lingucomponent.openoffice.org
For additional commands, e-mail: dev-h...@lingucomponent.openoffice.org



