Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Richard Wordingham

On Thu, 27 Sep 2012 21:08:13 +0700
Nathan Wells  wrote:

> Firstly, you are right, I was mistaken about ICU and the breakiterator
> working for sentences (I just tried it right now and it does work,
> but just not with the normal "khan" or "period" of Khmer rather it
> works with Latin sentence markers which is not enough).  I had
> thought when we put in the code for the breakiterator that it also
> covered the sentence, but I guess not (I will work towards getting it
> working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be
customised, though it is presently only done for Greek.  However, if
you want Khmer *sentence* rather than *clause* breaking, it will need a
lot of work - papers are still being published on breaking Thai into
sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).

> In response to your comments:
> 
> > 1) The user always marks word breaks with ZWSP.
> > In this case, the ideal is to switch off the break iterator for the
> > language.
> 
> 
> There is some truth to this - and that is why I had it as my last
> option (just turning the whole thing off). But the ICU breakiterator
> for Khmer actually works quite well with normal language - it breaks
> down when there are proper names. So turning it off is an option, but
> not the most ideal solution. Some users will continue to always mark
> breaks with a ZWSP (for full control), but I also think having the
> option to turn it off for more complex sentences would be ideal.
> 
> > 2) The user never marks word breaks.
> > In this case, the user is totally dependent on the break iterator,
> > and cannot be helped when it fails.
> 
> As I said above, I think a both/and solution would be ideal for Khmer.
> But if in the end it would work better for Thai to have an "off" and
> "on" option only, that would be fine for Khmer as well for now, until
> we can come up with a more ideal solution.
> 
> 
> > 3) The user only marks word breaks and non-word breaks when the
> > iterator fails.
> 
> The problem with this in Khmer is the user cannot tell when the
> breakiterator fails, unless it is on a line-break.  A word could be
> broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words,
which prompts red ink over strange extents. Usually the words are not
recognised because they're misspelt, but not always.  The problem I see
in Thai is usually not so much extra word boundaries as misplaced
word boundaries.

> Actually, if users could see where the
> breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

> The only problem with this would be at the beginning of a document or
> the beginning of any new "re-syncing" segment because you might run
> into something like this:

> User input (example in English so others can make sense of it I hope):
> wordwordwordwordword.
> How the sentence is broken up by the breakiterator: wo r d word word
> wo rd word.
> User adds ZWSP to fix broken word on line-break: wo r d word word
> ZWSPwordword.

This example confuses me.  The problem here seems to be extra word
breaks rather than missing word breaks, and I don't see how confirming
a word break helps.

> But user has no idea the first word is broken incorrectly and that it
> is also spelled incorrectly.

> This is why it would be best (I think) as Martin suggested that when
> a ZWSP is detected it also turn off break iteration for the previous
> words up until a re-sync point.  This would practically give the user
> an "off" option for the whole document if they so chose, and without
> the confusion of having to find some option in the Tools menu to turn
> it on or off - it would just be automatic, depending on the user's
> habit.

I was clearly not clear enough.  In the example above,
'wordwordwordwordword' is what I would call a dictionariless word - a
word-breaker without a dictionary (e.g. a shell's parser) would see it
as just one 'word'.  Therefore, once ZWSP is inserted and
word-breaking disabled, dictionary-based word-breaking is not applied to
wordwordwordZWSPwordword, and, typically, red squiggles appear under
wordwordword and wordword.  The boundary may be revealed by a phase
discontinuity or gap in the squiggle.  Under the proposed scheme, the
user has to introduce another three ZWSPs even if the dictionary
contains all the words.
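
To make the squiggle behaviour concrete, here is a minimal Python
sketch.  It assumes a toy greedy longest-match breaker standing in for
the dictionary-based iterator and a one-word dictionary; the names
greedy_break, tokens_to_check and misspelt are illustrative only and
are not ICU or LibreOffice functions.

```python
ZWSP = "\u200b"


def greedy_break(run, dictionary):
    """Toy stand-in for a dictionary-based breaker: longest match first."""
    words, i = [], 0
    while i < len(run):
        for j in range(len(run), i, -1):
            if run[i:j] in dictionary:
                break
        else:
            j = i + 1  # nothing in the dictionary starts here: emit one character
        words.append(run[i:j])
        i = j
    return words


def tokens_to_check(run, dictionary):
    """Tokens handed to the spell checker for one dictionariless word."""
    if ZWSP in run:
        # A ZWSP disables the breaker for the whole run, so only the
        # user's own marks count as boundaries.
        return [t for t in run.split(ZWSP) if t]
    return greedy_break(run, dictionary)


def misspelt(run, dictionary):
    """Tokens that would get a red squiggle."""
    return [t for t in tokens_to_check(run, dictionary) if t not in dictionary]
```

With a dictionary holding only "word", the unmarked run produces no
squiggles, whereas wordwordwordZWSPwordword flags both halves until the
remaining three ZWSPs are typed.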

> I agree with this:
> 
> > Considering these four use cases, it seems simplest to let ZWSP, WJ
> > and ZWNBSP disable the iterator for the extent of the
> > dictionariless word in which it occurs.

> Except, it also should disable the breakiterator up to the previous
> re-sync point...

But that is what I meant!

> But actually, there is a rule in ICU for the MAIYAMOK
> so unless that is not working properly, I am not sure why LibreOffice
> doesn't break correctly...

I'll have to look further into this - and check that misbehaviour is
still happening.  Squiggl

Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Richard Wordingham

On Thu, 27 Sep 2012 11:52:26 +0700
Nathan Wells  wrote:

>> 1. If you are shutting off the ICU breakiterator for text following,
>> we should probably also do it for text preceding. Thus if there is a
>> ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
>> iteration is disabled for the whole sentence.

> Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
> break iteration should be disabled for the whole sentence.

What is the logic of this?

The use cases I see are:

1) The user always marks word breaks with ZWSP.

In this case, the ideal is to switch off the break iterator for the
language.

2) The user never marks word breaks.

In this case, the user is totally dependent on the break iterator, and
cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the iterator
fails.

In this case, the iterator need only be switched off from the point of
override until it can clearly re-synch.  The obvious re-synching points
are word external punctuation, such as end-of-line, white space,
quotation marks, commas and dandas (and as dandas I would include U+0E2F
THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
ฯลฯ and ฯเปฯ).

Now, it may be easier to explain the rule if it applies to the whole
'word' - for what we are looking at is pretty much a 'word' as
understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark
word breaks, others expect the application to correctly identify them.

A ZWSP in a chunk of text would then tag the text as having come from a
user in case 1 or 3; we have no reliable way of distinguishing the
two cases.  A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so
paragraph initial is suspect) would strongly suggest use case 3 - but
might occur in use case 1 if the user has had to fight a break
iterator.

(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and
ZWNBSP disable the iterator for the extent of the dictionariless word
in which it occurs.
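
As a rough illustration of that rule, the Python sketch below splits
text into dictionariless words at re-sync characters and reports, for
each one, whether the break iterator would still be applied.  The
re-sync set is an assumption chosen for the example (white space, a few
quotation marks, comma, U+0E2F and U+17D5); it makes no exception for
ฯลฯ and does not treat a paragraph-initial U+FEFF as a BOM.  The
function names are hypothetical.

```python
from itertools import groupby

ZWSP, WJ, ZWNBSP = "\u200b", "\u2060", "\ufeff"
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Assumed re-sync punctuation, in addition to any white space.
RESYNC_CHARS = {'"', "'", "\u201c", "\u201d", ",", "\u0e2f", "\u17d5"}


def is_resync(ch):
    """True for characters at which the iterator can clearly re-synch."""
    return ch.isspace() or ch in RESYNC_CHARS


def dictionariless_words(text):
    """Return (run, iterate) pairs; iterate is False when the run holds
    a ZWSP, WJ or ZWNBSP and the iterator should therefore be skipped."""
    result = []
    for is_sep, group in groupby(text, key=is_resync):
        if is_sep:
            continue
        run = "".join(group)
        iterate = not any(ch in OVERRIDES for ch in run)
        result.append((run, iterate))
    return result
```

Only the run that actually contains an override character is switched
off; neighbouring runs on the other side of a space or comma still go
through the iterator.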

What is the definition of an ICU sentence boundary?  I see no evidence
from CLDR 2.9 that it should be even approximately right for Khmer (or
Thai). Splitting Thai text into sentences is known to be challenging -
we can therefore expect different applications to split text
differently.

The one downside I can see to my suggestion is that if all word
boundaries are marked, switching the iterator off dictionariless word
by dictionariless word will require slightly greater use of WJ, for a
ZWSP later in the sentence will not necessarily be in the same
dictionariless word.

A related issue that seems not to be handled is the repetition mark
U+0E46 THAI CHARACTER MAIYAMOK.  It should be separated from the
preceding alphabetic characters by a space, but LibreOffice doesn't
recognise the sequence as a possible continuation of the word.  Sometimes it
is a necessary part of a word.  I don't know what the situation is in
Khmer.
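
One possible way to treat the sequence as a continuation, sketched in
Python under simplifying assumptions: the text has already been split
on U+0020, and a token consisting solely of MAIYAMOK is glued back onto
the word before it, so that the spell checker sees the two as one
entry.  attach_maiyamok is a hypothetical helper, not anything in ICU
or LibreOffice.

```python
MAIYAMOK = "\u0e46"  # THAI CHARACTER MAIYAMOK


def attach_maiyamok(text):
    """Split on spaces, but keep a lone MAIYAMOK with the preceding word."""
    words = []
    for token in text.split(" "):
        if token == MAIYAMOK and words:
            # Space + MAIYAMOK continues the previous word.
            words[-1] += " " + token
        elif token:
            words.append(token)
    return words
```

A MAIYAMOK that is immediately followed by further unspaced text is not
handled here; that case would need the break iterator itself.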

Richard.
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice

Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Martin Hosken

Dear Nathan,

> Here are some new ideas, ordered by desirability, with number one being the
> most desired, to number three being the least.
> 
> 1) When a zero-width space is detected (U+200B), shut off ICU breakiterator
> for Khmer spell checking for characters following the zero-width space
> until it encounters a real space (U+0020) or end of sentence (detect end of
> sentence using ICU Sentence Boundary).

I think this is a good direction to head. I have two follow-on comments:

1. If you are shutting off the ICU breakiterator for text following, we should 
probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP 
(U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the 
whole sentence.

2. Why limit this to Khmer? I suspect as a model it should work for any 
non-space broken text.

Yours,
Martin



> 
> 2) Disable use of ICU breakiterator for Khmer spell checking by default,
> but allow users to enable it by adding a check-box to enable ICU
> breakiterator in the Tools > Options > Language Settings > Writing Aids >
> Options dialogue when a Khmer Hunspell dictionary is present (
> http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version
>  ).
> 
> 3) Disable use of ICU breakiterator for Khmer spell checking until the ICU
> breakiterator for Khmer is more accurate.
> 
> Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6
> it causes a lot of spelling errors to go unnoticed since the ICU
> breakiterator breaks words up incorrectly. So hopefully we can find a
> solution that will work with the current ICU breakiterator - though with
> ICU 50.1 the breakiterator for Khmer will have some improvements. But I do
> feel if solution 1 or 2 (or if someone else has better ideas) cannot
> be implemented the breakiterator for spelling with Khmer should be turned
> off in LibreOffice until the ICU breakiterator for Khmer is more accurate.
> 
> 
> Thanks again for your help and time, your input is greatly appreciated!
> 
> Sincerely,
> 
> Nathan
> 
> 
> 
> On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken wrote:
> 
> > Dear All,
> >
> > > > An automatic word and line breaker is very necessary for Khmer and
> > > > Thai because traditionally they have no spaces between words, and so
> > > > line-breaking and spell checking require the use of a zero-width space
> > > > between words which is counterintuitive for most native speakers, and
> > > > so spell checking goes widely unused.
> >
> > I agree that automatic word breaking is a good thing and I am relieved to
> > see that libreoffice does it based on language selection and not on
> > automatic language guessing based on scripts. There are more languages that
> > use Thai script and Khmer script than just Thai and Khmer. So one of my
> > fears is already alleviated :)
> >
> > > > But now with the ICU code you implemented, Thai and Khmer can be
> > > > automatically broken, and the results are quite good. But with its
> > > > implementation in the real world, I have found some issues that I
> > > > wanted to raise and also suggest possible solutions. I write this as
> > > > an end-user, not so much as a programmer, nor do I claim to fully
> > > > understand the inner-workings of ICU and LibreOffice (because I don't!
> > > > ).
> > > >
> > > > First, I will do my best to explain the current results of the ICU
> > > > break iterator with Khmer:
> > > >
> > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > > >
> > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > > >
> > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > > > ឈ្មោះ|សិវកឥវលិយៈ
> > > >
> > > > The differences should be clear – the ICU break iterator does not
> > > > break the words with 100% accuracy.
> > > >
> > > > One possible solution to this issue is by how the ICU Break Iterator
> > > > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> > > > code was enabled to automatically break Khmer, if an end-user wanted
> > > > to spell check Khmer, they had to manually place U+200B characters to
> > > > separate words. This solution worked quite well, but was
> > > > counterintuitive to most native speakers, because Khmer has no spaces
> > > > (as stated before). But with this solution, an end-user could be sure
> > > > that their document was broken with 100% accuracy, if there was no
> > > > human error (something automatic solutions cannot do – it is more
> > > > along the lines of 80% accurate). What I propose, is that the break
> > > > iterator code in LibreOffice looks for U+200B characters in a given
> > > > string and considers them as a sign to NOT automatically break, but to
> > > > allow the end-user full control to manually break words. Let me
> > > > explain:
> > > >
> > > >  1. The code starts processing the text and automatically breaking
> > > > it until it comes across a U+200B character. If one is found,
> > > > it

Re: Adding Extension for Experimental Thai Spelling

2012-09-27 Thread Nathan Wells

Thanks for your input Richard,

Firstly, you are right, I was mistaken about ICU and the breakiterator
working for sentences (I just tried it right now and it does work, but just
not with the normal "khan" or "period" of Khmer rather it works with Latin
sentence markers which is not enough).  I had thought when we put in the
code for the breakiterator that it also covered the sentence, but I guess
not (I will work towards getting it working for Khmer).

In response to your comments:

> 1) The user always marks word breaks with ZWSP.
> In this case, the ideal is to switch off the break iterator for the
> language.


There is some truth to this - and that is why I had it as my last option
(just turning the whole thing off). But the ICU breakiterator for Khmer
actually works quite well with normal language - it breaks down when there
are proper names. So turning it off is an option, but not the most ideal
solution. Some users will continue to always mark breaks with a ZWSP (for
full control), but I also think having the option to turn it off for more
complex sentences would be ideal.

> 2) The user never marks word breaks.
> In this case, the user is totally dependent on the break iterator, and
> cannot be helped when it fails.

As I said above, I think a both/and solution would be ideal for Khmer. But
if in the end it would work better for Thai to have an "off" and "on"
option only, that would be fine for Khmer as well for now, until we can
come up with a more ideal solution.


> 3) The user only marks word breaks and non-word breaks when the iterator
> fails.

The problem with this in Khmer is the user cannot tell when the
breakiterator fails, unless it is on a line-break.  A word could be broken
up into three parts and the user would never know it. This is why the issue
is so complex. Actually, if users could see where the breakiterator is
breaking words, that would simplify things a lot. Though I still think the
option to turn the breakiterator "on" or "off" for certain sentences would
be ideal (especially sentences with a ton of proper nouns where the ICU
breakiterator for Khmer has the most trouble).

As far as finding re-syncing points (when to turn the breakiterator back on
when it is turned off by a ZWSP) I agree with you:

> The obvious re-synching points
> are word external punctuation, such as end-of-line, white space,
> quotation marks, commas and dandas (and as dandas I would include U+0E2F
> THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
> KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
> ฯลฯ and ฯเปฯ).


The only problem with this would be at the beginning of a document or the
beginning of any new "re-syncing" segment because you might run into
something like this:

User input (example in English so others can make sense of it I hope):
wordwordwordwordword.
How the sentence is broken up by the breakiterator: wo r d word word wo rd
word.
User adds ZWSP to fix broken word on line-break: wo r d word word
ZWSPwordword.
But user has no idea the first word is broken incorrectly and that it is
also spelled incorrectly.

This is why it would be best (I think) as Martin suggested that when a ZWSP
is detected it also turn off break iteration for the previous words up
until a re-sync point.  This would practically give the user an "off" option
for the whole document if they so chose, and without the confusion of
having to find some option in the Tools menu to turn it on or off - it
would just be automatic, depending on the user's habit.

I agree with this:

> Considering these four use cases, it seems simplest to let ZWSP, WJ and
> ZWNBSP disable the iterator for the extent of the dictionariless word
> in which it occurs.


Except, it also should disable the breakiterator up to the previous re-sync
point to enable users to functionally "turn off" the breakiterator if they
so choose (for Khmer this is necessary because for a book editor like
myself, I will want to manually put the breaks and not let the
breakiterator do anything automatically - but the feature is nice for the
casual user because it is much faster and more intuitive to not type spaces
between words for Cambodians).

> A related issue that seems not to be handled is repetition mark U+0E46
> THAI CHARACTER MAIYAMOK.  It should be separated from the preceding
> alphabetic characters by a space, but LibreOffice doesn't recognise
> the sequence as a possible continuation of the word.  Sometimes it
> is a necessary part of a word.  I don't know what the situation is in
> Khmer.


In Khmer the repeat character (U+17D7 LEK TOO) is not separated from the
preceding word by a space, but is connected, so this is not an issue for
us.  But actually, there is a rule in ICU for the MAIYAMOK so unless that
is not working properly, I am not sure why LibreOffice doesn't break
correctly...

Here's the code from ICU4C for the Thai MAIYAMOK from dictbe.cpp if anyone
is interested...

if (uc == 
THAI_MAIYAMOK

Re: Adding Extension for Experimental Thai Spelling

2012-09-26 Thread Nathan Wells

Thanks Martin,


> 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.


Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break
iteration should be disabled for the whole sentence.


> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.


I am only limiting it to Khmer because that is my expertise and I didn't
want to cause problems for other languages - but it is possible these
changes would be beneficial for other languages that are not broken by
spaces (like Thai).


Thanks,
Nathan

On Thu, Sep 27, 2012 at 11:45 AM, Martin Hosken wrote:

> Dear Nathan,
>
> > Here are some new ideas, ordered by desirability, with number one being
> > the most desired, to number three being the least.
> >
> > 1) When a zero-width space is detected (U+200B), shut off ICU
> > breakiterator for Khmer spell checking for characters following the
> > zero-width space until it encounters a real space (U+0020) or end of
> > sentence (detect end of sentence using ICU Sentence Boundary).
>
> I think this is a good direction to head. I have two follow-on comments:
>
> * 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.
>
> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.*
>
> Yours,
> Martin
>
>
>
> >
> > 2) Disable use of ICU breakiterator for Khmer spell checking by default,
> > but allow users to enable it by adding a check-box to enable ICU
> > breakiterator in the Tools > Options > Language Settings > Writing Aids >
> > Options dialogue when a Khmer Hunspell dictionary is present (
> > http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version
> >  ).
> >
> > 3) Disable use of ICU breakiterator for Khmer spell checking until the
> > ICU breakiterator for Khmer is more accurate.
> >
> > Currently, with the ICU breakiterator for Khmer enabled in LibreOffice
> > 3.6 it causes a lot of spelling errors to go unnoticed since the ICU
> > breakiterator breaks words up incorrectly. So hopefully we can find a
> > solution that will work with the current ICU breakiterator - though with
> > ICU 50.1 the breakiterator for Khmer will have some improvements. But I
> > do feel if solution 1 or 2 (or if someone else has better ideas) cannot
> > be implemented the breakiterator for spelling with Khmer should be turned
> > off in LibreOffice until the ICU breakiterator for Khmer is more
> > accurate.
> >
> >
> > Thanks again for your help and time, your input is greatly appreciated!
> >
> > Sincerely,
> >
> > Nathan
> >
> >
> >
> > On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken wrote:
> >
> > > Dear All,
> > >
> > > > > An automatic word and line breaker is very necessary for Khmer and
> > > > > Thai because traditionally they have no spaces between words, and
> > > > > so line-breaking and spell checking require the use of a zero-width
> > > > > space between words which is counterintuitive for most native
> > > > > speakers, and so spell checking goes widely unused.
> > >
> > > I agree that automatic word breaking is a good thing and I am relieved
> > > to see that libreoffice does it based on language selection and not on
> > > automatic language guessing based on scripts. There are more languages
> > > that use Thai script and Khmer script than just Thai and Khmer. So one
> > > of my fears is already alleviated :)
> > >
> > > > > But now with the ICU code you implemented, Thai and Khmer can be
> > > > > automatically broken, and the results are quite good. But with its
> > > > > implementation in the real world, I have found some issues that I
> > > > > wanted to raise and also suggest possible solutions. I write this
> > > > > as an end-user, not so much as a programmer, nor do I claim to fully
> > > > > understand the inner-workings of ICU and LibreOffice (because I
> > > > > don't!).
> > > > >
> > > > > First, I will do my best to explain the current results of the ICU
> > > > > break iterator with Khmer:
> > > > >
> > > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > > > >
> > > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > > > >
> > > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > > > > ឈ្មោះ|សិវកឥវលិយៈ
> > > > >
> > > > > The differences should be clear – the ICU break iterator does not
> > > > > break the words with 100% accuracy.
> > > > >
> > > > > One possible solution to this issue is by how the ICU Break
> > > > > Iterator interacts with zero-width spaces (U+200B) in LibreOffice.
> > > > > Before ICU code was enab

Re: Adding Extension for Experimental Thai Spelling

2012-09-26 Thread Nathan Wells

Hello Again,

Thank you all for your input!

This is a deeper problem than I first thought...sorry for the delayed
response, but I hope a solution can be found, even though the current ICU
breakiterator is not at 100% for Khmer.

Here are some new ideas, ordered by desirability, with number one being the
most desired, to number three being the least.

1) When a zero-width space is detected (U+200B), shut off ICU breakiterator
for Khmer spell checking for characters following the zero-width space
until it encounters a real space (U+0020) or end of sentence (detect end of
sentence using ICU Sentence Boundary).

2) Disable use of ICU breakiterator for Khmer spell checking by default,
but allow users to enable it by adding a check-box to enable ICU
breakiterator in the Tools > Options > Language Settings > Writing Aids >
Options dialogue when a Khmer Hunspell dictionary is present (
http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version
 ).

3) Disable use of ICU breakiterator for Khmer spell checking until the ICU
breakiterator for Khmer is more accurate.
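
To show what option 1 would mean in practice, here is a small Python
sketch.  It only computes the character ranges in which the
breakiterator would be skipped; the sentence boundaries are assumed to
be supplied by the caller as offsets (in a real implementation they
might come from the ICU sentence iterator), and disabled_ranges is a
made-up name for illustration.

```python
ZWSP, SP = "\u200b", "\u0020"


def disabled_ranges(text, sentence_ends=()):
    """Return half-open (start, end) ranges where auto-breaking is off.

    A range opens just after a ZWSP and closes at the next U+0020 or at
    the next offset listed in sentence_ends, whichever comes first.
    """
    stops = set(sentence_ends)
    ranges, start = [], None
    for i, ch in enumerate(text):
        if start is not None and (ch == SP or i in stops):
            ranges.append((start, i))
            start = None
        if ch == ZWSP and start is None:
            start = i + 1
    if start is not None:
        # No space or sentence end followed: off until the end of the text.
        ranges.append((start, len(text)))
    return ranges
```

For example, in "abc" ZWSP "def" ZWSP "ghi jkl" the iterator would be
off from "def" up to the space and back on for "jkl".  Text before the
first ZWSP is still broken automatically under this option.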

Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6
it causes a lot of spelling errors to go unnoticed since the ICU
breakiterator breaks words up incorrectly. So hopefully we can find a
solution that will work with the current ICU breakiterator - though with
ICU 50.1 the breakiterator for Khmer will have some improvements. But I do
feel if solution 1 or 2 (or if someone else has better ideas) cannot
be implemented the breakiterator for spelling with Khmer should be turned
off in LibreOffice until the ICU breakiterator for Khmer is more accurate.

Thanks again for your help and time, your input is greatly appreciated!

Sincerely,

Nathan

On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken wrote:

> Dear All,
>
> > > An automatic word and line breaker is very necessary for Khmer and
> > > Thai because traditionally they have no spaces between words, and so
> > > line-breaking and spell checking require the use of a zero-width space
> > > between words which is counterintuitive for most native speakers, and
> > > so spell checking goes widely unused.
>
> I agree that automatic word breaking is a good thing and I am relieved to
> see that libreoffice does it based on language selection and not on
> automatic language guessing based on scripts. There are more languages that
> use Thai script and Khmer script than just Thai and Khmer. So one of my
> fears is already alleviated :)
>
> > > But now with the ICU code you implemented, Thai and Khmer can be
> > > automatically broken, and the results are quite good. But with its
> > > implementation in the real world, I have found some issues that I
> > > wanted to raise and also suggest possible solutions. I write this as
> > > an end-user, not so much as a programmer, nor do I claim to fully
> > > understand the inner-workings of ICU and LibreOffice (because I don't!
> > > ).
> > >
> > > First, I will do my best to explain the current results of the ICU
> > > break iterator with Khmer:
> > >
> > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > >
> > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > >
> > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > > ឈ្មោះ|សិវកឥវលិយៈ
> > >
> > > The differences should be clear – the ICU break iterator does not
> > > break the words with 100% accuracy.
> > >
> > > One possible solution to this issue is by how the ICU Break Iterator
> > > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> > > code was enabled to automatically break Khmer, if an end-user wanted
> > > to spell check Khmer, they had to manually place U+200B characters to
> > > separate words. This solution worked quite well, but was
> > > counterintuitive to most native speakers, because Khmer has no spaces
> > > (as stated before). But with this solution, an end-user could be sure
> > > that their document was broken with 100% accuracy, if there was no
> > > human error (something automatic solutions cannot do – it is more
> > > along the lines of 80% accurate). What I propose, is that the break
> > > iterator code in LibreOffice looks for U+200B characters in a given
> > > string and considers them as a sign to NOT automatically break, but to
> > > allow the end-user full control to manually break words. Let me
> > > explain:
> > >
> > >  1. The code starts processing the text and automatically breaking
> > > it until it comes across a U+200B character. If one is found,
> > > > it searches to see if there are any additional U+200B or
> > > > U+0020 characters in the following 20 characters (or so), and
> > > if there are, the break iterator skips over those characters
> > > and starts again from the second U+200B character (or U+0020,
> > > but a U+0020 character would only signify the “close” of the
> > > manual break because sometimes a

Re: Adding Extension for Experimental Thai Spelling

2012-07-27 Thread Richard Wordingham

On Thu, 26 Jul 2012 16:33:00 +0700
Martin Hosken  wrote:

> 1. use of U+2060 makes string searching and spell checking harder
> (unless WJ chars are stripped for searching and spell checking). They
> are not part of the spelling of a word, so their introduction in the
> underlying text stream is problematic for other text processing
> processes (like searching as mentioned). This is less of an issue for
> U+200B ZWSP because that occurs between words and searching across
> word boundaries is a rarer activity. Likewise spell checking across
> word boundaries isn't really needed.

U+2060 WJ should definitely be skipped for searching and, once it has
done its gluing job, spell-checking look-up, just like U+00AD SOFT
HYPHEN.  They're both indubitable complete ignorables for collation and
therefore for UCA (Unicode Collation Algorithm) search.
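
A minimal Python sketch of that skipping, assuming plain substring
matching rather than a full UCA search: U+2060 and U+00AD are removed
from both the text and the search key before comparison.  lookup_key
and contains are illustrative names only.

```python
WJ = "\u2060"   # WORD JOINER
SHY = "\u00ad"  # SOFT HYPHEN

# Translation table that deletes both characters.
IGNORABLE = str.maketrans("", "", WJ + SHY)


def lookup_key(token):
    """Form of a token to use for dictionary look-up or searching."""
    return token.translate(IGNORABLE)


def contains(text, needle):
    """Substring search that ignores WJ and SOFT HYPHEN on both sides."""
    return lookup_key(needle) in lookup_key(text)
```

The stripping should happen only after WJ has done its gluing job,
i.e. after word boundaries have been decided.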

> Now what happens if I want to put zw around a word that occurs < 20
> chars after my last zw? The on off nature of the zw has now been
> inverted. One option is to say that zw must always occur in pairs and
> you would have to bracket your first or second word there. But then
> management of which zw is on and which is off will get confusing for
> users.

I think that is the wrong way of looking at it.  Various characters,
some ZWSP, others more natural, such as SP, tell the break iterators
where some word boundaries are.  The rule we would have is that the
break iterator should not try to break runs of less than, say, 20
characters if one of the boundaries is provided by ZWSP.  I am not
proposing that we limit how many breaks it makes in a run - 21
characters could be broken into seven words.  The short runs the break
iterator is prohibited from breaking can still be checked for spelling.
If they are not words, then the user can respond to the red wiggly line
appropriately, e.g. by putting extra word breaks in.

In the example you gave, one would have to split the words between the
delimited words.  I think the users must accept that - the rule we
would be working with is that the break iterator does not break short
runs created by inserted ZWSP, and that is a simple rule to
understand.  I suppose there may be some question of what to count -
base consonants perhaps? (In Unicode jargon, that would be extended
default graphemes.)  That might be a luxury feature we never need to
add.
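
The short-run rule can be sketched in Python as follows.  The limit of
20 and counting by code point are assumptions taken from the "say, 20
characters" above; what to count is exactly the open question just
mentioned.  runs and may_auto_break are hypothetical names.

```python
ZWSP, SP = "\u200b", " "
MIN_RUN = 20  # assumed threshold; the exact value is open


def runs(text):
    """Yield (run, left, right) for text between ZWSP/SP marks.

    left and right are the delimiting characters, or None at the
    start or end of the text.
    """
    start, left = 0, None
    for i, ch in enumerate(text):
        if ch in (ZWSP, SP):
            if i > start:
                yield text[start:i], left, ch
            start, left = i + 1, ch
    if start < len(text):
        yield text[start:], left, None


def may_auto_break(run, left, right):
    """False for a short run with at least one boundary given by ZWSP."""
    short = len(run) < MIN_RUN
    user_marked = ZWSP in (left, right)
    return not (short and user_marked)
```

A run the iterator may not break is still passed whole to the spell
checker, so a wrong manual break shows up as a red wiggly line.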

Richard.

Re: Adding Extension for Experimental Thai Spelling

2012-07-26 Thread Martin Hosken

Dear All,

> > An automatic word and line breaker is very necessary for Khmer and
> > Thai because traditionally they have no spaces between words, and so
> > line-breaking and spell checking require the use of a zero-width space
> > between words which is counterintuitive for most native speakers, and
> > so spell checking goes widely unused.

I agree that automatic word breaking is a good thing, and I am relieved to see
that LibreOffice does it based on language selection and not on automatic
language guessing based on scripts. There are more languages that use Thai
script and Khmer script than just Thai and Khmer. So one of my fears is already
alleviated :)

> > But now with the ICU code you implemented, Thai and Khmer can be
> > automatically broken, and the results are quite good. But with its
> > implementation in the real world, I have found some issues that I
> > wanted to raise and also suggest possible solutions. I write this as
> > an end-user, not so much as a programmer, nor do I claim to fully
> > understand the inner-workings of ICU and LibreOffice (because I don't!
> > ).
> > 
> > First, I will do my best to explain the current results of the ICU
> > break iterator with Khmer:
> > 
> > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > 
> > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > 
> > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> > ឈ្មោះ|សិវកឥវលិយៈ
> > 
> > The differences should be clear – the ICU break iterator does not
> > break the words with 100% accuracy.
> > 
> > One possible solution to this issue is by how the ICU Break Iterator
> > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> > code was enabled to automatically break Khmer, if an end-user wanted
> > to spell check Khmer, they had to manually place U+200B characters to
> > separate words. This solution worked quite well, but was
> > counterintuitive to most native speakers, because Khmer has no spaces
> > (as stated before). But with this solution, an end-user could be sure
> > that their document was broken with 100% accuracy, if there was no
> > human error (something automatic solutions cannot do – it is more
> > along the lines of 80% accurate). What I propose, is that the break
> > iterator code in LibreOffice looks for U+200B characters in a given
> > string and considers them as a sign to NOT automatically break, but to
> > allow the end-user full control to manually break words. Let me
> > explain:
> > 
> >  1. The code starts processing the text and automatically breaking
> > it until it comes across a U+200B character. If one is found,
> > it searches to see if there are any additional U+200B or
> > U+0020 characters in the following 20 characters (or so), and
> > if there are, the break iterator skips over those characters
> > and starts again from the second U+200B character (or U+0020,
> > but a U+0020 character would only signify the “close” of the
> > manual break because sometimes a phrase will end and there
> > will be an actual space – so if the word that the user wants
> > to manually break has a “real” U+0020 space at the end of it,
> > then the user does not need to put an additional U+200B
> > character to close it) which then repeats, looking for U+200B
> > characters etc.
> > 
> >  2. This would allow end-users to choose to manually break their
> > whole document so they can have precise control, as well as
> > allow end-users to place U+200B characters around names of
> > people, places or transliterations in order to tell the break
> > iterator to not try to break those words.

In principle I like this approach. I like the idea of being able to force 
breaks and non-breaks. But I don't think we are quite there with this solution 
yet. Here are my difficulties with it:

1. Use of U+2060 makes string searching and spell checking harder (unless WJ
chars are stripped for searching and spell checking). They are not part of the
spelling of a word, so their introduction in the underlying text stream is
problematic for other text processing (like searching, as mentioned).
This is less of an issue for U+200B ZWSP because that occurs between words and
searching across word boundaries is a rarer activity. Likewise, spell checking
across word boundaries isn't really needed.
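
To make the "stripped for searching" idea concrete, here is a rough
sketch in Python. The function names are made up for illustration and are
not LibreOffice APIs; the returned index refers to the stripped text, and
mapping it back to the original offsets is left out.

```python
WJ = "\u2060"  # WORD JOINER


def strip_joiners(text):
    """Remove every WORD JOINER so it cannot interfere with matching."""
    return text.replace(WJ, "")


def find_ignoring_joiners(haystack, needle):
    """Return the match index in the joiner-stripped haystack, or -1."""
    return strip_joiners(haystack).find(strip_joiners(needle))
```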

2. How do we come up with the range of what is considered a word between two
ZWSP chars as opposed to two words? How close to the end of a string must a
ZWSP occur to disable all breaking before the end of the string? Does
"abcdefuvwxyz" block all breaks in the string? I think we need to think
harder (deeper) about the use of ZWSP in this way and see if we can come up
with something with a little less ambiguity. Having said that, I think we are
going to have to think really hard, because I don't think

Re: Adding Extension for Experimental Thai Spelling

2012-07-25 Thread Nathan Wells

Thanks for your reply.

Yes, a "view->word boundaries" mode would be very helpful (or
even incorporating the current "view->field shading" to include viewing
'gray marks' at the automatic ICU breaking so that users can see what is
being done). Would this be hard to implement?

Also, we are making some changes to the ICU break iterator dictionary for
Khmer - and I've heard there will be some changes in ICU 50 which should
improve results for Khmer.

If anyone has any ideas - it would be appreciated.

Thanks!
Nathan



Re: Adding Extension for Experimental Thai Spelling

2012-07-25 Thread Caolán McNamara

I'll cc this to the list if you don't mind, in order to archive it. I
have no immediate great ideas. But I wonder if a "view->word boundaries"
mode would be helpful, i.e. something that indicates the boundaries of
the words that the software thinks exist.
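
As a rough illustration of what such a mode could show, here is a small
Python sketch. It assumes the boundaries are available as offsets into the
text, which is an assumption about the interface and not a description of
the LibreOffice code; the "|" mark follows the notation used in the
examples quoted below.

```python
def show_boundaries(text, boundaries, mark="|"):
    """Return text with a visible mark at each interior boundary offset."""
    pieces = []
    prev = 0
    for pos in sorted(set(boundaries)):
        # Offsets at the very start or end add nothing visible.
        if 0 < pos < len(text):
            pieces.append(text[prev:pos])
            prev = pos
    pieces.append(text[prev:])
    return mark.join(pieces)
```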

On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote:
> 
> I hope you don't mind if I write and ask some more questions and ask
> for additional help in making the break iterator more functional in
> LibreOffice. Thank you again for your help implementing ICU for Khmer
> in LibreOffice. I downloaded a recent beta build with your code
> implemented and did some testing – it is great! But it also brought to
> my attention some issues that hamper the usability of the automatic
> breaking for Khmer (and I also believe for Thai – see this discussion:
> http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455).
>  
> 
> 
> An automatic word and line breaker is very necessary for Khmer and
> Thai because traditionally they have no spaces between words, and so
> line-breaking and spell checking require the use of a zero-width space
> between words which is counterintuitive for most native speakers, and
> so spell checking goes widely unused.
> But now with the ICU code you implemented, Thai and Khmer can be
> automatically broken, and the results are quite good. But with its
> implementation in the real world, I have found some issues that I
> wanted to raise and also suggest possible solutions. I write this as
> an end-user, not so much as a programmer, nor do I claim to fully
> understand the inner-workings of ICU and LibreOffice (because I don't!
> ).
> 
> First, I will do my best to explain the current results of the ICU
> break iterator with Khmer:
> 
> Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> 
> Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> 
> Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ|
> ឈ្មោះ|សិវកឥវលិយៈ
> 
> The differences should be clear – the ICU break iterator does not
> break the words with 100% accuracy.
> 
> But, obviously with a dictionary approach, no automatic word breaker
> will ever break correctly 100% of the time. There is no solution that
> will currently automatically break Thai or Khmer 100% correctly (I
> have used Hidden Markov Model breakers, dictionary probability
> breakers, and plain dictionary breakers – none work 100% of the time)
> because, especially for names and places, words in Khmer can just defy
> all rules and patterns. Perhaps in the future, a solution will arise
> that can break Khmer words with 100% accuracy, but at this time, we
> are far from any such solution.
> 
> And this is an important reality to remember, because it
> differentiates Thai and Khmer (and possibly other languages that do
> not use spaces between words) from Western languages such as English,
> where a line-breaker and word-breaker can be correct 100% of the time.
> 
> As an end user, this inability of the ICU break iterator to break
> Khmer words with 100% accuracy causes usability issues when it comes to
> correcting the automatic breaks that are broken in error.
> 
> Here are some reasons why:
> 
>  1. In LibreOffice a user cannot see where the words have been
> broken, they are invisible.
> 
>  2. Therefore, trying to use a U+2060 (No Width Word Joiner) to
> correct an error in order to correctly spell check is very
> difficult, because the user cannot see where to place the
> joiner in order to join the word (as in the example case above
> the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters
> to join it to be treated as one word, but the end user does
> not know this because the breaks are invisible).

FWIW with view->field shading on you should see a little gray mark where
the word joiner exists. At least I do anyway.
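
For what it's worth, the arithmetic behind "three U+2060 characters" can
be sketched in a few lines of Python. The helper names are made up for
illustration; the pieces are whatever fragments the break iterator
produced for the word.

```python
WJ = "\u2060"  # WORD JOINER


def join_as_one_word(pieces):
    """Glue wrongly split pieces back together with WORD JOINER."""
    return WJ.join(pieces)


def joiners_needed(pieces):
    """A word split into n pieces needs n - 1 joiners."""
    return max(len(pieces) - 1, 0)
```

A name that was split into four pieces therefore needs three joiners, one
at each invisible break.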

>  3. Even if LibreOffice were able to change their code so that the
> end user could see the word-breaks, adding three U+2060
> characters is quite laborious just to fix one word so that it
> can be spell checked correctly (as one word, rather than spell
> checked as four individual words).
> 
> 
> 
> One possible solution to this issue is by how the ICU Break Iterator
> interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU
> code was enabled to automatically break Khmer, if an end-user wanted
> to spell check Khmer, they had to manually place U+200B characters to
> separate words. This solution worked quite well, but was
> counterintuitive to most native speakers, because Khmer has no spaces
> (as stated before). But with this solution, an end-user could be sure
> that their document was broken with 100% accuracy, if there was no
> human error (something automatic solutions cannot do – it is more
> along the lines of 80% accurate). What I propose, is that the break
> iterator code in LibreOffice lo

Re: Adding Extension for Experimental Thai Spelling

2012-07-12 Thread sungkhum

Thanks for your reply Caolán,
I have submitted a bug and assigned you to it. I really appreciate you
being willing to look into this!
Here's the bug url:
https://www.libreoffice.org/bugzilla/show_bug.cgi?id=52020
Please let me know if there is anything else I can provide. I have a little
working knowledge of ICU, I helped implement the breakiterator for Khmer by
providing the dictionary and tests, but I am not a programmer by trade.

> There was something similar done in the past IIRC to
> pass around soft-page-break information so that export filters could
> know where the layout last put the page breaks. I forget the details of
> that though.

This would be a very useful feature for Cambodians (and I would assume Thai
as well, although Thai tends to have more programs that currently support
wordbreaking already) - would it be best to seek to do this with an
extension rather than LibreOffice core?

Thanks again for your time,
Nathan





Re: Adding Extension for Experimental Thai Spelling

2012-07-12 Thread Caolán McNamara

On Sun, 2012-07-08 at 08:08 -0700, sungkhum wrote:
> I have two questions: is there a way to have the LibreOffice spelling
> checker (Hunspell) also recognize word-breaks using the ICU break iterator
> for Khmer so that Cambodians no longer have to add zero-width spaces
> manually (as it seems to work for Thai now?)? Currently, lines without
> zero-width spaces are seen as one long word to the spelling checker in
> LibreOffice 3.6. But since the line-breaking is working, it would seem
> breaking words for the spelling checker should also be able to work. Should
> I submit a bug? How should I proceed?

Sounds like a bug really. I mean, hunspell itself generally doesn't do
the parsing of text into words, the app gives each word to hunspell. And
we're *supposed* to be using the icu breakiterator to split words. I
suspect it's a similar bug as this original one.
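
Roughly, the division of labour looks like the Python sketch below. Both
callables are assumptions for illustration: `split_words` stands in for
the break iterator and `is_known` for the dictionary lookup; neither is a
real hunspell or LibreOffice call.

```python
def misspelled_words(text, split_words, is_known):
    """Return (offset, word) pairs for words the checker does not know.

    split_words: assumed callable, text -> list of words, in order.
    is_known:    assumed callable, word -> bool.
    """
    result = []
    offset = 0
    for word in split_words(text):
        start = text.index(word, offset)
        if not is_known(word):
            result.append((start, word))
        offset = start + len(word)
    return result
```

If `split_words` returns the whole line as a single word, the checker only
ever sees that one long "word", which matches the behaviour reported for
Khmer.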

So... sure, file a bug, assign it to me (caol...@redhat.com) and paste a
short two word example text into the bug and indicate where the word
break should be and I'll add a regression test for it and see if it's a
trivial fix for Khmer too now that we're using the latest-and-greatest
icu.

> Also, since many other programs do not incorporate ICU's code, is there a
> way to make the line breaks "real" when a document is saved in another
> format (such as a .doc?). And by "real" I mean that a zero-width space is
> actually added to the text where a line-break should be.

That should at least be theoretically possible, albeit a bit tricky
seeing as the layout code is the bit that knows the width of the page
and does the line breaking, while the export filters don't get to know
that information. There was something similar done in the past IIRC to
pass around soft-page-break information so that export filters could
know where the layout last put the page breaks. I forget the details of
that though.
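
Leaving the layout question aside, the simpler half of the request (put a
real U+200B at each word-break opportunity on export) could look like
this Python sketch. It assumes the break offsets are already known to the
caller, which is exactly the part the export filters lack today.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def materialize_breaks(text, boundaries):
    """Insert ZWSP at each interior boundary offset of text."""
    out = []
    prev = 0
    for pos in sorted(set(boundaries)):
        if not 0 < pos < len(text):
            continue
        # Skip positions that already sit next to a space or ZWSP.
        if text[pos - 1] in (ZWSP, " ") or text[pos] in (ZWSP, " "):
            continue
        out.append(text[prev:pos])
        out.append(ZWSP)
        prev = pos
    out.append(text[prev:])
    return "".join(out)
```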

C.


Re: Adding Extension for Experimental Thai Spelling

2012-07-08 Thread sungkhum

I hope no one minds if I "piggy-back" on this thread. Recently I contributed
to the ICU break iterator for Khmer and it was added to ICU 4.8 (I just
helped with the dictionary, another volunteer did the code). LibreOffice 3.6
added the updated ICU code and now uses the code to line-break Khmer even if
zero-width spaces have not been provided.

I have two questions: is there a way to have the LibreOffice spelling
checker (Hunspell) also recognize word-breaks using the ICU break iterator
for Khmer so that Cambodians no longer have to add zero-width spaces
manually (as it seems to work for Thai now?)? Currently, lines without
zero-width spaces are seen as one long word to the spelling checker in
LibreOffice 3.6. But since the line-breaking is working, it would seem
breaking words for the spelling checker should also be able to work. Should
I submit a bug? How should I proceed?

Also, since many other programs do not incorporate ICU's code, is there a
way to make the line breaks "real" when a document is saved in another
format (such as a .doc?). And by "real" I mean that a zero-width space is
actually added to the text where a line-break should be. This also would
make LibreOffice a great tool for Cambodians, since most do not like to type
spaces between words (since the language traditionally doesn't have spaces),
but would then allow them to use their work with other programs without
having to manually type spaces between words.


Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Richard Wordingham

On Fri, 17 Feb 2012 14:10:21 +
Caolán McNamara  wrote:

> On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
> Indeed, yeah, I suppose, assuming it's as complicated as "Thai", that
> the right direction would be for someone to write for icu new
> dictionary-based breakiterators for the "nod"(?) language and then the
> rather trivial changes to LibreOffice to know about the language in
> order to mark text as that language to bubble that info down to icu

Northern Thai's not quite as simple or standardised as Siamese!  One can
meet (at least) the following spelling systems:

1) Chiangmai phonetics
2) Chiangrai phonetics (different mapping of tones to Siamese spelling
rules)
3) Transliteration from Tai Tham script (probably rare for connected
text)
4) Tai Tham script

However, perhaps dictionary-based break iterators are something to be
treated like dictionaries.  There are several other writing systems
that could probably benefit from them:

Thai script:
  Northern Thai
  NE Thai (for recording songs - use of Siamese tone rules scrambles
  the tonemarks compared to Siamese cognates)

Khmer script:
  Khmer - there's already a project for this set up on SourceForge.
  Pali

Tai Tham script:
  Tai Khuen
  Tai Lue
  Pali

Lao script
  Lao

Tibetan script
  Tibetan

I've a feeling Burmese may also have a need for dictionary based text
breaking, though it's better behaved for syllable breaking than most of
the others listed here.  Shan would come in the same category.

The above list is not exhaustive.  Tai Lue in Lao script probably
belongs in the list.

Not all Thai script writing systems need a break iterator - some of the
minority languages separate words with spaces, but that's partially a
matter of literacy - Thais start writing Thai with interword gaps and
then learn to suppress the gaps.  Pali written in Thai also separates
words with spaces - but Pali has some very long words!

Richard.

Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Caolán McNamara

On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
> I wouldn't expect a dictionary-based line breaker to handle words from
> other languages.  (There's a whole slew of Mon-Khmer languages in
> Thailand, and they mostly use the Thai script when they happen to get
> written.)

Indeed, yeah, I suppose, assuming it's as complicated as "Thai", that the
right direction would be for someone to write for icu new
dictionary-based breakiterators for the "nod"(?) language and then the
rather trivial changes to LibreOffice to know about the language in
order to mark text as that language to bubble that info down to icu.

C.


Re: Adding Extension for Experimental Thai Spelling

2012-02-17 Thread Németh László

Hi,

2012/2/17 Richard Wordingham :
> It's a vast improvement - it gives LibreOffice a real Thai
> spell-checker.  Thank you.  I have one worry for Siamese - Németh László
> suggested that there might be a licensing issue back in
> http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html .

There is no problem with the license of the ICU. I'm also very glad of the fix.

Regards,
László

Re: Adding Extension for Experimental Thai Spelling

2012-02-16 Thread Richard Wordingham

On Tue, 14 Feb 2012 16:19:17 +
Caolán McNamara  wrote:

> I think this change:
> http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12
> should improve matters a lot.

It's a vast improvement - it gives LibreOffice a real Thai
spell-checker.  Thank you.  I have one worry for Siamese - Németh László
suggested that there might be a licensing issue back in
http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html .

If there isn't such an issue, does this mean we can hope to see your
fix in LibreOffice 3.5.1?

> Makes "กุหลาบ" get treated as a single
> word in the unit test there now anyway, though the Northern Thai one
> is still not considered a single word, that might be due to the
> oldish icu we're still using.

I wouldn't expect a dictionary-based line breaker to handle words from
other languages.  (There's a whole slew of Mon-Khmer languages in
Thailand, and they mostly use the Thai script when they happen to get
written.)  I can work my way round the problem using the sticking
plaster of ZWSP and WJ (no-break no-space), and I think some use of
them or an equivalent is inevitable when the sequence of visible
characters doesn't define the breaks.  In particular, after gluing
กุ๊หลาบ together with WJ, Hunspell offered me กุหลาบ as a correction,
which is good.

There may be some rough edges with ZWSP and WJ going into the
dictionary (TBC), but what you've done will justify LibreOffice claiming
a Thai spell checking capability.

Minority language support may not be compatible with libthai - at least
one language uses a combining underline, and some of the mark
combinations used for minority languages would get rejected by the WTT
rules that libthai supports.

Richard.

Re: Adding Extension for Experimental Thai Spelling

2012-02-14 Thread Eike Rathke

Hi,

On Tuesday, 2012-02-14 16:19:17 +, Caolán McNamara wrote:

> We have some customized break iterator rules in LibreOffice, so we're
> using those ones and *not* the built-in icu ones. But we lack a
> customized Thai one, so we're using some ultra-generic word breaking
> stuff for Thai and not going near the special built-into-icu Thai
> iterator :-(

Right, I think the generic customized one dates back from times where
ICU didn't have a specialized Thai break iterator (not sure about that,
but ...), so it should be good to have that switched to ICU for 'th'.

  Eike

-- 
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3  9E96 2F1A D073 293C 05FD



Re: Adding Extension for Experimental Thai Spelling

2012-02-14 Thread Caolán McNamara

On Mon, 2012-02-13 at 22:39 +, Richard Wordingham wrote:
> The spell-checker seems to break up a phrase consisting of just กุหลาบ
> into 3 or 4 words.

Hmm, so I played around with this and here's what I think is the
problem...

We have some customized break iterator rules in LibreOffice, so we're
using those ones and *not* the built-in icu ones. But we lack a
customized Thai one, so we're using some ultra-generic word breaking
stuff for Thai and not going near the special built-into-icu Thai
iterator :-(

I think this change:
http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12
should improve matters a lot. Makes "กุหลาบ" get treated as a single
word in the unit test there now anyway, though the Northern Thai one is
still not considered a single word, that might be due to the oldish icu
we're still using.

After some googling I'm unsure if the "right way to go" to further
improve Thai break iterators is to simply have another go at upgrading
icu to get the latest and greatest there, or for "someone" to have a go
at integrating libthai into LibreOffice and hand off break iteration for
Thai to that. Either way, link above and related unit test give an entry
point to the relevant code.

C. 


Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Richard Wordingham

Thank you to everyone who has offered me advice.

On Mon, 13 Feb 2012 15:08:20 +
Caolán McNamara  wrote:

> I don't think we have any way to override our breakiterators from
> extensions.

Ah well, I'll just have to try to get Thai spell-checking working for
myself and then worry about sharing my changes - assuming I succeed.

> I'd be sort of interested in confirming that what we have right now
> actually works correctly, in the sense that Thai text definitely *is*
> getting run through the special Thai-specific icu word break handler.

It's definitely going through a Siamese-specific word-breaker for
line-breaking.  For example the two-syllable Thai word กุหลาบ
'rose' moves to the next line, but when I convert it to the Northern
Thai form กุ๊หลาบ (not the spelling I'd favour) by adding a
(non-spacing) tone mark, it's promptly broken between lines along the
syllable boundary, although the first syllable does not constitute a
word, at least not one recorded in the Royal Institute Dictionary. I'm
glad to find that inserting U+2060 WJ prevents that break. The
spell-checker seems to break up a phrase consisting of just กุหลาบ into
3 or 4 words.

Richard.

Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Caolán McNamara

On Sat, 2012-02-11 at 16:23 +, Richard Wordingham wrote:
> Is it possible to create an experimental alternative to the Thai
> break iterator that can be shared with other people as a LibreOffice
> extension?

I don't think we have any way to override our breakiterators from
extensions.

FWIW, i18npool/source/breakiterator is where we have our word,
character, sentence and line break iterators implemented. 

Typically we forward everything on to icu to do the real work, albeit
with some customization of the default icu rules.

What I'd *expect* to happen is that text marked as "Thai" should end up
getting broken into words by the default icu word break iterator, which
at http://userguide.icu-project.org/boundaryanalysis claims "ICU
provides a special dictionary-based break iterator."

So, assuming that nothing is simply broken, improving the icu Thai break
iterator should improve LibreOffice "for free".
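
To make the idea of a dictionary-based break iterator concrete, here is a
toy greedy longest-match segmenter in Python. It only illustrates the
general technique; it is not ICU's actual algorithm, and the four-word
dictionary and the function name `segment` are invented for the example.
It shows why a spelling that is missing from the dictionary tends to fall
apart into fragments.

```python
# Toy dictionary (an assumption for this sketch): flower, rose, colour, red.
TOY_DICTIONARY = {"ดอก", "กุหลาบ", "สี", "แดง"}


def segment(text, dictionary):
    """Greedy longest-match segmentation over a non-empty dictionary.

    At each position take the longest dictionary word that matches; if
    nothing matches, emit a single code point and move on.
    """
    max_len = max(len(word) for word in dictionary)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                break
        else:
            candidate = text[i]  # unknown material: fall back to one code point
        pieces.append(candidate)
        i += len(candidate)
    return pieces


if __name__ == "__main__":
    print(segment("ดอกกุหลาบสีแดง", TOY_DICTIONARY))
    # A spelling that is not in the dictionary is split into fragments.
    print(segment("กุ๊หลาบ", TOY_DICTIONARY))
```

A real implementation would need a far larger dictionary and a smarter
fallback for unknown words, which is where most of the tuning effort would
go.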

I'd be sort of interested in confirming that what we have right now
actually works correctly, in the sense that Thai text definitely *is*
getting run through the special Thai-specific icu word break handler.

There is an i18npool/qa/cppunit/test_breakiterator.cxx which we use to
make sure that some existing edge-cases continue to work. If you wanted
to hack that to add some Thai word break tests, that'd be helpful.
Alternatively, or as well, simply pass me some sample text where we
*are* doing the right thing and some where we *aren't*, and I could
populate a test in there with that data and turn the problem into a
developer-friendly "enable this word-break unit test and make it work"
problem.

C.


Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Michael Meeks


On Sat, 2012-02-11 at 16:23 +, Richard Wordingham wrote:
> As I understand it, the lack of a usable Thai spell-checker for
> LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
> break iterator.

In common with many, I know nothing about Thai ;-) but my friend Tim
does - quite possibly he can help you? (Or do you know each other
already?)

Thanks!

Michael

[ who abnormally leaves the context intact for Tim ;-]

>   (I had expected Thai and Khmer to face similar
> problems, for neither has a visible word separator and syllable
> boundaries are often unclear in both.)  Tagging Thai script text as
> Khmer does not work (at least, not in Version 3.4.5); the word
> boundaries are still determined by the Thai break iterator.
> 
> Is it possible to create an experimental alternative to the Thai
> break iterator that can be shared with other people as a LibreOffice
> extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
> (ZWSP) to separate words in the Thai script, but I suspect Thais would
> not.  Also, I can see my first useful version fouling up the
> rendering of pre-existing text.  I can't work out how to create a break
> iterator as an *extension*. Could someone please advise me how, e.g. by
> pointing to the documentation or an example?  I can find documentation
> for *publishing* an extension, but that does not address *creating* an
> extension.
> 
> Richard.

-- 
michael.me...@suse.com  <><, Pseudo Engineer, itinerant idiot


Re: Adding Extension for Experimental Thai Spelling

2012-02-13 Thread Michael Stahl

On 11/02/12 17:23, Richard Wordingham wrote:
> As I understand it, the lack of a usable Thai spell-checker for
> LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
> break iterator.  (I had expected Thai and Khmer to face similar
> problems, for neither has a visible word separator and syllable
> boundaries are often unclear in both.)  Tagging Thai script text as
> Khmer does not work (at least, not in Version 3.4.5); the word
> boundaries are still determined by the Thai break iterator.
> 
> Is it possible to create an experimental alternative to the Thai
> break iterator that can be shared with other people as a LibreOffice
> extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
> (ZWSP) to separate words in the Thai script, but I suspect Thais would
> not.  Also, I can see my first useful version fouling up the
> rendering of pre-existing text.  I can't work out how to create a break
> iterator as an *extension*. Could someone please advise me how, e.g. by
> pointing to the documentation or an example?  I can find documentation
> for *publishing* an extension, but that does not address *creating* an
> extension.

hi Richard,

while i don't know anything about break iterators, since OOo 3.0.1 there
is a new grammar checking API, which AFAIK operates on a whole paragraph
at a time; perhaps that API would make implementing a spelling checker
for such languages easier (if LO cannot determine the word boundaries
then the checker can always do it on its own).

http://wiki.services.openoffice.org/wiki/Grammar_Checking
http://www.openoffice.org/lingucomponent/grammar.html
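
As a rough illustration of that idea, a checker that is handed a whole
paragraph could do its own tokenising, for instance by trusting
user-inserted ZWSP. The Python sketch below is not the OOo grammar checking
API; `check_paragraph` and the word list are hypothetical. It simply reports
the offset and text of every token not found in the list.

```python
import re

# Tokens are maximal runs of anything that is neither U+200B ZERO WIDTH
# SPACE nor ordinary whitespace. Punctuation handling is deliberately
# left out of this sketch.
TOKEN = re.compile(r"[^\u200b\s]+")

# Hypothetical word list standing in for a real spelling dictionary.
KNOWN_WORDS = {"ดอก", "กุหลาบ", "สี", "แดง"}


def check_paragraph(paragraph, known_words):
    """Return (offset, token) for every token missing from known_words."""
    return [
        (match.start(), match.group())
        for match in TOKEN.finditer(paragraph)
        if match.group() not in known_words
    ]


if __name__ == "__main__":
    sample = "ดอก\u200bกุ๊หลาบ\u200bสี\u200bแดง"
    for offset, token in check_paragraph(sample, KNOWN_WORDS):
        print(offset, token)
```

Because the checker sees character offsets into the paragraph, it could
report an error span without relying on the application's own word
boundaries.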

regards,
 michael


Adding Extension for Experimental Thai Spelling

2012-02-11 Thread Richard Wordingham

As I understand it, the lack of a usable Thai spell-checker for
LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
break iterator.  (I had expected Thai and Khmer to face similar
problems, for neither has a visible word separator and syllable
boundaries are often unclear in both.)  Tagging Thai script text as
Khmer does not work (at least, not in Version 3.4.5); the word
boundaries are still determined by the Thai break iterator.

Is it possible to create an experimental alternative to the Thai
break iterator that can be shared with other people as a LibreOffice
extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
(ZWSP) to separate words in the Thai script, but I suspect Thais would
not.  Also, I can see my first useful version fouling up the
rendering of pre-existing text.  I can't work out how to create a break
iterator as an *extension*. Could someone please advise me how, e.g. by
pointing to the documentation or an example?  I can find documentation
for *publishing* an extension, but that does not address *creating* an
extension.

Richard.
