Re: Adding Extension for Experimental Thai Spelling
On Thu, 27 Sep 2012 21:08:13 +0700 Nathan Wells wrote:

> Firstly, you are right, I was mistaken about ICU and the breakiterator
> working for sentences (I just tried it right now and it does work,
> but not with the normal "khan" or "period" of Khmer; rather, it
> works with Latin sentence markers, which is not enough). I had
> thought when we put in the code for the breakiterator that it also
> covered the sentence, but I guess not (I will work towards getting it
> working for Khmer).

It may be worth modifying the CLDR definition - sentence breaks can be customised, though it is presently only done for Greek. However, if you want Khmer *sentence* rather than *clause* breaking, it will need a lot of work - papers are still being published on breaking Thai into sentences (e.g. www.mt-archive.info/Coling-2010-Slayden.pdf ).

> In response to your comments:
>
> > 1) The user always marks word breaks with ZWSP.
> > In this case, the ideal is to switch off the break iterator for the
> > language.
>
> There is some truth to this - and that is why I had it as my last
> option (just turning the whole thing off). But the ICU breakiterator
> for Khmer actually works quite well with normal language - it breaks
> down when there are proper names. So turning it off is an option, but
> not the most ideal solution. Some users will continue to always mark
> breaks with a ZWSP (for full control), but I also think having the
> option to turn it off for more complex sentences would be ideal.
>
> > 2) The user never marks word breaks.
> > In this case, the user is totally dependent on the break iterator,
> > and cannot be helped when it fails.
>
> As I said above, I think a both/and solution would be ideal for Khmer.
> But if in the end it would work better for Thai to have an "off" and
> "on" option only, that would be fine for Khmer as well for now, until
> we can come up with a more ideal solution.
>
> > 3) The user only marks word breaks and non-word breaks when the
> > iterator fails.
>
> The problem with this in Khmer is the user cannot tell when the
> breakiterator fails, unless it is on a line-break. A word could be
> broken up into three parts and the user would never know it.

I usually notice iterator failures in Thai with unrecognised words, which prompt red ink over strange extents. Usually the words are not recognised because they're misspelt, but not always. The problem I see in Thai is usually not so much extra word boundaries as misplaced word boundaries.

> Actually, if users could see where the
> breakiterator is breaking words, that would simplify things a lot.

That is a very significant observation.

> The only problem with this would be at the beginning of a document or
> the beginning of any new "re-syncing" segment because you might run
> into something like this:
> User input (example in English so others can make sense of it I hope):
> wordwordwordwordword.
> How the sentence is broken up by the breakiterator: wo r d word word
> wo rd word.
> User adds ZWSP to fix broken word on line-break: wo r d word word
> ZWSPwordword.

This example confuses me. The problem here seems to be extra word breaks rather than missing word breaks, and I don't see how confirming a word break helps.

> But user has no idea the first word is broken incorrectly and that it
> is also spelled incorrectly.
> This is why it would be best (I think) as Martin suggested that when
> a ZWSP is detected it also turn off break iteration for the previous
> words up until a re-sync point. This would practically give the user
> an "off" option for the whole document if they so chose, and without
> the confusion of having to find some option in the Tools menu to turn
> it on or off - it would just be automatic, depending on the user's
> habit.

I was evidently not clear enough. In the example above, 'wordwordwordwordword' is what I would call a dictionariless word - a word-breaker without a dictionary (e.g. a shell's parser) would see it as just one 'word'. Therefore, once ZWSP is inserted and word-breaking disabled, dictionary-based word-breaking is not applied to wordwordwordZWSPwordword, and, typically, red squiggles appear under wordwordword and wordword. The boundary may be revealed by a phase discontinuity or gap in the squiggle. Under the proposed scheme, the user has to introduce another three ZWSPs even if the dictionary contains all the words.

> I agree with this:
>
> > Considering these four use cases, it seems simplest to let ZWSP, WJ
> > and ZWNBSP disable the iterator for the extent of the
> > dictionariless word in which it occurs.
>
> Except, it also should disable the breakiterator up to the previous
> re-sync point...

But that is what I meant!

> But actually, there is a rule in ICU for the MAIYAMOK
> so unless that is not working properly, I am not sure why LibreOffice
> doesn't break correctly...

I'll have to look further into this - and check that the misbehaviour is still happening. Squiggl
Re: Adding Extension for Experimental Thai Spelling
On Thu, 27 Sep 2012 11:52:26 +0700 Nathan Wells wrote:

>> 1. If you are shutting off the ICU breakiterator for text following,
>> we should probably also do it for text preceding. Thus if there is a
>> ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break
>> iteration is disabled for the whole sentence.

> Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU
> break iteration should be disabled for the whole sentence.

What is the logic of this? The use cases I see are:

1) The user always marks word breaks with ZWSP.
In this case, the ideal is to switch off the break iterator for the language.

2) The user never marks word breaks.
In this case, the user is totally dependent on the break iterator, and cannot be helped when it fails.

3) The user only marks word breaks and non-word breaks when the iterator fails.
In this case, the iterator need only be switched off from the point of override until it can clearly re-synch. The obvious re-synching points are word-external punctuation, such as end-of-line, white space, quotation marks, commas and dandas (and as dandas I would include U+0E2F THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5 KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai ฯลฯ and ฯเปฯ). Now, it may be easier to explain the rule if it applies to the whole 'word' - for what we are looking at is pretty much a 'word' as understood by dictionariless editors.

4) Different parts of the text come from different sources - some mark word breaks, others expect the application to correctly identify them.
A ZWSP in a chunk of text would then tag the text as having come from a user in case 1 or 3; we have no reliable way of distinguishing the two cases. A WJ (U+2060) or ZWNBSP (U+FEFF) (when not a BOM, so paragraph-initial is suspect) would strongly suggest use case 3 - but might occur in use case 1 if the user has had to fight a break iterator.

(end of use cases)

Considering these four use cases, it seems simplest to let ZWSP, WJ and ZWNBSP disable the iterator for the extent of the dictionariless word in which it occurs.

What is the definition of an ICU sentence boundary? I see no evidence from CLDR 2.9 that it should be even approximately right for Khmer (or Thai). Splitting Thai text into sentences is known to be challenging - we can therefore expect different applications to split text differently.

The one downside I can see to my suggestion is that if all word boundaries are marked, switching the iterator off dictionariless word by dictionariless word will require slightly greater use of WJ, for a ZWSP later in the sentence will not necessarily be in the same dictionariless word.

A related issue that seems not to be handled is the repetition mark U+0E46 THAI CHARACTER MAIYAMOK. It should be separated from the preceding alphabetic characters by a space, but LibreOffice doesn't recognise the sequence as a possible continuation of the word. Sometimes it is a necessary part of a word. I don't know what the situation is in Khmer.

Richard.
___
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice
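The "dictionariless word" rule proposed above can be illustrated with a minimal sketch. It is not LibreOffice or ICU code: the function name, the regular expression and the exact re-sync set are assumptions made for illustration, and the suggested exception for Thai ฯลฯ and ฯเปฯ is not handled. The text is cut at re-sync points (white space, commas, quotation marks, U+0E2F, U+17D5), and each resulting run is flagged as "iterator off" if it contains a ZWSP, WJ or ZWNBSP.

```python
import re

ZWSP = "\u200b"    # ZERO WIDTH SPACE: explicit word break
WJ = "\u2060"      # WORD JOINER: explicit non-break
ZWNBSP = "\ufeff"  # ZERO WIDTH NO-BREAK SPACE (also used as a BOM)
OVERRIDES = {ZWSP, WJ, ZWNBSP}

# Assumed re-sync set: white space, comma, quotation marks,
# U+0E2F THAI CHARACTER PAIYANNOI and U+17D5 KHMER SIGN BARIYOOSAN.
RESYNC = re.compile("[\\s,\"'\u201c\u201d\u0e2f\u17d5]+")


def dictionariless_words(text):
    """Return (start, end, iterator_enabled) for each run between re-sync points.

    iterator_enabled is False when the run contains ZWSP, WJ or ZWNBSP,
    i.e. when the user has taken manual control of that run.
    """
    runs = []
    pos = 0
    for match in RESYNC.finditer(text):
        if match.start() > pos:
            run = text[pos:match.start()]
            runs.append((pos, match.start(), not (OVERRIDES & set(run))))
        pos = match.end()
    if pos < len(text):
        runs.append((pos, len(text), not (OVERRIDES & set(text[pos:]))))
    return runs
```

With this rule, a ZWSP switches the iterator off both before and after itself, but only within its own run, so a ZWSP later in the sentence does not affect earlier runs.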
Re: Adding Extension for Experimental Thai Spelling
Dear Nathan,

> Here are some new ideas, ordered by desirability, with number one being the
> most desired, to number three being the least.
>
> 1) When a zero-width space is detected (U+200B), shut off ICU breakiterator
> for Khmer spell checking for characters following the zero-width space
> until encounters real space (U+0020) or end of sentence (detect end of
> sentence using ICU Sentence Boundary).

I think this is a good direction to head. I have two follow-on comments:

1. If you are shutting off the ICU breakiterator for text following, we should probably also do it for text preceding. Thus if there is a ZWSP or ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled for the whole sentence.

2. Why limit this to Khmer? I suspect as a model it should work for any non-space broken text.

Yours,
Martin
Re: Adding Extension for Experimental Thai Spelling
Thanks for your input Richard,

Firstly, you are right, I was mistaken about ICU and the breakiterator working for sentences (I just tried it right now and it does work, but not with the normal "khan" or "period" of Khmer; rather, it works with Latin sentence markers, which is not enough). I had thought when we put in the code for the breakiterator that it also covered the sentence, but I guess not (I will work towards getting it working for Khmer).

In response to your comments:

> 1) The user always marks word breaks with ZWSP.
> In this case, the ideal is to switch off the break iterator for the
> language.

There is some truth to this - and that is why I had it as my last option (just turning the whole thing off). But the ICU breakiterator for Khmer actually works quite well with normal language - it breaks down when there are proper names. So turning it off is an option, but not the most ideal solution. Some users will continue to always mark breaks with a ZWSP (for full control), but I also think having the option to turn it off for more complex sentences would be ideal.

> 2) The user never marks word breaks.
> In this case, the user is totally dependent on the break iterator, and
> cannot be helped when it fails.

As I said above, I think a both/and solution would be ideal for Khmer. But if in the end it would work better for Thai to have an "off" and "on" option only, that would be fine for Khmer as well for now, until we can come up with a more ideal solution.

> 3) The user only marks word breaks and non-word breaks when the iterator
> fails.

The problem with this in Khmer is the user cannot tell when the breakiterator fails, unless it is on a line-break. A word could be broken up into three parts and the user would never know it. This is why the issue is so complex. Actually, if users could see where the breakiterator is breaking words, that would simplify things a lot. Though I still think the option to turn the breakiterator "on" or "off" for certain sentences would be ideal (especially sentences with a ton of proper nouns where the ICU breakiterator for Khmer has the most trouble).

As far as finding re-syncing points (when to turn the breakiterator back on when it is turned off by a ZWSP) I agree with you:

> The obvious re-synching points
> are word external punctuation, such as end-of-line, white space,
> quotation marks, commas and dandas (and as dandas I would include U+0E2F
> THAI CHARACTER PAIYANNOI in its role as angkhandeaw, as well as U+17D5
> KHMER SIGN BARIYOOSAN, though an exception may be worthwhile for Thai
> ฯลฯ and ฯเปฯ).

The only problem with this would be at the beginning of a document or the beginning of any new "re-syncing" segment because you might run into something like this:

User input (example in English so others can make sense of it I hope): wordwordwordwordword.
How the sentence is broken up by the breakiterator: wo r d word word wo rd word.
User adds ZWSP to fix broken word on line-break: wo r d word word ZWSPwordword.

But the user has no idea the first word is broken incorrectly and that it is also spelled incorrectly.

This is why it would be best (I think), as Martin suggested, that when a ZWSP is detected it also turn off break iteration for the previous words up until a re-sync point. This would practically give the user an "off" option for the whole document if they so chose, and without the confusion of having to find some option in the Tools menu to turn it on or off - it would just be automatic, depending on the user's habit.

I agree with this:

> Considering these four use cases, it seems simplest to let ZWSP, WJ and
> ZWNBSP disable the iterator for the extent of the dictionariless word
> in which it occurs.

Except, it also should disable the breakiterator up to the previous re-sync point to enable users to functionally "turn off" the breakiterator if they so choose (for Khmer this is necessary because a book editor like myself will want to place the breaks manually and not let the breakiterator do anything automatically - but the feature is nice for the casual user because it is much faster and more intuitive for Cambodians not to type spaces between words).

> A related issue that seems not to being handled is repetition mark U+0E46
> THAI CHARACTER MAIYAMOK. It should be separated from the preceding
> alphabetic characters by a space, but Libreoffice doesn't recognised
> the sequence as a possible continuation of the word. Sometimes it
> is a necessary part of a word. I don't know what the situation is in
> Khmer.

In Khmer the repeat character (U+17D7 LEK TOO) is not separated from the preceding word by a space, but is connected, so this is not an issue for us. But actually, there is a rule in ICU for the MAIYAMOK, so unless that is not working properly, I am not sure why LibreOffice doesn't break correctly...

Here's the code from ICU4C for the Thai MAIYAMOK from dictbe.cpp if anyone is interested...

if (uc == THAI_MAIYAMOK
Re: Adding Extension for Experimental Thai Spelling
Thanks Martin,

> 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.

Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break iteration should be disabled for the whole sentence.

> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.

I am only limiting it to Khmer because that is my expertise and I didn't want to cause problems for other languages - but it is possible these changes would be beneficial for other languages that are not broken by spaces (like Thai).

Thanks,

Nathan
Re: Adding Extension for Experimental Thai Spelling
Hello Again,

Thank you all for your input! This is a deeper problem than I first thought... sorry for the delayed response, but I hope a solution can be found, even though the current ICU breakiterator is not at 100% for Khmer.

Here are some new ideas, ordered by desirability, with number one being the most desired, to number three being the least.

1) When a zero-width space (U+200B) is detected, shut off the ICU breakiterator for Khmer spell checking for the characters following the zero-width space until it encounters a real space (U+0020) or the end of the sentence (detect the end of the sentence using the ICU Sentence Boundary).

2) Disable use of the ICU breakiterator for Khmer spell checking by default, but allow users to enable it by adding a check-box to enable the ICU breakiterator in the Tools > Options > Language Settings > Writing Aids > Options dialogue when a Khmer Hunspell dictionary is present ( http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version ).

3) Disable use of the ICU breakiterator for Khmer spell checking until the ICU breakiterator for Khmer is more accurate.

Currently, with the ICU breakiterator for Khmer enabled in LibreOffice 3.6, a lot of spelling errors go unnoticed because the ICU breakiterator breaks words up incorrectly. So hopefully we can find a solution that will work with the current ICU breakiterator - though with ICU 50.1 the breakiterator for Khmer will have some improvements. But I do feel that if solution 1 or 2 (or a better idea from someone else) cannot be implemented, the breakiterator for spelling with Khmer should be turned off in LibreOffice until the ICU breakiterator for Khmer is more accurate.

Thanks again for your help and time, your input is greatly appreciated!

Sincerely,

Nathan
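A minimal sketch of option 1 from Nathan's message (switch the iterator off from a ZWSP up to the next U+0020 or sentence end) might look as follows. It is an illustration only, not LibreOffice code: the function name is hypothetical, and `sentence_ends` is an assumed parameter standing in for offsets that a sentence-boundary detector would supply.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def suppressed_spans(text, sentence_ends=()):
    """Return (start, end) offsets where automatic breaking is switched off.

    Each span starts at a ZWSP and runs to the next real space (U+0020),
    the next supplied sentence end, or the end of the text, whichever
    comes first.
    """
    stops = sorted(set(sentence_ends) | {len(text)})
    spans = []
    pos = 0
    while True:
        start = text.find(ZWSP, pos)
        if start == -1:
            break
        candidates = [s for s in stops if s > start]
        space = text.find(" ", start)
        if space != -1:
            candidates.append(space)
        end = min(candidates)
        spans.append((start, end))
        pos = end  # resume scanning after the suppressed span
    return spans
```

Note that this variant only looks forward from the ZWSP; text before the ZWSP in the same run is still broken automatically, which is the gap the follow-up discussion in this thread addresses.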
Re: Adding Extension for Experimental Thai Spelling
On Thu, 26 Jul 2012 16:33:00 +0700 Martin Hosken wrote:

> 1. use of U+2060 makes string searching and spell checking harder
> (unless WJ chars are stripped for searching and spell checking). They
> are not part of the spelling of a word, so their introduction in the
> underlying text stream is problematic for other text processing
> processes (like searching as mentioned). This is less of an issue for
> U+200B ZWSP because that occurs between words and searching across
> word boundaries is a rarer activity. Likewise spell checking across
> word boundaries isn't really needed.

U+2060 WJ should definitely be skipped for searching and, once it has done its gluing job, for spell-checking look-up, just like U+00AD SOFT HYPHEN. Both are undoubtedly completely ignorable for collation and therefore for UCA (Unicode Collation Algorithm) search.

> Now what happens if I want to put zw around a word that occurs < 20
> chars after my last zw? The on off nature of the zw has now been
> inverted. One option is to say that zw must always occur in pairs and
> you would have to bracket your first or second word there. But then
> management of which zw is on and which is off will get confusing for
> users.

I think that is the wrong way of looking at it. Various characters, some ZWSP, others more natural, such as SP, tell the break iterators where some word boundaries are. The rule we would have is that the break iterator should not try to break runs of less than, say, 20 characters if one of the boundaries is provided by ZWSP. I am not proposing that we limit how many breaks it makes in a run - 21 characters could be broken into seven words.

The short runs the break iterator is prohibited from breaking can still be checked for spelling. If they are not words, then the user can respond to the red wiggly line appropriately, e.g. by putting extra word breaks in.

In the example you gave, one would have to split the words between the delimited words. I think the users must accept that - the rule we would be working with is that the break iterator does not break short runs created by inserted ZWSP, and that is a simple rule to understand.

I suppose there may be some question of what to count - base consonants perhaps? (In Unicode jargon, that would be extended grapheme clusters.) That might be a luxury feature we never need to add.

Richard.
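The short-run rule described above can be sketched as follows. This is an illustration under stated assumptions, not an implementation from ICU or LibreOffice: the function names are hypothetical, only U+0020 and U+200B are treated as explicit boundaries, and length is counted in code points because the message leaves the counting unit open.

```python
import re

ZWSP = "\u200b"  # ZERO WIDTH SPACE
MIN_RUN = 20     # "say, 20 characters"; the threshold is a placeholder


def plan_runs(text, min_run=MIN_RUN):
    """Split text at spaces and ZWSPs and return (run, may_break) pairs.

    may_break is False when the run is shorter than min_run and at least
    one of its two boundaries is a ZWSP, so the break iterator would
    leave that run whole (it can still be spell-checked as one word).
    """
    # A capturing group makes re.split keep the delimiters at odd indices.
    parts = re.split("([ \u200b])", text)
    plans = []
    prev_delim = ""  # empty string stands for start of text
    for i in range(0, len(parts), 2):
        run = parts[i]
        next_delim = parts[i + 1] if i + 1 < len(parts) else ""
        if run:
            zwsp_bounded = ZWSP in (prev_delim, next_delim)
            plans.append((run, not (zwsp_bounded and len(run) < min_run)))
        prev_delim = next_delim
    return plans


def lookup_form(word):
    """Drop U+2060 WJ and U+00AD SOFT HYPHEN before search or dictionary look-up."""
    return word.replace("\u2060", "").replace("\u00ad", "")
```

For example, in `"abc" + ZWSP + "defgh ijk"` the runs `abc` and `defgh` are left whole because each touches a ZWSP, while `ijk` is bounded only by a space and the end of the text and so stays eligible for automatic breaking.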
Re: Adding Extension for Experimental Thai Spelling
Dear All, > > An automatic word and line breaker is very necessary for Khmer and > > Thai because traditionally they have no spaces between words, and so > > line-breaking and spell checking require the use of a zero-width space > > between words which is counterintuitive for most native speakers, and > > so spell checking goes widely unused. I agree that automatic word breaking is a good thing and I am relieved to see that libreoffice does it based on language selection and not on automatic language guessing based on scripts. There are more languages that use Thai script and Khmer script than just Thai and Khmer. So one of my fears is already alleviated :) > > But now with the ICU code you implemented, Thai and Khmer can be > > automatically broken, and the results are quite good. But with its > > implementation in the real world, I have found some issues that I > > wanted to raise and also suggest possible solutions. I write this as > > an end-user, not so much as a programmer, nor do I claim to fully > > understand the inner-workings of ICU and LibreOffice (because I don't! > > ). > > > > First, I will do my best to explain the current results of the ICU > > break iterator with Khmer: > > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ > > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ > > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ| > > ឈ្មោះ|សិវកឥវលិយៈ > > > > The differences should be clear – the ICU break iterator does not > > break the words with 100% accuracy. > > > > One possible solution to this issue is by how the ICU Break Iterator > > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU > > code was enabled to automatically break Khmer, if an end-user wanted > > to spell check Khmer, they had to manually place U+200B characters to > > separate words. 
This solution worked quite well, but was > > counterintuitive to most native speakers, because Khmer has no spaces > > (as stated before). But with this solution, an end-user could be sure > > that their document was broken with 100% accuracy, if there was no > > human error (something automatic solutions cannot do – it is more > > along the lines of 80% accurate). What I propose is that the break > > iterator code in LibreOffice looks for U+200B characters in a given > > string and considers them as a sign to NOT automatically break, but to > > allow the end-user full control to manually break words. Let me > > explain: > > > > 1. The code starts processing the text and automatically breaking > > it until it comes across a U+200B character. If one is found, > > it searches to see if there are any additional U+200B or > > U+0020 characters in the following 20 characters (or so), and > > if there are, the break iterator skips over those characters > > and starts again from the second U+200B character (or U+0020, > > but a U+0020 character would only signify the “close” of the > > manual break because sometimes a phrase will end and there > > will be an actual space – so if the word that the user wants > > to manually break has a “real” U+0020 space at the end of it, > > then the user does not need to put an additional U+200B > > character to close it) which then repeats, looking for U+200B > > characters etc. > > > > 2. This would allow end-users to choose to manually break their > > whole document so they can have precise control, as well as > > allow end-users to place U+200B characters around names of > > people, places or transliterations in order to tell the break > > iterator to not try to break those words. In principle I like this approach. I like the idea of being able to force breaks and non-breaks. But I don't think we are quite there with this solution yet. Here are my difficulties with it: 1.
Use of U+2060 (WJ) makes string searching and spell checking harder (unless WJ chars are stripped for searching and spell checking). They are not part of the spelling of a word, so their introduction in the underlying text stream is problematic for other text processing (like searching, as mentioned). This is less of an issue for U+200B ZWSP because that occurs between words and searching across word boundaries is a rarer activity. Likewise spell checking across word boundaries isn't really needed.
2. How do we come up with the range of what is considered a word between two ZWSP chars as opposed to two words? How close to the end of a string must a ZWSP occur to disable all breaking before the end of the string? Does "abcdefuvwxyz" block all breaks in the string?
I think we need to think harder (deeper) about the use of ZWSP in this way and see if we can come up with something with a little less ambiguity. Having said that, I think we are going to have to think really hard, because I don't think
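The first difficulty (WJ characters getting in the way of searching and spell checking) is commonly handled by normalising text before lookup. A minimal Python sketch, with hypothetical helper names, assuming that ZWSP and ordinary whitespace separate words and that U+2060 is never part of a spelling:

```python
WJ = "\u2060"    # WORD JOINER
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def words_for_checking(text):
    """Split on ZWSP or whitespace, then drop word joiners from each piece.

    Python's str.split() does not treat U+200B as whitespace, so ZWSP is
    mapped to an ordinary space first.
    """
    pieces = text.replace(ZWSP, " ").split()
    stripped = (piece.replace(WJ, "") for piece in pieces)
    return [word for word in stripped if word]
```

The stored text keeps its joiners; only the keys handed to the search or dictionary lookup are cleaned.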
Re: Adding Extension for Experimental Thai Spelling
Thanks for your reply. Yes, a "view->word boundaries" mode would be very helpful (or even incorporating the current "view->field shading" to include viewing 'gray marks' at the automatic ICU breaking so that users can see what is being done). Would this be hard to implement? Also, we are making some changes to the ICU break iterator dictionary for Khmer - and I've heard there will be some changes in ICU 50 which should improve results for Khmer. If anyone has any ideas - it would be appreciated. Thanks! Nathan On Wed, Jul 25, 2012 at 8:41 PM, Caolán McNamara wrote: > I'll cc this to the list if you don't mind, in order to archive it. I > have no immediate great ideas. But I wonder if a "view->word boundaries" > mode would be helpful, i.e. something that indicates the boundaries of > the words that the software thinks exist. > > On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote: > > > > I hope you don't mind if I write and ask some more questions and ask > > for additional help in making the break iterator more functional in > > LibreOffice. Thank you again for your help implementing ICU for Khmer > > in LibreOffice. I downloaded a recent beta build with your code > > implemented and did some testing – it is great! But it also brought to > > my attention some issues that hamper the useability of the automatic > > breaking for Khmer (and I also believe for Thai – see this discussion > > - > > > http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455 > ). > > > > > > An automatic word and line breaker is very necessary for Khmer and > > Thai because traditionally they have no spaces between words, and so > > line-breaking and spell checking require the use of a zero-width space > > between words which is counterintuitive for most native speakers, and > > so spell checking goes widely unused. > > But now with the ICU code you implemented, Thai and Khmer can be > > automatically broken, and the results are quite good. 
But with its > > implementation in the real world, I have found some issues that I > > wanted to raise and also suggest possible solutions. I write this as > > an end-user, not so much as a programmer, nor do I claim to fully > > understand the inner-workings of ICU and LibreOffice (because I don't! > > ). > > > > First, I will do my best to explain the current results of the ICU > > break iterator with Khmer: > > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ > > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ > > > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ| > > ឈ្មោះ|សិវកឥវលិយៈ > > > > The differences should be clear – the ICU break iterator does not > > break the words with 100% accuracy. > > > > But, obviously with a dictionary approach, no automatic word breaker > > will ever break correctly 100% of the time. There is no solution that > > will currently automatically break Thai or Khmer 100% correctly (I > > have used, Hidden Markov Model breakers, dictionary probability > > breakers, and plain dictionary breakers – none work 100% of a time) > > because, especially for names and places, words in Khmer can just defy > > all rules and patterns. Perhaps in the future, a solution will arise > > that can break Khmer words with 100% accuracy, but at this time, we > > are far from any such solution. > > > > And this is an important reality to remember, because it > > differentiates Thai and Khmer (and possibly other languages that do > > not use spaces between words) from Western languages such as English, > > where a line-breaker and word-breaker can be correct 100% of the time. > > > > As an end user, this inability of the ICU break iterator to break > > Khmer words with 100% causes usability issues when it comes to > > correcting the automatic breaks that are broken in error. > > > > Here are some reasons why: > > > > 1. 
In LibreOffice a user cannot see where the words have been > > broken; they are invisible. > > > > 2. Therefore, trying to use a U+2060 (No Width Word Joiner) to > > correct an error in order to correctly spell check is very > > difficult, because the user cannot see where to place the > > joiner in order to join the word (as in the example case above, > > the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters > > to join it to be treated as one word, but the end user does > > not know this because the breaks are invisible). > > FWIW with view->field shading on you should see a little gray mark where > the word joiner exists. At least I do anyway. > > > 3. Even if LibreOffice were able to change their code so that the > > end user could see the word-breaks, adding three U+2060 > > characters is quite laborious just to fix one word so that it > > can be spell checked correctly (as one word, rather than spell > > checked as four individual words). > > > > > > > > One
Re: Adding Extension for Experimental Thai Spelling
I'll cc this to the list if you don't mind, in order to archive it. I have no immediate great ideas. But I wonder if a "view->word boundaries" mode would be helpful, i.e. something that indicates the boundaries of the words that the software thinks exist. On Sun, 2012-07-15 at 21:40 +0700, Nathan Wells wrote: > > I hope you don't mind if I write and ask some more questions and ask > for additional help in making the break iterator more functional in > LibreOffice. Thank you again for your help implementing ICU for Khmer > in LibreOffice. I downloaded a recent beta build with your code > implemented and did some testing – it is great! But it also brought to > my attention some issues that hamper the useability of the automatic > breaking for Khmer (and I also believe for Thai – see this discussion > - > http://www.thaivisa.com/forum/topic/444360-thai-in-openoffice-on-ubuntu-lucid-lynx/#entry5160455). > > > > An automatic word and line breaker is very necessary for Khmer and > Thai because traditionally they have no spaces between words, and so > line-breaking and spell checking require the use of a zero-width space > between words which is counterintuitive for most native speakers, and > so spell checking goes widely unused. > But now with the ICU code you implemented, Thai and Khmer can be > automatically broken, and the results are quite good. But with its > implementation in the real world, I have found some issues that I > wanted to raise and also suggest possible solutions. I write this as > an end-user, not so much as a programmer, nor do I claim to fully > understand the inner-workings of ICU and LibreOffice (because I don't! > ). 
> > First, I will do my best to explain the current results of the ICU > break iterator with Khmer: > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ > > Compared with the sentence manually broken: មាន|ប្រាជ្ញា|ឈ្លាសវៃ| > ឈ្មោះ|សិវកឥវលិយៈ > > The differences should be clear – the ICU break iterator does not > break the words with 100% accuracy. > > But, obviously with a dictionary approach, no automatic word breaker > will ever break correctly 100% of the time. There is no solution that > will currently automatically break Thai or Khmer 100% correctly (I > have used, Hidden Markov Model breakers, dictionary probability > breakers, and plain dictionary breakers – none work 100% of a time) > because, especially for names and places, words in Khmer can just defy > all rules and patterns. Perhaps in the future, a solution will arise > that can break Khmer words with 100% accuracy, but at this time, we > are far from any such solution. > > And this is an important reality to remember, because it > differentiates Thai and Khmer (and possibly other languages that do > not use spaces between words) from Western languages such as English, > where a line-breaker and word-breaker can be correct 100% of the time. > > As an end user, this inability of the ICU break iterator to break > Khmer words with 100% causes usability issues when it comes to > correcting the automatic breaks that are broken in error. > > Here are some reasons why: > > 1. In LibreOffice a user cannot see where the words have been > broken, they are invisible. > > 2. 
Therefore, trying to use a U+2060 (No Width Word Joiner) to > correct an error in order to correctly spell check is very > difficult, because the user cannot see where to place the > joiner in order to join the word (as in the example case above, > the word សិវ|កឥ|វលិ|យៈ actually needs three U+2060 characters > to join it to be treated as one word, but the end user does > not know this because the breaks are invisible). FWIW with view->field shading on you should see a little gray mark where the word joiner exists. At least I do anyway. > 3. Even if LibreOffice were able to change their code so that the > end user could see the word-breaks, adding three U+2060 > characters is quite laborious just to fix one word so that it > can be spell checked correctly (as one word, rather than spell > checked as four individual words). > > > > One possible solution to this issue is by how the ICU Break Iterator > interacts with zero-width spaces (U+200B) in LibreOffice. Before ICU > code was enabled to automatically break Khmer, if an end-user wanted > to spell check Khmer, they had to manually place U+200B characters to > separate words. This solution worked quite well, but was > counterintuitive to most native speakers, because Khmer has no spaces > (as stated before). But with this solution, an end-user could be sure > that their document was broken with 100% accuracy, if there was no > human error (something automatic solutions cannot do – it is more > along the lines of 80% accurate). What I propose is that the break > iterator code in LibreOffice lo
Re: Adding Extension for Experimental Thai Spelling
Thanks for your reply Caolán, I have submitted a bug and assigned you to it. I really appreciate you being willing to look into this! Here's the bug url: https://www.libreoffice.org/bugzilla/show_bug.cgi?id=52020 Please let me know if there is anything else I can provide. I have a little working knowledge of ICU, I helped implement the breakiterator for Khmer by providing the dictionary and tests, but I am not a programmer by trade. > There was something similar done in the past IIRC to > pass around soft-page-break information so that export filters could > know where the layout last put the page breaks. I forget the details of > that though. This would be a very useful feature for Cambodians (and I would assume Thai as well, although Thai tends to have more programs that currently support wordbreaking already) - would it be best to seek to do this with an extension rather than LibreOffice core? Thanks again for your time, Nathan On Thu, Jul 12, 2012 at 11:10 PM, Caolán McNamara [via Document Foundation Mail Archive] wrote: > On Sun, 2012-07-08 at 08:08 -0700, sungkhum wrote: > > I have two questions: is there a way to have the LibreOffice spelling > > checker (Hunspell) also recognize word-breaks using the ICU break > iterator > > for Khmer so that Cambodians no longer have to add zero-width spaces > > manually (as it seems to work for Thai now?)? Currently, lines without > > zero-width spaces are seen as one long word to the spelling checker in > > LibreOffice 3.6. But since the line-breaking is working, it would seem > > breaking words for the spelling checker should also be able to work. > Should > > I submit a bug? How should I proceed? > > Sounds like a bug really. I mean, hunspell itself generally doesn't do > the parsing of text into words, the app gives each word to hunspell. And > we're *supposed* to be using the icu breakiterator to split words. I > suspect its a similar bug as this original one. > > So... 
sure, file a bug, assign it to me ([hidden > email]) > and paste a > short two word example text into the bug and indicate where the word > break should be and I'll add a regression test for it and see if its a > trivial fix for Khmer too now that we're using the latest-and-greatest > icu. > > > Also, since many other programs do not incorporate ICU's code, is there > a > > way to make the line breaks "real" when a document is saved in another > > format (such as a .doc?). And by "real" I mean that a zero-width space > is > > actually added to the text where a line-break should be. > > That should at least be theoretically possible, albeit a bit tricky > seeing as the layout code is the bit that knows the width of the page > and does the line breaking, while the export filters don't get to know > that information. There was something similar done in the past IIRC to > pass around soft-page-break information so that export filters could > know where the layout last put the page breaks. I forget the details of > that though. > > C.
Re: Adding Extension for Experimental Thai Spelling
On Sun, 2012-07-08 at 08:08 -0700, sungkhum wrote:
> I have two questions: is there a way to have the LibreOffice spelling checker (Hunspell) also recognize word-breaks using the ICU break iterator for Khmer so that Cambodians no longer have to add zero-width spaces manually (as it seems to work for Thai now?)? Currently, lines without zero-width spaces are seen as one long word to the spelling checker in LibreOffice 3.6. But since the line-breaking is working, it would seem breaking words for the spelling checker should also be able to work. Should I submit a bug? How should I proceed?

Sounds like a bug really. I mean, hunspell itself generally doesn't do the parsing of text into words, the app gives each word to hunspell. And we're *supposed* to be using the icu breakiterator to split words. I suspect it's a similar bug to this original one.

So... sure, file a bug, assign it to me (caol...@redhat.com) and paste a short two-word example text into the bug and indicate where the word break should be, and I'll add a regression test for it and see if it's a trivial fix for Khmer too now that we're using the latest-and-greatest icu.

> Also, since many other programs do not incorporate ICU's code, is there a way to make the line breaks "real" when a document is saved in another format (such as a .doc?). And by "real" I mean that a zero-width space is actually added to the text where a line-break should be.

That should at least be theoretically possible, albeit a bit tricky seeing as the layout code is the bit that knows the width of the page and does the line breaking, while the export filters don't get to know that information. There was something similar done in the past IIRC to pass around soft-page-break information so that export filters could know where the layout last put the page breaks. I forget the details of that though.

C.
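To make the division of labour concrete: the application segments the text, and the checker only ever sees one word at a time. The Python sketch below is hypothetical throughout. `misspelled`, `whole_line`, `split_on_zwsp` and the set `known_words` merely stand in for the application's word breaker and for the dictionary lookup; none of this is the LibreOffice or Hunspell API.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def misspelled(text, segment, known_words):
    """Return the words produced by segment(text) that are not in known_words."""
    return [word for word in segment(text) if word and word not in known_words]


def whole_line(text):
    """No word breaking at all: the whole line is handed over as one 'word'."""
    return [text]


def split_on_zwsp(text):
    """Word breaking driven purely by manually inserted zero-width spaces."""
    return text.split(ZWSP)
```

With `known_words = {"rose", "red"}`, the unbroken line "rosered" is reported as a single misspelling under `whole_line`, whereas the same text with a ZWSP between the two words passes under `split_on_zwsp`. That is the symptom described in the report: the checker is fine, the segmentation feeding it is not.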
Re: Adding Extension for Experimental Thai Spelling
I hope no one minds if I "piggy-back" on this thread. Recently I contributed to the ICU break iterator for Khmer and it was added to ICU 4.8 (I just helped with the dictionary, another volunteer did the code). LibreOffice 3.6 added the updated ICU code and now uses the code to line-break Khmer even if zero-width spaces have not been provided.

I have two questions: is there a way to have the LibreOffice spelling checker (Hunspell) also recognize word-breaks using the ICU break iterator for Khmer so that Cambodians no longer have to add zero-width spaces manually (as it seems to work for Thai now?)? Currently, lines without zero-width spaces are seen as one long word to the spelling checker in LibreOffice 3.6. But since the line-breaking is working, it would seem breaking words for the spelling checker should also be able to work. Should I submit a bug? How should I proceed?

Also, since many other programs do not incorporate ICU's code, is there a way to make the line breaks "real" when a document is saved in another format (such as a .doc?). And by "real" I mean that a zero-width space is actually added to the text where a line-break should be. This also would make LibreOffice a great tool for Cambodians, since most do not like to type spaces between words (since the language traditionally doesn't have spaces), but would then allow them to use their work with other programs without having to manually type spaces between words.
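A rough Python sketch of what making the breaks "real" could mean: given the offsets at which a word breaker reports boundaries, write U+200B into the exported text. The function name and the assumption that boundaries arrive as a list of string offsets are for illustration only; how an export filter would obtain those offsets is not addressed here.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def insert_zwsp(text, boundaries):
    """Return text with a ZWSP inserted at each interior boundary offset.

    Offsets are Python string indices. Offsets at the very start or end of
    the text, or next to an existing space or ZWSP, are skipped.
    """
    out = []
    prev = 0
    for pos in sorted(set(boundaries)):
        if not 0 < pos < len(text):
            continue
        if text[pos - 1] in (ZWSP, " ") or text[pos] in (ZWSP, " "):
            continue  # a break character is already present here
        out.append(text[prev:pos])
        out.append(ZWSP)
        prev = pos
    out.append(text[prev:])
    return "".join(out)
```

For example, `insert_zwsp("wordwordword", [4, 8])` puts a ZWSP after each of the first two words, and text that already contains spaces at the reported offsets is returned unchanged.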
Re: Adding Extension for Experimental Thai Spelling
On Fri, 17 Feb 2012 14:10:21 + Caolán McNamara wrote:
> On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
> Indeed, yeah, I suppose, assuming it's as complicated as "Thai", that the right direction would be for someone to write for icu new dictionary-based breakiterators for the "nod"(?) language and then the rather trivial changes to LibreOffice to know about the language in order to mark text as that language to bubble that info down to icu

Northern Thai's not quite as simple or standardised as Siamese! One can meet (at least) the following spelling systems:

1) Chiangmai phonetics
2) Chiangrai phonetics (different mapping of tones to Siamese spelling rules)
3) Transliteration from Tai Tham script (probably rare for connected text)
4) Tai Tham script

However, perhaps dictionary-based break iterators are something to be treated like dictionaries. There are several other writing systems that could probably benefit from them:

Thai script:
  Northern Thai
  NE Thai (for recording songs - use of Siamese tone rules scrambles the tonemarks compared to Siamese cognates)
Khmer script:
  Khmer - there's already a project for this set up on SourceForge.
  Pali
Tai Tham script:
  Tai Khuen
  Tai Lue
  Pali
Lao script:
  Lao
Tibetan script:
  Tibetan

I've a feeling Burmese may also have a need for dictionary based text breaking, though it's better behaved for syllable breaking than most of the others listed here. Shan would come in the same category.

The above list is not exhaustive. Tai Lue in Lao script probably belongs in the list.

Not all Thai script writing systems need a break iterator - some of the minority languages separate words with spaces, but that's partially a matter of literacy - Thais start writing Thai with interword gaps and then learn to suppress the gaps. Pali written in Thai also separates words with spaces - but Pali has some very long words!

Richard.
Re: Adding Extension for Experimental Thai Spelling
On Thu, 2012-02-16 at 23:24 +, Richard Wordingham wrote:
> I wouldn't expect a dictionary-based line breaker to handle words from other languages. (There's a whole slew of Mon-Khmer languages in Thailand, and they mostly use the Thai script when they happen to get written.)

Indeed, yeah, I suppose, assuming it's as complicated as "Thai", that the right direction would be for someone to write new dictionary-based breakiterators for icu for the "nod"(?) language, and then make the rather trivial changes to LibreOffice so that it knows about the language, in order to mark text as that language and bubble that info down to icu.

C.
Re: Adding Extension for Experimental Thai Spelling
Hi,

2012/2/17 Richard Wordingham :
> It's a vast improvement - it gives LibreOffice a real Thai spell-checker. Thank you. I have one worry for Siamese - Németh László suggested that there might be a licensing issue back in http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html .

There is no problem with the license of the ICU. I'm also very glad of the fix.

Regards, László
Re: Adding Extension for Experimental Thai Spelling
On Tue, 14 Feb 2012 16:19:17 + Caolán McNamara wrote:
> I think this change: http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12 should improve matters a lot.

It's a vast improvement - it gives LibreOffice a real Thai spell-checker. Thank you. I have one worry for Siamese - Németh László suggested that there might be a licensing issue back in http://openoffice.2283327.n4.nabble.com/Thai-line-breaking-td2791315.html . If there isn't such an issue, does this mean we can hope to see your fix in LibreOffice 3.5.1?

> Makes "กุหลาบ" get treated as a single word in the unit test there now anyway, though the Northern Thai one is still not considered a single word, that might be due to the oldish icu we're still using.

I wouldn't expect a dictionary-based line breaker to handle words from other languages. (There's a whole slew of Mon-Khmer languages in Thailand, and they mostly use the Thai script when they happen to get written.) I can work my way round the problem using the sticking plaster of ZWSP and WJ (no-break no-space), and I think some use of them or an equivalent is inevitable when the sequence of visible characters doesn't define the breaks. In particular, after gluing กุ๊หลาบ together with WJ, Hunspell offered me กุหลาบ as a correction, which is good. There may be some rough edges with ZWSP and WJ going into the dictionary (TBC), but what you've done will justify LibreOffice claiming a Thai spell checking capability.

Minority language support may not be compatible with libthai - at least one language uses a combining underline, and some of the mark combinations used for minority languages would get rejected by the WTT rules that libthai supports.

Richard.
Re: Adding Extension for Experimental Thai Spelling
Hi,

On Tuesday, 2012-02-14 16:19:17 +, Caolán McNamara wrote:
> We have some customized break iterator rules in LibreOffice, so we're using those ones and *not* the built-in icu ones. But we lack a customized Thai one, so we're using some ultra-generic word breaking stuff for Thai and not going near the special built-into-icu Thai iterator :-(

Right, I think the generic customized one dates back to a time when ICU didn't have a specialized Thai break iterator (not sure about that, but ...), so it should be good to have that switched to ICU for 'th'.

Eike

-- 
LibreOffice Calc developer. Number formatter stricken i18n transpositionizer.
GnuPG key 0x293C05FD : 997A 4C60 CE41 0149 0DB3 9E96 2F1A D073 293C 05FD
Re: Adding Extension for Experimental Thai Spelling
On Mon, 2012-02-13 at 22:39 +, Richard Wordingham wrote:
> The spell-checker seems to break up a phrase consisting of just กุหลาบ into 3 or 4 words.

Hmm, so I played around with this and here's what I think is the problem... We have some customized break iterator rules in LibreOffice, so we're using those ones and *not* the built-in icu ones. But we lack a customized Thai one, so we're using some ultra-generic word breaking stuff for Thai and not going near the special built-into-icu Thai iterator :-(

I think this change: http://cgit.freedesktop.org/libreoffice/core/commit/?id=475d0c59c66fb7752d230f76130b17145aad0c12 should improve matters a lot. Makes "กุหลาบ" get treated as a single word in the unit test there now anyway, though the Northern Thai one is still not considered a single word, that might be due to the oldish icu we're still using.

After some googling I'm unsure if the "right way to go" to further improve Thai break iterators is to simply have another go at upgrading icu to get the latest and greatest there, or for "someone" to have a go at integrating libthai into LibreOffice and hand off break iteration for Thai to that. Either way, link above and related unit test give an entry point to the relevant code.

C.
Re: Adding Extension for Experimental Thai Spelling
Thank you to everyone who's offered me advice.

On Mon, 13 Feb 2012 15:08:20 + Caolán McNamara wrote:
> I don't think we have any way to override our breakiterators from extensions.

Ah well, I'll just have to try to get Thai spell-checking working for myself and then worry about sharing my changes - assuming I succeed.

> I'd be sort of interested in confirming that what we have right now actually works correctly, in the sense that Thai text definitely *is* getting run through the special Thai-specific icu word break handler.

It's definitely going through a Siamese-specific word-breaker for line-breaking. For example the two-syllable Thai word กุหลาบ 'rose' moves to the next line, but when I convert it to the Northern Thai form กุ๊หลาบ (not the spelling I'd favour) by adding a (non-spacing) tone mark, it's promptly broken between lines along the syllable boundary, although the first syllable does not constitute a word, at least not one recorded in the Royal Institute Dictionary. I'm glad to find that inserting U+2060 WJ prevents that break.

The spell-checker seems to break up a phrase consisting of just กุหลาบ into 3 or 4 words.

Richard.
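The effect of U+2060 described above can be pictured as a filter over candidate break positions: Unicode line breaking does not allow a break immediately before or after WORD JOINER. A minimal Python sketch, where `allowed_breaks` and the list of candidate offsets are hypothetical stand-ins for whatever the break iterator proposes:

```python
WJ = "\u2060"  # WORD JOINER


def allowed_breaks(text, candidates):
    """Keep only interior offsets that do not touch a word joiner.

    An offset p means "break between text[p - 1] and text[p]".
    """
    return [
        p
        for p in candidates
        if 0 < p < len(text) and text[p - 1] != WJ and text[p] != WJ
    ]
```

So if a breaker proposes the syllable boundary inside a word and the user has placed WJ there, that candidate is dropped and the word stays on one line.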
Re: Adding Extension for Experimental Thai Spelling
On Sat, 2012-02-11 at 16:23 +, Richard Wordingham wrote:
> Is it possible to create an experimental alternative to the Thai break iterator that can be shared with other people as a LibreOffice extension?

I don't think we have any way to override our breakiterators from extensions.

FWIW, i18npool/source/breakiterator is where we have our word, character, sentence and line break iterators implemented. Typically we forward everything on to icu to do the real work, albeit with some customization of the default icu rules.

What I'd *expect* to happen is that text marked as "Thai" should end up getting broken into words by the default icu word break iterator, which at http://userguide.icu-project.org/boundaryanalysis claims "ICU provides a special dictionary-based break iterator."

So, assuming that nothing is simply broken, improving the icu Thai break iterator should improve LibreOffice "for free". I'd be sort of interested in confirming that what we have right now actually works correctly, in the sense that Thai text definitely *is* getting run through the special Thai-specific icu word break handler.

There is an i18npool/qa/cppunit/test_breakiterator.cxx which we use to make sure that some existing edge-cases continue to work. If you wanted to hack that to add some Thai word break tests that'd be helpful, and/or simply pass me on some sample text where we *are* doing the right thing and where we *aren't* and I could populate a test in there with that data and turn the problem into a developer friendly "enable this word-break unit test and make it work" problem.

C.
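For readers unfamiliar with the idea, a dictionary-based word breaker can be illustrated with a toy greedy longest-match segmenter in Python. This is not ICU's algorithm or API, just an assumed simplification with an invented word list; it does show why a word missing from the dictionary tends to come out in fragments.

```python
def segment(text, dictionary):
    """Greedy longest-match segmentation against a set of known words.

    At each position the longest dictionary word starting there is taken;
    if none matches, a single character is emitted on its own.
    """
    max_word_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # nothing matched: fall back to one character
        words.append(text[i:j])
        i = j
    return words
```

With the dictionary `{"rose", "red", "re"}`, the input "rosered" comes out as two words, while "rosezzred" has its unknown middle part split character by character. Pairs of input and expected output like these are the kind of sample data a word-break unit test needs.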
Re: Adding Extension for Experimental Thai Spelling
On Sat, 2012-02-11 at 16:23 +, Richard Wordingham wrote:
> As I understand it, the lack of a usable Thai spell-checker for LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai break iterator.

In common with many, I know nothing about Thai ;-) but my friend Tim does - quite possibly he can help you? (Or do you know each other already?)

Thanks!

Michael [ who abnormally leaves the context intact for Tim ;-]

> (I had expected Thai and Khmer to face similar problems, for neither has a visible word separator and syllable boundaries are often unclear in both.) Tagging Thai script text as Khmer does not work (at least, not in Version 3.4.5); the word boundaries are still determined by the Thai break iterator.
>
> Is it possible to create an experimental alternative to the Thai break iterator that can be shared with other people as a LibreOffice extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE (ZWSP) to separate words in the Thai script, but I suspect Thais would not. Also, I can see my first useful version fouling up the rendering of pre-existing text. I can't work out how to create a break iterator as an *extension*. Could someone please advise me how, e.g. by pointing to the documentation or an example. I can find documentation for *publishing* an extension, but that does not address *creating* an extension.
>
> Richard.

-- 
michael.me...@suse.com <><, Pseudo Engineer, itinerant idiot
Re: Adding Extension for Experimental Thai Spelling
On 11/02/12 17:23, Richard Wordingham wrote:
> As I understand it, the lack of a usable Thai spell-checker for
> LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai
> break iterator. (I had expected Thai and Khmer to face similar
> problems, for neither has a visible word separator and syllable
> boundaries are often unclear in both.) Tagging Thai script text as
> Khmer does not work (at least, not in Version 3.4.5); the word
> boundaries are still determined by the Thai break iterator.
>
> Is it possible to create an experimental alternative to the Thai
> break iterator that can be shared with other people as a LibreOffice
> extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE
> (ZWSP) to separate words in the Thai script, but I suspect Thais would
> not. Also, I can see my first useful version fouling up the
> rendering of pre-existing text. I can't work out how to create a break
> iterator as an *extension*. Could someone please advise me how, e.g. by
> pointing to the documentation or an example? I can find documentation
> for *publishing* an extension, but that does not address *creating* an
> extension.

hi Richard,

while i don't know anything about break iterators, since OOo 3.0.1 there is a new grammar checking API, which AFAIK operates on a whole paragraph at a time; perhaps that API would make implementing a spelling checker for such languages easier (if LO cannot determine the word boundaries then the checker can always do it on its own).

http://wiki.services.openoffice.org/wiki/Grammar_Checking
http://www.openoffice.org/lingucomponent/grammar.html

regards,
michael
Adding Extension for Experimental Thai Spelling
As I understand it, the lack of a usable Thai spell-checker for LibreOffice (unlike, say, a Khmer spell-checker) is due to the Thai break iterator. (I had expected Thai and Khmer to face similar problems, for neither has a visible word separator and syllable boundaries are often unclear in both.) Tagging Thai script text as Khmer does not work (at least, not in Version 3.4.5); the word boundaries are still determined by the Thai break iterator.

Is it possible to create an experimental alternative to the Thai break iterator that can be shared with other people as a LibreOffice extension? I would be prepared to routinely use U+200B ZERO WIDTH SPACE (ZWSP) to separate words in the Thai script, but I suspect Thais would not. Also, I can see my first useful version fouling up the rendering of pre-existing text.

I can't work out how to create a break iterator as an *extension*. Could someone please advise me how, e.g. by pointing to the documentation or an example? I can find documentation for *publishing* an extension, but that does not address *creating* an extension.

Richard.
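For illustration, the ZWSP-only approach described above could look roughly like the following sketch, in which every word boundary comes from an explicit U+200B (or ordinary whitespace) and no dictionary is consulted for segmentation. The function names `zwsp_words` and `misspelt` and the word list are hypothetical, and English stands in for Thai; this is not LibreOffice code.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def zwsp_words(text):
    """Yield (start, end, word) for each run delimited by ZWSP or whitespace."""
    start = None
    for i, ch in enumerate(text):
        if ch == ZWSP or ch.isspace():
            if start is not None:
                yield start, i, text[start:i]
                start = None
        elif start is None:
            start = i
    if start is not None:
        yield start, len(text), text[start:]


def misspelt(text, known_words):
    """Return the user-delimited words that are not in known_words."""
    return [(s, e, w) for s, e, w in zwsp_words(text) if w not in known_words]


# Hypothetical word list and input.
known = {"spell", "check", "this"}
sample = "spell" + ZWSP + "chekc" + ZWSP + "this"
print(misspelt(sample, known))
```

Because the offsets are reported alongside each word, a checker built this way could underline exactly the extent the user delimited, at the cost of requiring the user to mark every boundary.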