Thanks Martin,

> 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.

Yes, I think you are right. If a ZWSP or ZWNBSP is detected then ICU break
iteration should be disabled for the whole sentence.

> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.

I am only limiting it to Khmer because that is my expertise and I didn't
want to cause problems for other languages - but it is possible these
changes would be beneficial for other languages that are not broken by
spaces (like Thai).

Thanks,
Nathan

On Thu, Sep 27, 2012 at 11:45 AM, Martin Hosken <martin_hos...@sil.org> wrote:

> Dear Nathan,
>
> > Here are some new ideas, ordered by desirability, with number one being
> > the most desired, to number three being the least.
> >
> > 1) When a zero-width space is detected (U+200B), shut off ICU
> > breakiterator for Khmer spell checking for characters following the
> > zero-width space until it encounters a real space (U+0020) or end of
> > sentence (detect end of sentence using ICU Sentence Boundary).
>
> I think this is a good direction to head. I have two follow-on comments:
>
> 1. If you are shutting off the ICU breakiterator for text following, we
> should probably also do it for text preceding. Thus if there is a ZWSP or
> ZWNBSP (U+2060 WJ) anywhere in a text then ICU break iteration is disabled
> for the whole sentence.
>
> 2. Why limit this to Khmer? I suspect as a model it should work for any
> non-space broken text.
>
> Yours,
> Martin
>
> > 2) Disable use of ICU breakiterator for Khmer spell checking by default,
> > but allow users to enable it by adding a check-box to enable ICU
> > breakiterator in the Tools > Options > Language Settings > Writing Aids >
> > Options dialogue when a Khmer Hunspell dictionary is present (
> > http://extensions.libreoffice.org/extension-center/khmer-spelling-checker-sbbic-version
> > ).
> >
> > 3) Disable use of ICU breakiterator for Khmer spell checking until the
> > ICU breakiterator for Khmer is more accurate.
> >
> > Currently, with the ICU breakiterator for Khmer enabled in LibreOffice
> > 3.6 it causes a lot of spelling errors to go unnoticed since the ICU
> > breakiterator breaks words up incorrectly. So hopefully we can find a
> > solution that will work with the current ICU breakiterator - though with
> > ICU 50.1 the breakiterator for Khmer will have some improvements. But I
> > do feel if solution 1 or 2 (or if someone else has better ideas) cannot
> > be implemented the breakiterator for spelling with Khmer should be turned
> > off in LibreOffice until the ICU breakiterator for Khmer is more
> > accurate.
> >
> > Thanks again for your help and time, your input is greatly appreciated!
> >
> > Sincerely,
> >
> > Nathan
> >
> > On Thu, Jul 26, 2012 at 4:33 PM, Martin Hosken <martin_hos...@sil.org> wrote:
> >
> > > Dear All,
> > >
> > > > > An automatic word and line breaker is very necessary for Khmer and
> > > > > Thai because traditionally they have no spaces between words, and
> > > > > so line-breaking and spell checking require the use of a zero-width
> > > > > space between words which is counterintuitive for most native
> > > > > speakers, and so spell checking goes widely unused.
> > >
> > > I agree that automatic word breaking is a good thing and I am relieved
> > > to see that LibreOffice does it based on language selection and not on
> > > automatic language guessing based on scripts. There are more languages
> > > that use Thai script and Khmer script than just Thai and Khmer. So one
> > > of my fears is already alleviated :)
> > >
> > > > > But now with the ICU code you implemented, Thai and Khmer can be
> > > > > automatically broken, and the results are quite good. But with its
> > > > > implementation in the real world, I have found some issues that I
> > > > > wanted to raise and also suggest possible solutions. I write this
> > > > > as an end-user, not so much as a programmer, nor do I claim to
> > > > > fully understand the inner-workings of ICU and LibreOffice (because
> > > > > I don't!).
> > > > >
> > > > > First, I will do my best to explain the current results of the ICU
> > > > > break iterator with Khmer:
> > > > >
> > > > > Input sentence: មានប្រាជ្ញាឈ្លាសវៃឈ្មោះសិវកឥវលិយៈ
> > > > >
> > > > > Current ICU line-breaking: មាន|ប្រាជ្ញាឈ្លាស|វៃ|ឈ្មោះ|សិវ|កឥ|វលិ|យៈ
> > > > >
> > > > > Compared with the sentence manually broken:
> > > > > មាន|ប្រាជ្ញា|ឈ្លាសវៃ|ឈ្មោះ|សិវកឥវលិយៈ
> > > > >
> > > > > The differences should be clear – the ICU break iterator does not
> > > > > break the words with 100% accuracy.
> > > > >
> > > > > One possible solution to this issue is by how the ICU Break
> > > > > Iterator interacts with zero-width spaces (U+200B) in LibreOffice.
> > > > > Before ICU code was enabled to automatically break Khmer, if an
> > > > > end-user wanted to spell check Khmer, they had to manually place
> > > > > U+200B characters to separate words. This solution worked quite
> > > > > well, but was counterintuitive to most native speakers, because
> > > > > Khmer has no spaces (as stated before).
> > > > > But with this solution, an end-user could be sure that their
> > > > > document was broken with 100% accuracy, if there was no human error
> > > > > (something automatic solutions cannot do – it is more along the
> > > > > lines of 80% accurate). What I propose, is that the break iterator
> > > > > code in LibreOffice looks for U+200B characters in a given string
> > > > > and considers them as a sign to NOT automatically break, but to
> > > > > allow the end-user full control to manually break words. Let me
> > > > > explain:
> > > > >
> > > > > 1. The code starts processing the text and automatically breaking
> > > > >    it until it comes across a U+200B character. If one is found,
> > > > >    it searches to see if there are any additional U+200B or
> > > > >    U+0020 characters in the following 20 characters (or so), and
> > > > >    if there are, the break iterator skips over those characters
> > > > >    and starts again from the second U+200B character (or U+0020,
> > > > >    but a U+0020 character would only signify the “close” of the
> > > > >    manual break because sometimes a phrase will end and there
> > > > >    will be an actual space – so if the word that the user wants
> > > > >    to manually break has a “real” U+0020 space at the end of it,
> > > > >    then the user does not need to put an additional U+200B
> > > > >    character to close it) which then repeats, looking for U+200B
> > > > >    characters etc.
> > > > >
> > > > > 2. This would allow end-users to choose to manually break their
> > > > >    whole document so they can have precise control, as well as
> > > > >    allow end-users to place U+200B characters around names of
> > > > >    people, places or transliterations in order to tell the break
> > > > >    iterator to not try to break those words.
> > >
> > > In principle I like this approach. I like the idea of being able to
> > > force breaks and non-breaks. But I don't think we are quite there with
> > > this solution yet.
> > > Here are my difficulties with it:
> > >
> > > 1. Use of U+2060 makes string searching and spell checking harder
> > > (unless WJ chars are stripped for searching and spell checking). They
> > > are not part of the spelling of a word, so their introduction in the
> > > underlying text stream is problematic for other text processing
> > > processes (like searching as mentioned). This is less of an issue for
> > > U+200B ZWSP because that occurs between words and searching across word
> > > boundaries is a rarer activity. Likewise spell checking across word
> > > boundaries isn't really needed.
> > >
> > > 2. How do we come up with the range of what is considered a word
> > > between two zwsp chars as opposed to two words? How close to the end of
> > > a string must a zwsp occur to disable all breaking before the end of
> > > the string? Does "abcdef<zwsp>uvwxyz" block all breaks in the string? I
> > > think we need to think harder (deeper) about the use of zwsp in this
> > > way and see if we can come up with something with a little less
> > > ambiguity. Having said that, I think we are going to have to think
> > > really hard, because I don't think this is an easy problem.
> > >
> > > > > 4. I then notice that "ម្នាក់ទៀត" line breaks together (since the
> > > > >    automatic line-breaking breaks them as one word). And I decide
> > > > >    I would rather line-break after “ម្នាក់” rather than have both
> > > > >    words break connected to each other, so I place a zero-width
> > > > >    space between the words:
> > > > >    មានប្រាជ្ញាឈ្លាសវៃឈ្មោះ<zw>សិវកឥវលិយៈ<sp>អ្នកប្រាជ្ញ
> > > > >    ម្នាក់<zw>ទៀតដែលល្បីល្បាញជាងគេ
> > > > >    The automatic break iterator comes to the zero-width space and
> > > > >    then stops automatically breaking and looks ahead to see if
> > > > >    there is a zero-width space or a “real” space within 20
> > > > >    characters (this number might need refining, but I think 20
> > > > >    characters would be enough).
> > > > >    As there are no zero-width or “real” spaces within 20
> > > > >    characters, the break iterator then goes back to the previous
> > > > >    zero-width and starts breaking starting from the zero-width
> > > > >    character.
> > >
> > > Now what happens if I want to put zw around a word that occurs < 20
> > > chars after my last zw? The on/off nature of the zw has now been
> > > inverted. One option is to say that zw must always occur in pairs and
> > > you would have to bracket your first or second word there. But then
> > > management of which zw is on and which is off will get confusing for
> > > users.
> > >
> > > An alternative model is to weight breakpoints. An explicit breakpoint
> > > weighs more highly than an automatically generated one. Then when it
> > > comes to line breaking the weight of a breakpoint counts towards its
> > > choice as to the actual break. For example if we say an explicit break
> > > is 2 and an automatic is 1. Then we might use a square rule for
> > > distance and say: an explicit break is preferred if it occurs closer to
> > > the end of a line than 4x the distance to the last automatic break on
> > > the line. Or somesuch.
> > >
> > > Yours,
> > > Martin
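
For reference, the weighted-breakpoint idea at the end of Martin's July
message can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the names Breakpoint and
choose_break, the weights 2 and 1, and the reading of the "square rule" as
"distance to the end of the line divided by weight squared" are all made up
for the example. None of it is LibreOffice or ICU code.

```python
from dataclasses import dataclass

# Hypothetical weights taken from the example in the message:
# an explicit (user-entered ZWSP) break is 2, an automatic break is 1.
EXPLICIT_WEIGHT = 2
AUTOMATIC_WEIGHT = 1


@dataclass(frozen=True)
class Breakpoint:
    offset: int  # character offset of the break opportunity in the line
    weight: int  # EXPLICIT_WEIGHT or AUTOMATIC_WEIGHT


def choose_break(breakpoints, line_limit):
    """Return the offset to break at, or None if nothing fits.

    Assumed cost: (distance from the break to the end of the line)
    divided by weight squared. With weights 2 and 1 an explicit break
    wins when it is closer to the end of the line than 4x the distance
    of the competing automatic break.
    """
    candidates = [b for b in breakpoints if 0 < b.offset <= line_limit]
    if not candidates:
        return None

    def key(b):
        distance = line_limit - b.offset
        # On equal cost, fall back to the break nearest the line end.
        return (distance / (b.weight ** 2), distance)

    return min(candidates, key=key).offset
```

With a 20-character line, an automatic break at offset 18 (distance 2)
loses to an explicit break at offset 13 (distance 7, less than 4 x 2) but
wins against an explicit break at offset 12 (distance 8, not less than
4 x 2).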
_______________________________________________
LibreOffice mailing list
LibreOffice@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/libreoffice