Art,

(Found your note languishing in my reader -- sorry it took so long to reply.)

While the wording of the Java spec may allow us to vary the behavior of
the break iterator, that will be cold comfort to any apps we disrupt if
the results are significantly different.

As Richard points out, we are relying upon the fine folk in the ICU
project to implement the break algorithms, and I know that they have done
a lot of work to conform to the latest Unicode specs. However, where
there is a significant difference (and this bug report may well be one of
those cases) I believe we should tune ICU's default break iterator with
some custom rules to better match the reference implementation behavior.

Do you have any examples of applications that lay out text which we could
use as test cases?

Regards,
Tim

Art - Arthit Suriyawongkul wrote:
>> As you may know, our (Harmony) implementation just wraps ICU4J's
>> BreakIterator. And the rules of ICU4J's BreakIterator is compliant with
>> Unicode TR29 which is different with the rules of RI.
>>
>> This is a common issue for most of the classes in "text". If we want
>> implementation to have the same behavior as RI, we should get the rules
>> of RI. However, I think the rules must be controlled by some kinds of
>> license. So a better solution may be wrapping icu4j's implementation for
>> all text (internationalization) classes. As I know, ICU4J is special for
>> i18n.
>
> Imho, I don't think that different BreakIterator implementations have
> to produce exactly the result ("boundary analysis").
>
> What I meant is, the "Behavior" of them should be all the same,
> conform to what described in the Java API doc
> http://java.sun.com/j2se/1.5.0/docs/api/java/text/BreakIterator.html
>
> Line boundary analysis determines where ...
> Sentence boundary analysis allows ...
> Word boundary analysis is ...
> Character boundary analysis ...
>
> But their result, the "Boundary Analysis", need not to be the same,
> just depends on how good each implementation could perform.
>
> That's my opinion.
>
> cheers,
> Art
>
> --
> :: Art / Arthit Suriyawongkul
> :: Applied Computational Linguistics Lab, Uni Potsdam
> :: http://www.ling.uni-potsdam.de/acl-lab/
> :: http://bact.blogspot.com/
>
> ** Impeach Thaksin http://tuthaprajan.org

--
Tim Ellison ([EMAIL PROTECTED])
IBM Java technology centre, UK.