Re: [jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence
Dear Tatyana,

As you may know, our (Harmony) implementation just wraps ICU4J's BreakIterator. ICU4J's BreakIterator rules comply with Unicode TR29, which differs from the rules of the RI. This is a common issue for most of the classes in java.text. If we want our implementation to behave the same as the RI, we would need the RI's rules. However, I think those rules must be covered by some kind of license. So a better solution may be to wrap ICU4J's implementation for all text (internationalization) classes. As far as I know, ICU4J specializes in i18n.

Any comments? Thanks a lot.

Please refer to ICU's homepage: http://icu.sourceforge.net/

Richard Liang
China Software Development Lab, IBM

tatyana doubtsova (JIRA) wrote:

> java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence
> ---
> Key: HARMONY-62
> URL: http://issues.apache.org/jira/browse/HARMONY-62
> Project: Harmony
> Type: Bug
> Components: Classlib
> Reporter: tatyana doubtsova
>
> Problem details:
> java.text.BreakIterator.getSentenceInstance().next() stops searching for the sentence end when it finds a new-line character in the text, and returns the index of the last seen non-whitespace character. According to J2SE 1.4.2, next() should return the boundary following the current boundary.
>
> Code for reproducing, Test.java:
>
>     import java.text.BreakIterator;
>
>     public class Test {
>         public static void main(String[] args) {
>             BreakIterator it = BreakIterator.getSentenceInstance();
>             it.setText("One sentence \n on two lines.");
>             System.out.println(it.next());
>         }
>     }
>
> Steps to Reproduce:
> 1. Build Harmony (check-out on 2006-01-30) j2se subset as described in README.txt.
> 2. Compile Test.java using BEA 1.4 javac:
>        javac -d . Test.java
> 3. Run java using a compatible VM (J9):
>        java -showversion Test
>
> Output:
>     java version 1.4.2 (subset)
>     (c) Copyright 1991, 2005 The Apache Software Foundation or its licensors, as applicable.
>     14
>
> Output on BEA 1.4.2 to compare with:
>     28
>
> Suggested junit test case:
>
>     package org.apache.harmony.tests.java.text;
>
>     import java.text.BreakIterator;
>     import java.util.Locale;
>
>     import junit.framework.TestCase;
>
>     public class BreakIteratorTest extends TestCase {
>         public void test_next() {
>             // Regression test for HARMONY-30
>             BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
>             bi.setText("This is the test, WordInstance");
>             int n = bi.first();
>             n = bi.next();
>             assertEquals("Assert 0: next() returns incorrect value ", 4, n);
>
>             // Regression test for the current issue
>             bi = BreakIterator.getSentenceInstance();
>             bi.setText("One sentence \n on two lines.");
>             n = bi.next();
>             assertEquals("Assert 1: next() returns incorrect value ", 28, n);
>         }
>     }
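For readers unfamiliar with the wrapping approach Richard describes, here is a minimal sketch of a delegating BreakIterator. It is not Harmony's actual code: the class name DelegatingBreakIterator is hypothetical, and a JDK java.text.BreakIterator stands in for ICU4J's com.ibm.icu.text.BreakIterator so that the example runs without extra jars. The only point it illustrates is that a wrapper inherits whatever break rules the wrapped iterator uses, so its results can differ from the RI whenever those rules differ.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

/**
 * Hypothetical wrapper, for illustration only. Every call is forwarded to
 * the wrapped iterator, so the break positions are decided entirely by the
 * delegate's rules. In Harmony the delegate would be an ICU4J iterator;
 * a JDK iterator is used here as a stand-in (an assumption).
 */
final class DelegatingBreakIterator extends BreakIterator {
    private BreakIterator delegate;

    DelegatingBreakIterator(BreakIterator delegate) {
        this.delegate = delegate;
    }

    @Override public int first() { return delegate.first(); }
    @Override public int last() { return delegate.last(); }
    @Override public int next(int n) { return delegate.next(n); }
    @Override public int next() { return delegate.next(); }
    @Override public int previous() { return delegate.previous(); }
    @Override public int following(int offset) { return delegate.following(offset); }
    @Override public int current() { return delegate.current(); }
    @Override public CharacterIterator getText() { return delegate.getText(); }

    // setText(String) in the base class ends up here as well.
    @Override public void setText(CharacterIterator newText) { delegate.setText(newText); }

    // Clone the delegate too, so the copy keeps its own iteration state.
    @Override public Object clone() {
        DelegatingBreakIterator copy = (DelegatingBreakIterator) super.clone();
        copy.delegate = (BreakIterator) delegate.clone();
        return copy;
    }

    public static void main(String[] args) {
        BreakIterator it =
            new DelegatingBreakIterator(BreakIterator.getSentenceInstance(Locale.US));
        it.setText("One sentence \n on two lines.");
        // The printed value depends on the delegate's sentence rules.
        System.out.println(it.next());
    }
}
```

Swapping the delegate is the only change needed to compare two rule sets against the same text, which is why the wrapper itself cannot be made to match the RI without changing the underlying rules.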
Re: [jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence
> As you may know, our (Harmony) implementation just wraps ICU4J's BreakIterator. And the rules of ICU4J's BreakIterator is compliant with Unicode TR29 which is different with the rules of RI. This is a common issue for most of the classes in text. If we want implementation to have the same behavior as RI, we should get the rules of RI. However, I think the rules must be controlled by some kinds of license. So a better solution may be wrapping icu4j's implementation for all text (internationalization) classes. As I know, ICU4J is special for i18n.

Imho, I don't think that different BreakIterator implementations have to produce exactly the same result (boundary analysis).

What I mean is that their behavior should all be the same and conform to what is described in the Java API doc:
http://java.sun.com/j2se/1.5.0/docs/api/java/text/BreakIterator.html

    Line boundary analysis determines where ...
    Sentence boundary analysis allows ...
    Word boundary analysis is ...
    Character boundary analysis ...

But their result, the boundary analysis itself, need not be the same; it just depends on how well each implementation performs.

That's my opinion.

cheers,
Art

--
:: Art / Arthit Suriyawongkul
:: Applied Computational Linguistics Lab, Uni Potsdam
:: http://www.ling.uni-potsdam.de/acl-lab/
:: http://bact.blogspot.com/
** Impeach Thaksin http://tuthaprajan.org
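One way to read Art's distinction between behavior and result is as a test that checks only properties any implementation would be expected to share, and leaves the exact break positions open. The sketch below is an illustration under that reading, not a proposed Harmony test: the class BoundaryContract and its method names are hypothetical, and the chosen properties (iteration starts at 0, ends at the text length, and strictly increases) are an assumption about what counts as shared behavior.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: checks shared behavior, not specific break positions. */
final class BoundaryContract {

    /** Collects every boundary from first() until next() returns DONE. */
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    /**
     * True if the boundaries start at 0, end at text.length(), and strictly
     * increase. Where the intermediate breaks fall is deliberately not checked.
     */
    static boolean satisfiesContract(BreakIterator it, String text) {
        List<Integer> b = boundaries(it, text);
        if (b.isEmpty() || b.get(0) != 0 || b.get(b.size() - 1) != text.length()) {
            return false;
        }
        for (int i = 1; i < b.size(); i++) {
            if (b.get(i) <= b.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "One sentence \n on two lines.";
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        // The list of boundaries may differ between implementations;
        // the contract check is meant to pass on all of them.
        System.out.println(boundaries(it, text));
        System.out.println(satisfiesContract(it, text));
    }
}
```

A test like this would pass for both the 14 and the 28 reported in HARMONY-62, which is the crux of the disagreement: whether the regression test should pin the exact offset or only the shared behavior.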
[jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence
java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence
---
Key: HARMONY-62
URL: http://issues.apache.org/jira/browse/HARMONY-62
Project: Harmony
Type: Bug
Components: Classlib
Reporter: tatyana doubtsova

Problem details:
java.text.BreakIterator.getSentenceInstance().next() stops searching for the sentence end when it finds a new-line character in the text, and returns the index of the last seen non-whitespace character. According to J2SE 1.4.2, next() should return the boundary following the current boundary.

Code for reproducing, Test.java:

    import java.text.BreakIterator;

    public class Test {
        public static void main(String[] args) {
            BreakIterator it = BreakIterator.getSentenceInstance();
            it.setText("One sentence \n on two lines.");
            System.out.println(it.next());
        }
    }

Steps to Reproduce:
1. Build Harmony (check-out on 2006-01-30) j2se subset as described in README.txt.
2. Compile Test.java using BEA 1.4 javac:
       javac -d . Test.java
3. Run java using a compatible VM (J9):
       java -showversion Test

Output:
    java version 1.4.2 (subset)
    (c) Copyright 1991, 2005 The Apache Software Foundation or its licensors, as applicable.
    14

Output on BEA 1.4.2 to compare with:
    28

Suggested junit test case:

    package org.apache.harmony.tests.java.text;

    import java.text.BreakIterator;
    import java.util.Locale;

    import junit.framework.TestCase;

    public class BreakIteratorTest extends TestCase {
        public void test_next() {
            // Regression test for HARMONY-30
            BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
            bi.setText("This is the test, WordInstance");
            int n = bi.first();
            n = bi.next();
            assertEquals("Assert 0: next() returns incorrect value ", 4, n);

            // Regression test for the current issue
            bi = BreakIterator.getSentenceInstance();
            bi.setText("One sentence \n on two lines.");
            n = bi.next();
            assertEquals("Assert 1: next() returns incorrect value ", 28, n);
        }
    }

--
This message is automatically generated by JIRA.