Re: [jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence

2006-02-21 Thread Richard Liang

Dear Tatyana,

As you may know, our (Harmony) implementation just wraps ICU4J's 
BreakIterator. The rules of ICU4J's BreakIterator comply with 
Unicode TR29, which differs from the rules of the RI.


This is a common issue for most of the classes in the text package. If we 
want our implementation to behave the same as the RI, we would need the 
RI's rules. However, I think those rules must be covered by some kind of 
license. So a better solution may be to wrap ICU4J's implementation for 
all text (internationalization) classes. As far as I know, ICU4J is 
specialized for i18n.
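
A minimal sketch of the wrapping approach described above, not Harmony's actual code. The class name DelegatingBreakIterator is hypothetical, and a JDK iterator stands in for the wrapped engine so that the sketch compiles without ICU4J on the classpath; in the real case the delegate would be the ICU4J iterator, which is assumed here to offer equivalent methods.

```java
import java.text.BreakIterator;
import java.text.CharacterIterator;
import java.util.Locale;

// Hypothetical adapter: exposes the java.text.BreakIterator API while
// forwarding every call to another boundary-analysis engine.
class DelegatingBreakIterator extends BreakIterator {
    // Stand-in for the wrapped engine (assumption: a JDK iterator is used
    // here only to keep the sketch free of external dependencies).
    private final BreakIterator delegate;

    DelegatingBreakIterator(BreakIterator delegate) {
        this.delegate = delegate;
    }

    @Override public int first() { return delegate.first(); }
    @Override public int last() { return delegate.last(); }
    @Override public int next(int n) { return delegate.next(n); }
    @Override public int next() { return delegate.next(); }
    @Override public int previous() { return delegate.previous(); }
    @Override public int following(int offset) { return delegate.following(offset); }
    @Override public int current() { return delegate.current(); }
    @Override public CharacterIterator getText() { return delegate.getText(); }
    @Override public void setText(CharacterIterator text) { delegate.setText(text); }

    public static void main(String[] args) {
        BreakIterator it =
            new DelegatingBreakIterator(BreakIterator.getWordInstance(Locale.US));
        it.setText("This is the test, WordInstance");
        it.first();
        // Prints the first word boundary after the start of the text.
        System.out.println(it.next());
    }
}
```

With this shape the boundary rules live entirely in the delegate, so whatever rules the wrapped engine follows are the rules callers of the wrapper observe.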


Any comments? Thanks a lot.

Please refer to ICU's homepage: http://icu.sourceforge.net/

Richard Liang
China Software Development Lab, IBM



tatyana doubtsova (JIRA) wrote:

java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of 
the sentence
---

 Key: HARMONY-62
 URL: http://issues.apache.org/jira/browse/HARMONY-62
 Project: Harmony
Type: Bug
  Components: Classlib  
Reporter: tatyana doubtsova



Problem details:
java.text.BreakIterator.getSentenceInstance().next() stops searching for the 
sentence end if a new-line character is found in the text, and returns the 
index of the last seen non white space character. According to J2SE 1.4.2, 
next() should return the boundary following the current boundary.

Code for reproducing Test.java:
import java.text.BreakIterator;

public class Test {
    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getSentenceInstance();
        it.setText("One sentence \n on two lines.");
        System.out.println(it.next());
    }
}

Steps to Reproduce:
1. Build Harmony (check-out on 2006-01-30) j2se subset as described in 
README.txt.
2. Compile Test.java using BEA 1.4 javac
 javac -d . Test.java
3. Run java using compatible VM (J9)
 java -showversion Test


Output:
java version 1.4.2 (subset)
(c) Copyright 1991, 2005 The Apache Software Foundation or its licensors, as 
applicable.
14

Output on BEA 1.4.2 to compare with:
28

Suggested junit test case:

package org.apache.harmony.tests.java.text;

import java.text.BreakIterator;
import java.util.Locale;

import junit.framework.TestCase;

public class BreakIteratorTest extends TestCase {

    public void test_next() {
        // Regression test for HARMONY-30
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
        bi.setText("This is the test, WordInstance");
        int n = bi.first();
        n = bi.next();
        assertEquals("Assert 0: next() returns incorrect value ", 4, n);

        // Regression test for the current issue
        bi = BreakIterator.getSentenceInstance();
        bi.setText("One sentence \n on two lines.");
        n = bi.next();
        assertEquals("Assert 1: next() returns incorrect value ", 28, n);
    }
}


  


Re: [jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence

2006-02-21 Thread Art - Arthit Suriyawongkul
 As you may know, our (Harmony) implementation just wraps ICU4J's
 BreakIterator. The rules of ICU4J's BreakIterator comply with
 Unicode TR29, which differs from the rules of the RI.

 This is a common issue for most of the classes in the text package. If we
 want our implementation to behave the same as the RI, we would need the
 RI's rules. However, I think those rules must be covered by some kind of
 license. So a better solution may be to wrap ICU4J's implementation for
 all text (internationalization) classes. As far as I know, ICU4J is
 specialized for i18n.

IMHO, different BreakIterator implementations do not have
to produce exactly the same result (boundary analysis).

What I mean is that their behavior should all be the same,
conforming to what is described in the Java API doc
  http://java.sun.com/j2se/1.5.0/docs/api/java/text/BreakIterator.html

 Line boundary analysis determines where ...
 Sentence boundary analysis allows ...
 Word boundary analysis is ...
 Character boundary analysis ...

But their results, the boundary analysis itself, need not be the same;
they just depend on how well each implementation performs.
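
To make that distinction concrete, here is a rough sketch of a check that looks only at behavior every implementation could share, and not at where the boundaries fall. The class and method names (BoundaryContract, satisfiesContract) are made up for illustration, and the three properties checked are my own reading of the API doc, not an official conformance test.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: checks general iterator behavior without asserting
// any particular boundary positions.
class BoundaryContract {
    // Collects every boundary the iterator reports for the given text.
    static List<Integer> boundaries(BreakIterator it, String text) {
        it.setText(text);
        List<Integer> result = new ArrayList<>();
        for (int b = it.first(); b != BreakIterator.DONE; b = it.next()) {
            result.add(b);
        }
        return result;
    }

    // True if the boundaries start at 0, end at text.length(),
    // and strictly increase in between.
    static boolean satisfiesContract(BreakIterator it, String text) {
        List<Integer> bs = boundaries(it, text);
        if (bs.isEmpty() || bs.get(0) != 0
                || bs.get(bs.size() - 1) != text.length()) {
            return false;
        }
        for (int i = 1; i < bs.size(); i++) {
            if (bs.get(i) <= bs.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "One sentence \n on two lines.";
        System.out.println(
            satisfiesContract(BreakIterator.getSentenceInstance(), text));
    }
}
```

An implementation that breaks after the newline and one that does not would both pass this check, which is the sense in which the behavior is the same while the analysis differs.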

That's my opinion.

cheers,
Art

--
:: Art / Arthit Suriyawongkul
:: Applied Computational Linguistics Lab, Uni Potsdam
:: http://www.ling.uni-potsdam.de/acl-lab/
:: http://bact.blogspot.com/

**  Impeach Thaksin   http://tuthaprajan.org


[jira] Created: (HARMONY-62) java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of the sentence

2006-01-31 Thread tatyana doubtsova (JIRA)
java.text.BreakIterator.getSentenceInstance().next() treats '\n' as the end of 
the sentence
---

 Key: HARMONY-62
 URL: http://issues.apache.org/jira/browse/HARMONY-62
 Project: Harmony
Type: Bug
  Components: Classlib  
Reporter: tatyana doubtsova


Problem details:
java.text.BreakIterator.getSentenceInstance().next() stops searching for the 
sentence end if a new-line character is found in the text, and returns the 
index of the last seen non white space character. According to J2SE 1.4.2, 
next() should return the boundary following the current boundary.

Code for reproducing Test.java:
import java.text.BreakIterator;

public class Test {
    public static void main(String[] args) {
        BreakIterator it = BreakIterator.getSentenceInstance();
        it.setText("One sentence \n on two lines.");
        System.out.println(it.next());
    }
}

Steps to Reproduce:
1. Build Harmony (check-out on 2006-01-30) j2se subset as described in 
README.txt.
2. Compile Test.java using BEA 1.4 javac
 javac -d . Test.java
3. Run java using compatible VM (J9)
 java -showversion Test

Output:
java version 1.4.2 (subset)
(c) Copyright 1991, 2005 The Apache Software Foundation or its licensors, as 
applicable.
14

Output on BEA 1.4.2 to compare with:
28
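
To see every segment rather than only the first boundary, a sketch along the following lines may help when comparing VMs. The class name SentenceSplit is hypothetical and is not part of the attached test. On this input, a single printed segment would correspond to next() returning 28, while a split right after the newline would correspond to 14.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical diagnostic helper, not part of the suggested test case.
class SentenceSplit {
    // Splits text at every boundary reported by the platform's
    // default-locale sentence iterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance();
        it.setText(text);
        List<String> parts = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        for (String s : sentences("One sentence \n on two lines.")) {
            // Escape the newline so each segment prints on one line.
            System.out.println("[" + s.replace("\n", "\\n") + "]");
        }
    }
}
```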

Suggested junit test case:

package org.apache.harmony.tests.java.text;

import java.text.BreakIterator;
import java.util.Locale;

import junit.framework.TestCase;

public class BreakIteratorTest extends TestCase {

    public void test_next() {
        // Regression test for HARMONY-30
        BreakIterator bi = BreakIterator.getWordInstance(Locale.US);
        bi.setText("This is the test, WordInstance");
        int n = bi.first();
        n = bi.next();
        assertEquals("Assert 0: next() returns incorrect value ", 4, n);

        // Regression test for the current issue
        bi = BreakIterator.getSentenceInstance();
        bi.setText("One sentence \n on two lines.");
        n = bi.next();
        assertEquals("Assert 1: next() returns incorrect value ", 28, n);
    }
}

