Re: UTF-8 and unit test failure for org.apache.analysis.ru.RussianStem in build with Kaffe

Steven Rowe Thu, 22 Sep 2005 09:06:11 -0700

Barry Hawkins wrote:

Guys,
    Hello, it's those pesky Debian Lucene package maintainers again :-).
 Lucene currently builds and passes all but one unit test against
Kaffe[0] 1.1.6.  In debugging the failure of the unit test for
org.apache.analysis.ru.RussianStem, I enabled a build of the JUnit test
reports.  A detailed account is listed in Debian Bug Report #272295[1],
but in brief, the 7-character String of Cyrillic expected is matched for
the first five characters, then an issue occurs and what appears to be a
few thousand characters are spewed out and the unit test fails.  I have
a tarball of the unit test reports temporarily stored on my FTP site[2]
if anyone would care to take a look.
    Given the recent thread about UTF-8[3], I thought I would present
this to you guys to see if you might have any insight on the issue.
Thanks in advance for your time in reading this message.


[0] - http://www.kaffe.org
[1] - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=272295
[2] - ftp://www.bytemason.org/lucene_reports_2005092001.tar.gz

From the HTML failure report for org.apache.lucene.analysis.ru.TestRussianStem in [2] (spaces added to align information, and the HTML entities for < and > shown as '<' and '>', resp.):


unicode expected:
  < &#1073; &#1077; &#1079; &#1076; &#1086; &#1084;  &#1085; >

but was:
  < &#1073; &#1077; &#1079; &#1076; &#1086; &#15364; &#15620; ...

Rewritten in hex notation:

unicode expected:
  < &#x431; &#x435; &#x437; &#x434; &#x43E; &#x43C;  &#x43D; >

but was:
  < &#x431; &#x435; &#x437; &#x434; &#x43E; &#x3C04; &#x3D04; ...
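The decimal-to-hex rewrite above can be checked mechanically. A minimal Python sketch, using only the values quoted from the report (the variable and function names are illustrative, not from the Lucene or Kaffe code):

```python
# Decimal character references copied from the failure report above.
expected_dec = [1073, 1077, 1079, 1076, 1086, 1084, 1085]
actual_dec = [1073, 1077, 1079, 1076, 1086, 15364, 15620]


def to_hex(values):
    """Format each integer as uppercase hex without a prefix."""
    return [format(v, "X") for v in values]


expected_hex = to_hex(expected_dec)
actual_hex = to_hex(actual_dec)

print(expected_hex)
print(actual_hex)
```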

So, it appears to be the case that the last two of the seven characters have their byte order reversed: 3C04 versus 043C, and 3D04 versus 043D.
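The byte-order claim is easy to verify by swapping the high and low bytes of each 16-bit value. A small Python sketch (the helper name swap_bytes is made up for illustration):

```python
def swap_bytes(code_unit):
    """Swap the high and low bytes of a 16-bit value."""
    return ((code_unit & 0xFF) << 8) | (code_unit >> 8)


# The two mismatched values from the "but was" output.
observed = [0x3C04, 0x3D04]
swapped = [swap_bytes(u) for u in observed]

# Interpret the swapped values as Unicode code points.
swapped_text = "".join(chr(u) for u in swapped)

print([format(u, "04X") for u in swapped])
```

After the swap, the two values are exactly the sixth and seventh expected characters.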

The next several characters in the output after the expected seven characters are:


 &#12292; &#3328;  &#2560;  &#12548; &#13572; ...

Rewritten as hex:

 &#x3004; &#x0D00; &#x0A00; &#x3104; &#x3504; ...

Byte swapped:

 &#x430;  &#xD;    &#xA;    &#x431;  &#x435;  ...

Transliterated into the Latin-1 alphabet, this is "a\r\nbe", where "\r" and "\n" are carriage return and newline, resp., and the "b" is the Cyrillic character that sounds like English "b".
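The same steps (decimal values, byte swap, read as characters) can be reproduced in a few lines of Python. This is only a sketch over the five values quoted above; the names are illustrative:

```python
# Decimal values of the first five characters following the expected output.
trailing_dec = [12292, 3328, 2560, 12548, 13572]

# Swap the two bytes of each 16-bit value.
trailing_swapped = [((u & 0xFF) << 8) | (u >> 8) for u in trailing_dec]

# Read the swapped values as Unicode code points.
trailing_text = "".join(chr(u) for u in trailing_swapped)

print([format(u, "X") for u in trailing_swapped])
print(repr(trailing_text))
```

The result is Cyrillic "a", carriage return, newline, Cyrillic "b", Cyrillic "e", matching the transliteration given above.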

So, it looks to me like the data following the expected output is extremely likely to be some form of intelligible data, which has simply been byte swapped.
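For illustration only: one way byte-swapped 16-bit characters can arise is when UTF-16 data written in one byte order is read back in the other. That is an assumption on my part, not something established from the Lucene or Kaffe code, but a Python sketch under that assumption reproduces the observed values for the last two expected characters:

```python
# The seven expected characters, written as escapes (U+0431 ... U+043D).
expected = "\u0431\u0435\u0437\u0434\u043e\u043c\u043d"

# Take the last two characters, which are the ones that came out wrong.
tail = expected[5:]

# ASSUMPTION: bytes are produced little-endian but decoded big-endian.
raw = tail.encode("utf-16-le")
misread = raw.decode("utf-16-be")
misread_units = [ord(c) for c in misread]

print([format(u, "X") for u in misread_units])
```

Whether anything like this actually happens in the failing build would still need to be confirmed against the code.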

I hope this helps -- I haven't got the time to investigate the code to connect this evidence to it.


Steve Rowe
