Barry Hawkins wrote:
Guys,
    Hello, it's those pesky Debian Lucene package maintainers again :-).
 Lucene currently builds and passes all but one unit test against
Kaffe[0] 1.1.6.  In debugging the failure of the unit test for
org.apache.analysis.ru.RussianStem, I enabled a build of the JUnit test
reports.  A detailed account is listed in Debian Bug Report #272295[1],
but in brief, the 7-character String of Cyrillic expected is matched for
the first five characters, then an issue occurs and what appears to be a
few thousand characters are spewed out and the unit test fails.  I have
a tarball of the unit test reports temporarily stored on my FTP site[2]
if anyone would care to take a look.
    Given the recent thread about UTF-8[3], I thought I would present
this to you guys to see if you might have any insight on the issue.
Thanks in advance for your time in reading this message.

[0] - http://www.kaffe.org
[1] - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=272295
[2] - ftp://www.bytemason.org/lucene_reports_2005092001.tar.gz

From the HTML failure report for org.apache.lucene.analysis.ru.TestRussianStem in [2] (spaces added to align the information, and &lt; and &gt; shown as '<' and '>', respectively):

unicode expected:
  < &#1073; &#1077; &#1079; &#1076; &#1086; &#1084;  &#1085; >

but was:
  < &#1073; &#1077; &#1079; &#1076; &#1086; &#15364; &#15620; ...

Rewritten in hex notation:

unicode expected:
  < &#x431; &#x435; &#x437; &#x434; &#x43E; &#x43C;  &#x43D; >

but was:
  < &#x431; &#x435; &#x437; &#x434; &#x43E; &#x3C04; &#x3D04; ...

So it appears that the last two of the seven characters have their bytes swapped: 3C04 instead of 043C, and 3D04 instead of 043D.
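
For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The helper name swap_bytes16 is made up for illustration and is not part of Lucene or Kaffe. It only confirms that exchanging the two bytes of each observed 16-bit value gives the expected code point; it says nothing about where in the code the swap happens.

```python
def swap_bytes16(unit: int) -> int:
    """Swap the high and low bytes of one 16-bit code unit."""
    return ((unit & 0xFF) << 8) | ((unit >> 8) & 0xFF)

# Decimal values of the last two code units in the "but was" output.
observed = [15364, 15620]

# Decimal values of the last two expected characters.
expected = [1084, 1085]

observed_hex = [f"{u:04X}" for u in observed]   # hex form of the observed units
swapped = [swap_bytes16(u) for u in observed]   # observed units with bytes exchanged

print(observed_hex)
print(swapped == expected)
```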

The next several characters in the output after the expected seven characters are:

 &#12292; &#3328;  &#2560;  &#12548; &#13572; ...

Rewritten as hex:

 &#x3004; &#x0D00; &#x0A00; &#x3104; &#x3504; ...

Byte swapped:

 &#x430;  &#xD;    &#xA;    &#x431;  &#x435;  ...

Transliterated into the Latin alphabet, this is "a\r\nbe", where "\r" and "\n" are carriage return and line feed, respectively, and the "b" stands for the Cyrillic letter that sounds like English "b".
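
The trailing values can be checked the same way by treating them as raw bytes and decoding them with the opposite byte order. This is a rough sketch in Python, assuming the extra output really is UTF-16 text whose bytes were exchanged; the variable names are illustrative only.

```python
# Decimal values of the five code units that follow the expected seven.
observed = [12292, 3328, 2560, 12548, 13572]

# Lay each 16-bit value out as two bytes, high byte first (big-endian).
raw = b"".join(u.to_bytes(2, "big") for u in observed)

# Reading the same bytes as little-endian UTF-16 swaps each pair.
text = raw.decode("utf-16-le")

print([f"{ord(ch):04X}" for ch in text])
print(repr(text))
```

If the byte-swap reading is right, the decoded string should be the Cyrillic letters with a CR/LF pair between them, which would be consistent with line-oriented text such as a word list.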

So it looks to me like the data following the expected output is very likely intelligible data that has simply been byte swapped.

I hope this helps -- I haven't got the time to investigate the code to connect this evidence to it.

Steve Rowe
