Hi Barry,

    Hello, it's those pesky Debian Lucene package maintainers again :-).
 Lucene currently builds and passes all but one unit test against
Kaffe[0] 1.1.6.  In debugging the failure of the unit test for
org.apache.analysis.ru.RussianStem, I enabled a build of the JUnit test
reports.  A detailed account is listed in Debian Bug Report #272295[1],
but in brief, the 7-character String of Cyrillic expected is matched for
the first five characters, then an issue occurs and what appears to be a
few thousand characters are spewed out and the unit test fails.  I have
a tarball of the unit test reports temporarily stored on my FTP site[2]
if anyone would care to take a look.
    Given the recent thread about UTF-8[3], I thought I would present
this to you guys to see if you might have any insight on the issue.
Thanks in advance for your time in reading this message.

Without downloading the tarball and digging into it, one bit of feedback is that Cyrillic has numerous legacy encodings. A common source of problems is that text encoded using ISO-8859-5 (for example) gets identified as KOI8-R (or vice versa), so the conversion to Unicode produces the wrong characters.
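
As a rough illustration of that kind of mix-up (Python used for brevity; the sample word is made up and is not taken from the failing test), the same bytes decode to different text depending on which legacy charset is assumed:

```python
# Hypothetical sample word; NOT the string from the RussianStem test.
word = "привет"

# Bytes as a source using ISO-8859-5 would store them.
raw = word.encode("iso8859_5")

# Decoding with the matching charset round-trips cleanly.
right = raw.decode("iso8859_5")

# Decoding the same bytes as KOI8-R succeeds without any error,
# but yields different Cyrillic letters.
wrong = raw.decode("koi8_r")

print(right == word)            # True
print(wrong == word)            # False
print(len(wrong) == len(word))  # True: same length, wrong letters
```

Since both are single-byte encodings, nothing throws here, which is part of why this sort of misidentification can go unnoticed until a comparison fails.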

As for the bug report, the HTML is tagged as UTF-8, but it looks like the text coming from the DB is using one of the legacy Cyrillic encodings. So my browser isn't very happy :)
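
A small sketch of why a page tagged as UTF-8 but carrying legacy bytes displays badly (again Python, with a made-up sample word, and KOI8-R picked only as an example since I don't know which encoding the DB actually uses):

```python
# Hypothetical sample word, stored using a legacy charset (assumed KOI8-R).
word = "привет"
legacy = word.encode("koi8_r")

# A strict UTF-8 decode rejects these bytes outright.
try:
    legacy.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decode substitutes U+FFFD REPLACEMENT CHARACTER
# for the byte sequences it cannot interpret.
shown = legacy.decode("utf-8", errors="replace")

print(strict_ok)           # False
print("\ufffd" in shown)   # True
```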

-- Ken


[0] - http://www.kaffe.org
[1] - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=272295
[2] - ftp://www.bytemason.org/lucene_reports_2005092001.tar.gz
[3] - http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200509.mbox/[EMAIL PROTECTED]


--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
