Hello,
I recently posted some messages about iso-8859-2 encoding problems.
While trying to solve that problem, I encoded the latin2 XML document as UTF-8
and did an AddDocument to Xindice.
The behaviour is similar: the characters which also exist in iso-8859-1
are all right, but the ones that are specific to 8859-2 are replaced by
"?". This happens in the very file where Xindice holds its database.
This is probably caused by opening a Writer somewhere in the I/O part of
Xindice (I have not yet found the code that actually does this) without
specifying an encoding, so the JVM's default encoding is used.
Since that default is often iso-8859-1 (it depends on the platform and
locale), the latin2 text is handled improperly.
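
To illustrate the symptom, here is a minimal sketch in plain Java (it is not
Xindice code, and the class and method names are made up for the example). It
assumes the text is pushed through iso-8859-1 at some point: a character that
exists in both charsets (U+00E2) survives, while one that exists only in
latin2 (U+0103) comes back as "?".

```java
import java.nio.charset.StandardCharsets;

public class Latin2LossDemo {
    // Encode to iso-8859-1 and decode again, roughly what happens when a
    // Writer is opened with a latin1 default and the file is read back.
    static String roundTripLatin1(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Same round trip through UTF-8, which can represent every character.
    static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00E2 is in latin1 and latin2; U+0103 is in latin2 only.
        String sample = "\u00e2\u0103";
        System.out.println(roundTripLatin1(sample).equals("\u00e2?")); // true
        System.out.println(roundTripUtf8(sample).equals(sample));      // true
    }
}
```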
Indeed, one workaround is to change the file.encoding property for Java. For
instance, if I call java this way:
java -Dfile.encoding=utf-8
the problem disappears: the latin2 text is stored as UTF-8 in the Xindice db,
which is OK for me.
I wonder whether it would be more proper to let the user choose the encoding
in which the text will be stored, and do something like:
Writer writer = new ...Writer(outputStream, "my-encoding-here")
in the I/O code of XIndice.
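
As a rough sketch of that idea (not the actual Xindice I/O code, which I have
not located): the example below assumes java.io.OutputStreamWriter as the
writer class, and the class name ExplicitEncodingWriter and its write method
are hypothetical. The point is only that the charset is passed explicitly
instead of being taken from file.encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;

public class ExplicitEncodingWriter {
    // Write text with a caller-chosen encoding; the JVM default is never used.
    static byte[] write(String text, String encodingName) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, Charset.forName(encodingName))) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // U+0103 is a latin2-only character; in UTF-8 it takes two bytes.
        byte[] utf8 = write("\u0103", "UTF-8");
        System.out.println(utf8.length); // 2
    }
}
```

With this shape the stored bytes depend only on the encoding argument, so the
result is the same whatever -Dfile.encoding the JVM was started with.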
Or, even better, look at the <?xml version="1.0" encoding="my-encoding-here" ?>
declaration and use the given encoding when storing the document into Xindice.
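
One possible way to read the declared encoding, sketched with the standard
StAX API (javax.xml.stream). This is only an illustration, not how Xindice
parses documents; the class name DeclaredEncoding and the fallback to UTF-8
when no encoding is declared are my assumptions.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class DeclaredEncoding {
    // Return the encoding named in the XML declaration, or UTF-8 if none.
    static String declaredEncoding(byte[] xmlBytes) {
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new ByteArrayInputStream(xmlBytes));
            try {
                // Null when the declaration has no encoding pseudo-attribute.
                String declared = reader.getCharacterEncodingScheme();
                return declared != null ? declared : "UTF-8";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\"?><doc/>";
        // The sample is pure ASCII, so these bytes are also valid ISO-8859-2.
        System.out.println(declaredEncoding(xml.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

The returned name could then be handed to the Writer that stores the
document, so that the stored bytes match what the declaration claims.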
Otherwise, most users will end up using, without knowing it, the default
encodings of their machines.
Best regards,
Adrian.