Re: [bug] encoding problems

Stefano Mazzocchi 26 Feb 2002 19:51:59 -0000

James Bates wrote:
> 
> Stefano: this is the kind of thing I address in the patch I "posted zipped" 
> to this list, and subsequently made available as a patch.


Sorry, I really thought they were already applied.
 
> As far as I know for the moment there are three issues relating to Xindice (I 
> too use Cocoon with Xindice, so I've seen exactly your problems too, in 
> French ;) :
> 
> 1) In the core database, documents are stored as Latin-1. The Java API 
> transforms java Strings (all unicode) into Latin-1 bytes and saves these into 
> the database files. This means that all Latin-1 (iso-8859-1) texts will be 
> stored correctly if the Java API is used. Any non-Latin1 characters (Polish, 
> Greek, Russian, ...) are stored as '?'.
> 
> My patch attempts to address this by storing all characters internally by 
> their UTF-8 representation rather than their Latin-1 representation, but it's 
> not perfect yet...

What's the problem with it? Maybe I can help, since the XML compilation
classes that I wrote for Cocoon are already UTF-8 based.
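
To make the difference in point 1 concrete, here is a minimal Java sketch
of what happens when text is turned into bytes and read back. The class
and method names are made up for illustration; this is not Xindice code,
only the standard String/Charset behaviour it would be subject to.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class StorageEncodingSketch {
    // Encode the text to bytes (as a store would) and decode it again.
    static String roundTrip(String text, Charset charset) {
        byte[] stored = text.getBytes(charset);
        return new String(stored, charset);
    }

    public static void main(String[] args) {
        String italian = "perch\u00e9";          // fits in Latin-1
        String polish = "\u0141\u00f3d\u017a";   // first and last letters do not

        // Latin-1 keeps the Italian word but replaces unmappable letters with '?'.
        System.out.println(roundTrip(italian, StandardCharsets.ISO_8859_1).equals(italian));
        System.out.println(roundTrip(polish, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent every Java string, so both survive unchanged.
        System.out.println(roundTrip(polish, StandardCharsets.UTF_8).equals(polish));
    }
}
```

With Latin-1 the damage is silent: no exception is thrown, the unmappable
characters simply come back as '?'.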

> 2) The command-line tools blatantly assume that all XML files are Latin-1 
> encoded, regardless of the "encoding" pseudo-attribute in the XML 
> declaration. 

Yes, 'blatantly' is the correct word :)

> However it is a simple matter to correct the source code to let Xerces sort 
> encodings out instead of Xindice: Xerces does it really well (auto-detecting 
> UTF-8, UTF-16 little and big endian, Latin-1, and a host of Asian encodings 
> too). On the output front, I changed the cmd-line tool to always output in 
> UTF-8, but a cleaner solution would be to let the user choose with a cmd-line 
> switch, defaulting to UTF-8.

I don't get it: XML is *designed* to be encoding-safe. Why must a
database client tool have special command-line parameters to indicate
the encoding when it's already indicated inside the document? (And if
you read the XML spec, there are a few encoding-guessing algorithms
explained there.)

Besides: can't the client tool simply ask an XML parser to create SAX
events for it and then store those in the database?

That's how I would have designed it from scratch.

Is there any good reason why this isn't so?
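
A rough sketch of that idea in Java, using the JAXP SAX API that ships
with the JDK: the parser is handed raw bytes, so it works out the encoding
from the XML declaration itself. The class and method names are
hypothetical, and collecting character data into a StringBuilder is only a
stand-in for whatever the database would do with the events.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

final class SaxEncodingSketch {
    // Parse raw XML bytes and return the concatenated character data.
    // No encoding is passed in: the parser detects it from the document.
    static String textOf(byte[] xmlBytes) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xmlBytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String body = "<p>perch\u00e9</p>";
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>" + body)
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?>" + body)
                .getBytes(StandardCharsets.UTF_8);

        // Different bytes on disk, same characters in the SAX events.
        System.out.println(textOf(latin1).equals(textOf(utf8)));
    }
}
```

The important detail is that the tool passes an InputStream rather than a
String or Reader it decoded itself; once the tool has decoded the bytes
with the wrong charset, the parser can no longer repair the text.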

> Remember: the command-line tool simply reads in the XML document to a Java 
> string: this Java string can still only contain Latin-1 defined code-points 
> as they're the only ones Xindice can store internally (for the moment), even if 
> these characters were encoded in UTF-8 in your input XML.

Sorry, but I don't get it: Java is entirely based on Unicode, and
characters are represented as unsigned 16-bit values. Since I've seen
Japanese Java strings with my own eyes, I think you are mistaken in
saying that Java strings can only contain Latin-1 chars.
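
A small Java sketch of the distinction: the string itself holds any
Unicode text, and the Latin-1 limit only appears when something later
encodes it as Latin-1 bytes. The class and the fitsLatin1 helper are
invented for this illustration.

```java
final class JavaStringSketch {
    // True if every UTF-16 code unit is in the Latin-1 range (U+0000..U+00FF).
    static boolean fitsLatin1(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three kanji, held without loss

        System.out.println(Character.SIZE);     // width of a Java char in bits
        System.out.println(japanese.length());  // number of chars in the string
        System.out.println(fitsLatin1(japanese));
        System.out.println(fitsLatin1("perch\u00e9"));
    }
}
```

A check like fitsLatin1 is one way a tool could warn about text that a
Latin-1 store would mangle, instead of silently writing '?'.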
 
> 3) XPath and XUpdate instructions are sent through CORBA (the remote call API 
> used by Xindice's Java interfaces) as is (i.e. as "strings"). Unfortunately, 
> strings are 8-bit in CORBA. Unicode character strings should be typed as 
> wstrings, but for some reason (Kimbro has more on this: see an earlier post), 
> wstrings cause compatibility issues between different CORBA implementations, 
> and so this doesn't work either. So even if you fix point 1), queries will 
> still not work.

Ok, that might be the reason.

I see this as a *BIG* push in the 'throw CORBA away' direction.
 
> The only solution really worth considering here is moving from CORBA to 
> XML-RPC or SOAP, and this is far from over yet (though I'm working on it;) )

Really? I thought we all agreed that SOAP/XML-RPC are far better options
than CORBA.
 
> For Latin-1 character-only documents though (e.g. Italian, Portuguese, 
> Swedish, German, French, Danish, Norwegian, etc...) you can get away with 
> ONLY patching the command-line tools to correctly convert your documents to 
> Java strings and back again.

I saw that Kimbro applied the patches; I'll look into them ASAP.
 
> As for getting it fixed in release 1.0, I'd have liked it too, but Kimbro 
> (rightly) prefers to wait, as making UTF-8 the database's internal encoding 
> breaks existing data files. There isn't really a reason not to just fix the 
> command-line tools though, thus already fixing Stefano's Italian problem...
> Kimbro?

OK for 'release early and often', but please write a 'known bugs' page,
or you'll pretty soon be flooded with encoding-related problem reports
(as happened with Cocoon a while ago).

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------
