Re: [bug] encoding problems

James Bates 26 Feb 2002 21:48:54 -0000

>> My patch attempts to address this by storing all characters internally by
>> their UTF-8 representation rather than their Latin-1 representation, but
>> it's not perfect yet...
>
>What's the problem with this? Maybe I can help, since the XML compilation
>classes that I wrote for Cocoon are already UTF-8 based.


(James) Well no technical problem really, the work is done. (check the
patch). It's just a bit dangerous to apply it now, as it will break existing
datafiles... Kimbro and I plan on introducing it after 1.0 ships.

>> 2) The command-line tools blatantly assume that all XML files are Latin-1
>> encoded, regardless of the "encoding" pseudo-attribute in the XML
>> declaration.
>
>Yes, 'blatantly' is the correct word :)
>
>> However it is a simple matter to correct the source code to let Xerces
>> sort encodings out instead of Xindice: Xerces does it really well
>> (auto-detecting UTF-8, UTF-16 little and big endian, Latin-1, and a host of
>> Asian encodings too). On the output front, I changed the cmd-line tool to
>> always output in UTF-8, but a cleaner solution would be to let the user
>> choose with a cmd-line switch, defaulting to UTF-8.
>
>I don't get it: XML is *designed* to be encoding-safe. Why must a database
>client tool have special command line parameters to indicate what
>encoding that is while it's already indicated inside the document (and
>if you read the XML spec there are a few encoding-guessing algorithms
>explained there)?

(James) I know: Xerces does all that. The proposed command-line tool option
is for OUTPUT (xindiceadmin rd and xindiceadmin export commands). Then users
that don't like the default (utf-8) OUTPUT encoding can still get their
files in latin-1 if they want (e.g. because they don't have or don't like
utf-8 text editors). Input commands (xindiceadmin ad and xindiceadmin import)
require no such option, since as you point out, the information is contained
in the XML file, as per the XML spec.
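
As an illustration of why the input side needs no option, here is a minimal sketch that hands raw bytes to the JDK's built-in XML parser and lets it read the encoding from the XML declaration. The class and method names are hypothetical and this is not the Xindice code; it only assumes the standard javax.xml.parsers API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class ParseDeclaredEncoding {
    // Parse raw bytes (not a pre-decoded String), so the parser itself
    // decides how to decode them from the XML declaration.
    static String rootText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = "citt\u00e0"; // an Italian word with an accented letter
        byte[] latin1 = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><t>" + text + "</t>")
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = ("<?xml version=\"1.0\" encoding=\"UTF-8\"?><t>" + text + "</t>")
                .getBytes(StandardCharsets.UTF_8);
        // The two byte sequences differ, but both decode to the same characters.
        System.out.println(rootText(latin1).equals(rootText(utf8)));
    }
}
```

The point of the sketch is that the caller never names an encoding: the same method accepts both files.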

The output commands option is my wish, but it isn't done yet. For the
moment (i.e. since today) all output from the cmd-line tools is
unconditionally utf-8. No options. No choices.
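
For what the wished-for output option might look like, here is a rough sketch. The `--encoding` switch name and the class are assumptions made up for illustration; no such xindiceadmin option exists yet. It defaults to UTF-8 and writes a matching XML declaration.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class OutputEncodingOption {
    // Hypothetical switch name, chosen only for this sketch.
    static final String SWITCH = "--encoding";

    // Return the charset named after the switch, or UTF-8 when it is absent.
    static Charset chooseCharset(String[] args) {
        for (int i = 0; i + 1 < args.length; i++) {
            if (SWITCH.equals(args[i])) {
                return Charset.forName(args[i + 1]);
            }
        }
        return StandardCharsets.UTF_8;
    }

    // Write the declaration and the body using the same explicit charset,
    // so the declared encoding always matches the bytes produced.
    static void writeDocument(String body, Charset cs, OutputStream out) throws IOException {
        Writer w = new OutputStreamWriter(out, cs);
        w.write("<?xml version=\"1.0\" encoding=\"" + cs.name() + "\"?>\n");
        w.write(body);
        w.flush();
    }

    public static void main(String[] args) throws IOException {
        writeDocument("<t>perch\u00e9</t>\n", chooseCharset(args), System.out);
    }
}
```

Run without arguments this emits UTF-8; run with `--encoding ISO-8859-1` it emits Latin-1 for users whose editors prefer it.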

>Besides: can't the client tool simply ask an XML parser to create SAX
>events for you and then store those in the database?

That's more or less what happens now. Anyway, check this evening's CVS (in
the main source tree): this point (command-line tools) is resolved now.

>> Remember: the command-line tool simply reads in the XML document to a
>> Java string: this Java string can still only contain Latin-1 defined
>> code-points as they're the only ones Xindice can store internally (for the
>> moment), even if these characters were encoded in UTF-8 in your input XML.

>Sorry but I don't get it: Java is entirely based on Unicode and
>characters are represented as unsigned 16 bits. Since I've seen Japanese
>java strings with my eyes, I think you are mistaken saying that java
>strings can only contain Latin-1 chars.

Java strings can indeed contain any Unicode characters, that's what's cool
about them. However, due to point 1) above (a limitation in Xindice, not
Java, but a limitation that frustrates me as much as it does you, believe
me, whence my development contribution ;-) ), only those characters in the
Java string that also exist in Latin-1 will actually make it into the Xindice
data-files (which are byte-based). This is because (for the moment) the
strings are converted to byte arrays using calls that fall back on the
platform's default encoding, such as String.getBytes(/*nothing*/) and
FileReader, and these byte arrays are then used in all of the complex
Tree/Symbol table/compression stuff that goes on inside the Xindice datafiles.
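
To make the loss concrete, here is a small sketch of a string-to-bytes round trip. It names the charset explicitly rather than relying on the platform default, so ISO-8859-1 here stands in for a Latin-1 default (an assumption about the affected setups); the class name is hypothetical and this is not the Xindice storage code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LatinOneLoss {
    // Encode a string to bytes and decode it again with the same charset.
    static String roundTrip(String s, Charset cs) {
        return new String(s.getBytes(cs), cs);
    }

    public static void main(String[] args) {
        String italian = "perch\u00e9";            // all characters exist in Latin-1
        String japanese = "\u65e5\u672c\u8a9e";    // none exist in Latin-1

        // Latin-1 characters survive a Latin-1 round trip.
        System.out.println(roundTrip(italian, StandardCharsets.ISO_8859_1).equals(italian));
        // Unmappable characters are replaced by '?' on the way to bytes.
        System.out.println(roundTrip(japanese, StandardCharsets.ISO_8859_1));
        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(japanese, StandardCharsets.UTF_8).equals(japanese));
    }
}
```

The Java string itself is fine in every case; the damage happens only at the moment it is turned into bytes with a charset that cannot represent it.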

>> 3) XPath and XUpdate instructions are sent through CORBA (the remote call
>> API used by Xindice's Java interfaces) as is (i.e. as "strings").
>> Unfortunately, strings are 8-bit in CORBA. Unicode character strings should
>> be typed as wstrings, but for some reason (Kimbro has more on this: see an
>> earlier post), wstrings cause compatibility issues between different CORBA
>> implementations, and so this doesn't work either. So even if you fix point
>> 1), queries will still not work.

>Ok, that might be the reason.

>I see this as a *BIG* push into the 'throw Corba away' direction.

I agree 100%.

>> The only solution really worth considering here is moving from CORBA to
>> XML-RPC or SOAP, and this is far from over yet (though I'm working on
>> it ;) )

>really? I thought we all agreed that SOAP/XML-RPC are far better options
>than CORBA.

We agree, but the work isn't *done* yet  ;)
In fact XML-RPC isn't much good either, as it accepts only ASCII (yuck!) as
strings (even worse than CORBA). I'm thus working flat out on a SOAP/WSDL
solution now. (I hope to have something presentable by Friday.)


>> For Latin-1 character only documents though (e.g. Italian, Portuguese,
>> Swedish, German, French, Danish, Norwegian, etc...) you can get away with
>> ONLY patching the command-line tools to correctly convert your documents to
>> Java strings and back again.

(James) Again, I'm referring here to the *actual characters* your document
can contain *regardless* of what encoding is used to represent them. So you
can still upload UTF-8 or UCS documents containing characters, as long as
the characters being represented are characters that also exist in Latin-1.
Your Italian utf-8 files for example should work fine.
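
A quick way to check whether a given document's text falls inside that limit is sketched below, using the standard CharsetEncoder.canEncode call. The class and method names are hypothetical; Xindice itself performs no such check as far as this thread says.

```java
import java.nio.charset.StandardCharsets;

public class LatinOneCheck {
    // True when every character of the text also exists in Latin-1,
    // whatever encoding the file on disk happened to use.
    static boolean storableAsLatin1(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(storableAsLatin1("citt\u00e0 perch\u00e9")); // Italian text
        System.out.println(storableAsLatin1("\u20ac 10"));              // euro sign, not in Latin-1
        System.out.println(storableAsLatin1("\u65e5\u672c\u8a9e"));     // Japanese text
    }
}
```

Note that the check works on the decoded characters, which is exactly the distinction made above: the file's encoding does not matter, only the character repertoire does.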

>>I saw that Kimbro applied the patches, I'll look into it ASAP.

>> As for getting it fixed in release 1.0, I'd have liked it too, but Kimbro
>> (rightly) prefers to wait, as making UTF-8 the database's internal encoding
>> breaks existing datafiles. There isn't really a reason not to just fix the
>> command-line tools though, thus already fixing Stefano's Italian problem...
>> Kimbro?

(James) This is now done.

>Ok for release early and often, but, please, write a 'known bugs' page
>or you'll pretty soon be flooded with 'encoding-related' problems (as it
>happened with Cocoon a while ago).

I know very well, I was one of the complainers ;) I agree with the "known
bugs" page; we should also mention, imho, that we intend to fix it asap.


That's about as much as I can see toward answering the questions. Hope it
helps,

James
