utf-8 working code... caution with existing data files.

James Bates 7 May 2002 17:38:40 -0000

Boys (and girls?),

I have a patch for the current Xindice CVS that supports reading and writing 
UTF-8 files containing any Unicode characters you want, into and out of 
Xindice. This means Greek, Hebrew, Korean, Chinese, Arabic, Russian, etc. To 
allow for this, I have had to modify the internal data format of Xindice files, 
meaning that existing Xindice databases will appear corrupt to Xindice patched 
with this new code.


It is however necessary in my opinion, as discussed in earlier posts, to 
migrate toward this.

In practice, this will affect ONLY databases that contain XML documents with 
NON-ASCII characters. ASCII characters are: English letters, digits, 
punctuation marks, whitespace, as well as some control characters like delete 
and backspace. There are 128 ASCII characters in all. So as long as your 
databases use documents with only these characters, the patch won't affect 
your datafiles.
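
To check whether a given document falls into the "safe" ASCII-only group, 
something like the following could be used. This is only a minimal sketch, not 
code from the patch: the class name AsciiCheck and its two methods are 
hypothetical, and it relies only on the general fact that the 128 ASCII 
characters have the same byte values in UTF-8. It says nothing about how 
Xindice itself lays out its data files.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Xindice or of the patch.
final class AsciiCheck {

    // True if every character is one of the 128 ASCII characters (0..127).
    static boolean isPureAscii(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return false;
            }
        }
        return true;
    }

    // True if the text encodes to identical bytes in US-ASCII and UTF-8.
    // Non-ASCII characters become '?' in US-ASCII, so the arrays differ.
    static boolean sameBytesInUtf8(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(ascii, utf8);
    }

    public static void main(String[] args) {
        String plain = "<doc>Hello, world!</doc>";
        String accented = "<doc>caf\u00e9</doc>"; // contains e-acute (U+00E9)

        System.out.println(isPureAscii(plain));        // true
        System.out.println(isPureAscii(accented));     // false
        System.out.println(sameBytesInUtf8(plain));    // true
        System.out.println(sameBytesInUtf8(accented)); // false
    }
}
```

Documents for which isPureAscii returns false are the ones to look at before 
moving a database over to the patched code.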

Typical non-ASCII characters, which will cause incompatibilities between old 
and new database files, include: French, Spanish, etc. accented characters; 
currency symbols (EUR and others); non-breaking spaces (&nbsp; in HTML); 
curly quotes; the copyright sign (©); etc.

Because of these possible incompatibilities, I'd like to WARN people and try to 
co-ordinate applying the patch so as to cause as little disruption as possible. 
You can already check it out at
http://lambiek.amplexor.be/downloads/xindice/new-utf8-patch. 

I don't believe you NEED to use the XML-RPC client for just reading/writing 
documents, though I haven't really tested the CORBA client recently. Using 
XPaths and XUpdates with non-ASCII characters will definitely not work in 
CORBA, but should now work with the XML-RPC interface. (I need to test some 
more myself, though.)

Anyway, let me know how and when I can commit this patch...

James
