Re: utf-8 working code... caution with existing data files.

Stefano Mazzocchi 9 May 2002 15:57:38 -0000

James Bates wrote:
> 
> Boys (and girls?),
> 
> I have a patch for the current Xindice CVS that supports reading and writing 
> UTF-8 files containing any Unicode characters into and out of Xindice. 
> This means Greek, Hebrew, Korean, Chinese, Arabic, Russian, etc. 
> To allow for this, I have had to modify the internal data format of Xindice 
> files, meaning that existing Xindice databases will appear corrupt to a 
> Xindice patched with this new code.
> 
> It is however necessary in my opinion, as discussed in earlier posts, to 
> migrate toward this.
> 
> In reality, this will affect ONLY databases that contain XML documents with 
> NON-ASCII characters. ASCII characters are: English letters, digits, 
> punctuation marks, whitespace, and some control characters like 
> delete and backspace. There are 128 ASCII characters in all. So as long as 
> your databases hold documents with only these characters, the patch 
> won't affect your data files.
> 
> Typical non-ASCII characters, which will cause incompatibilities between old 
> and new database files, include: the accented letters used in French, 
> Spanish, etc.; currency symbols such as the euro sign (EUR); non-breaking 
> spaces (&nbsp; in HTML); curly quotes; the copyright sign; etc.
> 
> Because of these possible incompatibilities, I'd like to WARN people and try 
> to coordinate applying the patch so as to cause as little disruption as 
> possible. You can already check it out at
> http://lambiek.amplexor.be/downloads/xindice/new-utf8-patch.
> 
> I don't believe you NEED to use the XML-RPC client for just reading/writing 
> documents, though I haven't really tested the CORBA client anymore... Using 
> XPaths and XUpdates with non-ASCII characters will definitely not work in 
> CORBA, but should now work with the XML-RPC interface. (I need to test some 
> more myself though.)
> 
> Anyway, let me know how and when I can commit this patch...
> 
> James
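
To find out whether an existing database would be affected, one could scan
each stored document for characters outside the 7-bit ASCII range described
above. A minimal Java sketch; the class and method names are illustrative
assumptions and are not part of Xindice or of the patch:

```java
final class AsciiCheck {
    private AsciiCheck() {}

    /** Returns true if every char of the text is in the ASCII range 0..127. */
    static boolean isPureAscii(CharSequence text) {
        return firstNonAscii(text) < 0;
    }

    /** Returns the index of the first non-ASCII char, or -1 if there is none. */
    static int firstNonAscii(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) > 0x7F) {
                return i;
            }
        }
        return -1;
    }
}
```

Documents for which isPureAscii returns true fall into the "not affected"
group; anything else would need to be re-imported after the patch.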


+1 for committing as early as possible.

The trick would be to write a client that serializes the entire
database into one big XML file, and another one for the new version that
imports this XML dump file (which can use namespaces to mark
Xindice-specific data along the tree).
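
A rough Java sketch of such a dump/import pair, using only the JDK's XML
APIs. Everything here is an assumption for illustration: the namespace URI,
the element and attribute names, and the simplification of storing each
document as escaped text rather than as a subtree. Reading the documents out
of the old database and storing them into the new one is left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class DumpSketch {
    // Hypothetical namespace marking dump-specific (non-document) data.
    static final String DUMP_NS = "urn:example:xindice-dump";

    /** Writes all documents (key -> XML text) as one UTF-8 dump file. */
    static void writeDump(String collection, Map<String, String> docs,
                          OutputStream out) throws Exception {
        XMLStreamWriter w =
            XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
        w.writeStartDocument("UTF-8", "1.0");
        w.writeStartElement("dump", "collection", DUMP_NS);
        w.writeNamespace("dump", DUMP_NS);
        w.writeAttribute("name", collection);
        for (Map.Entry<String, String> e : docs.entrySet()) {
            w.writeStartElement("dump", "document", DUMP_NS);
            w.writeAttribute("key", e.getKey());
            // Simplification: the document is stored as escaped text.
            w.writeCharacters(e.getValue());
            w.writeEndElement();
        }
        w.writeEndElement();
        w.writeEndDocument();
        w.close();
    }

    /** Reads a dump back into a key -> XML text map, in document order. */
    static Map<String, String> readDump(InputStream in) throws Exception {
        DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
        f.setNamespaceAware(true);
        Document dom = f.newDocumentBuilder().parse(in);
        NodeList nodes = dom.getElementsByTagNameNS(DUMP_NS, "document");
        Map<String, String> docs = new LinkedHashMap<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            Element el = (Element) nodes.item(i);
            docs.put(el.getAttribute("key"), el.getTextContent());
        }
        return docs;
    }

    /** Convenience for trying it out: dump to memory, then import again. */
    static Map<String, String> roundTrip(String collection,
                                         Map<String, String> docs) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeDump(collection, docs, buf);
            return readDump(new ByteArrayInputStream(buf.toByteArray()));
        } catch (Exception ex) {
            throw new RuntimeException(ex);
        }
    }
}
```

Because the dump is ordinary UTF-8 XML, it does not depend on either
version's internal file format, which is the point of going through it.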

What do you think?

[NOTE: Xindice is totally useless to me today precisely because of the
lack of proper encoding support and of available metadata... and I've met
tons of people who believe exactly the same, so I'd suggest patching these
two things and then doing a 1.1 release ASAP... this is very likely the
reason why this community is stagnating, so this might be a good thing to patch]

I volunteer to work on the metadata since I will badly need it in the future.
I just don't know how to do it, and I think the XML:DB API is slowing us
down rather than helping us in any way.

Comments?

-- 
Stefano Mazzocchi      One must still have chaos in oneself to be
                          able to give birth to a dancing star.
<[EMAIL PROTECTED]>                             Friedrich Nietzsche
--------------------------------------------------------------------
