On Thu, 1 Feb 2001, Tom Kaiser wrote:

> Thanks VERY MUCH for these comments. Well the major drawback of the 
> solution based on iconv is the following. We're using expat as the xml 
> parser. If expat sees a document whose encoding is not among the 4 
> encodings it supports internally, it asks you for a certain table for 
> that encoding. The table has an item for each byte between 0x80 and 
> 0xff. The item for byte N specifies the length of the character codes in 
> which N appears as the leading byte.
> 
> Now the problem is that, sadly, we can't get this information from 
> iconv. At least for encodings which use codes of more than 2 bytes. So 
> in addition to iconv, we'd still need sort of "definition files" for 
> each supported encoding. I think it's clear that this greatly reduces 
> the advantage of using iconv. Obviously this is not a drawback of iconv 
> itself, but rather of iconv in combination with expat.
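
For anyone following along, here is a rough sketch of the kind of table
Tom describes, as I understand expat's unknown-encoding interface: a
256-entry int map where a value >= 0 is the Unicode code point of a
single-byte character, -1 marks a byte that cannot start a character, and
-n (n > 1) marks the lead byte of an n-byte sequence. The byte ranges
below are a made-up EUC-style layout for illustration only, not the
definition of any real encoding, and fill_euc_style_map is a name I
invented.

```c
/* Fill a 256-entry lead-byte table in the style of expat's
 * XML_Encoding.map. The ranges are hypothetical (EUC-like). */
static void fill_euc_style_map(int map[256])
{
    int b;

    for (b = 0; b < 256; b++) {
        if (b < 0x80)
            map[b] = b;   /* ASCII: one byte, code point == byte value */
        else if (b >= 0xA1 && b <= 0xFE)
            map[b] = -2;  /* lead byte of a two-byte sequence */
        else
            map[b] = -1;  /* not valid as the first byte of a character */
    }
}
```

In real use the values would be copied into the map member of the
XML_Encoding struct handed to the unknown-encoding handler. The lead-byte
lengths are exactly the part that has to come from somewhere other than
iconv.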

Tom, I can dig out some C code which will get the encoding from the XML
declaration. Then it would just be a matter of running iconv over the
entire file to convert it to UTF-8, and passing the result directly to
expat without the XML declaration. Would that help matters, or is there
something fundamental I'm missing? (It's probably slower, because the
whole file would probably have to be held in memory.) I understand it's
not an ideal way to go, though. I wish expat were a bit more flexible in
this regard.
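
Something along these lines is what I have in mind. This is only a
sketch, not the code I mentioned: sniff_encoding and to_utf8 are names
made up for this mail, the sniffing assumes an ASCII-compatible input
encoding (so it won't cope with UTF-16 or EBCDIC), and the output buffer
is simply sized at four bytes per input byte rather than grown on demand.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

static size_t skip_ws(const char *buf, size_t i, size_t end)
{
    while (i < end && (buf[i] == ' ' || buf[i] == '\t' ||
                       buf[i] == '\r' || buf[i] == '\n'))
        i++;
    return i;
}

/* Copy the value of encoding="..." from the XML declaration into out.
 * Returns 0 on success, -1 if there is no usable declaration. */
static int sniff_encoding(const char *buf, size_t len, char *out, size_t outsz)
{
    static const char key[] = "encoding";
    const size_t klen = sizeof key - 1;
    size_t i, end = 0, n = 0;
    char quote;

    if (outsz == 0 || len < 5 || memcmp(buf, "<?xml", 5) != 0)
        return -1;
    for (i = 5; i + 1 < len; i++) {          /* find the closing "?>" */
        if (buf[i] == '?' && buf[i + 1] == '>') {
            end = i;
            break;
        }
    }
    if (end == 0)
        return -1;
    for (i = 5; i + klen <= end; i++)        /* find "encoding" */
        if (memcmp(buf + i, key, klen) == 0)
            break;
    if (i + klen > end)
        return -1;
    i = skip_ws(buf, i + klen, end);
    if (i >= end || buf[i] != '=')
        return -1;
    i = skip_ws(buf, i + 1, end);
    if (i >= end || (buf[i] != '"' && buf[i] != '\''))
        return -1;
    quote = buf[i++];
    while (i < end && buf[i] != quote && n + 1 < outsz)
        out[n++] = buf[i++];
    if (i >= end || buf[i] != quote)         /* unterminated or too long */
        return -1;
    out[n] = '\0';
    return 0;
}

/* Convert in[0..inlen) from encoding `from` to UTF-8 in one pass.
 * Returns a malloc'd buffer (caller frees) or NULL on any failure. */
static char *to_utf8(const char *from, const char *in, size_t inlen,
                     size_t *outlen)
{
    size_t cap = inlen * 4 + 4;              /* assumed generous bound */
    size_t srcleft = inlen, dstleft = cap;
    char *src = (char *)in, *out, *dst;
    iconv_t cd = iconv_open("UTF-8", from);

    if (cd == (iconv_t)-1)
        return NULL;                         /* encoding not known to iconv */
    out = dst = malloc(cap);
    if (out == NULL ||
        iconv(cd, &src, &srcleft, &dst, &dstleft) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }
    iconv_close(cd);
    *outlen = cap - dstleft;
    return out;
}
```

The idea would be to call sniff_encoding on the start of the file, hand
the name to to_utf8, and feed the result to expat. If I remember the
expat docs correctly, creating the parser with an explicit "UTF-8"
encoding overrides whatever the declaration says, so it may not even be
necessary to strip the declaration first.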

> Then there are platforms where iconv may not be available (portable 
> machines with an incomplete unix). For these, the other approach would 
> be ideal: you could even decide which encodings you would use, and 
> install only the corresponding files.
> 
> To reply to Mark's message: yes there's at least one widely used CJK 
> encoding, called Chinese BG I believe, which is not covered by the 
> XML::Encoding module.
> 
> I'd like to make it clear that using *encoding files* from a Perl module 
> doesn't mean tying Sablotron to Perl. It would remain just as 
> standalone, except that these particular encoding files (1) are 
> available for use, (2) seem to represent a certain standard, and (3) can 
> be extended to include new encodings quite easily (it seems).
> 
> I'll appreciate any further comments. I hope it's understood that I'm 
> still trying to evaluate the possible ways rather than advocating any of 
> them. If I'm missing an elegant solution using iconv, I'll be more than 
> happy to learn about it.

It's good to get a more technical explanation, thanks. I'm guessing from
most people's replies that they are more concerned about output encoding
than input encoding.

-- 
<Matt/>

    /||    ** Director and CTO **
   //||    **  AxKit.com Ltd   **  ** XML Application Serving **
  // ||    ** http://axkit.org **  ** XSLT, XPathScript, XSP  **
 // \\| // **     Personal Web Site: http://sergeant.org/     **
     \\//
     //\\
    //  \\

