Re: Bug in UTF-8 output

Jason E. Stewart Sat, 27 Oct 2001 03:40:24 -0700

"Jiri Fiser" <[EMAIL PROTECTED]> writes:

> I want to use your XML Perl module (1.5.1) for processing XML documents
> which are written in Czech (diploma theses of my students). These
> documents will be written in ISO Latin 2 (Linux) or CP 1250 (Windows)
> encodings and I want to transform them to UTF-8 encoding before the
> processing.
> 
> I have tried your XML module and in standard conditions (Linux RH 7.2,
> locale = cs_CZ) it's all OK, all strings were converted from UTF-8 to
> ISO Latin 2 (8859-2) (without error). Unfortunately I need output in
> UTF-8. When I tried the locale cs_CZ.utf8 with the utf8 option in
> Perl 5.6.0, the output was in UTF-8, but strings with multibyte (=2 bytes
> for Czech) UTF-8 characters are reduced (shortened) at their end
> (probably by one character for each multibyte character in string


It seems that I never got around to answering your question. The
answer is that I bollocksed the Unicode support for XML::Xerces.

<technical-digression>
In order to get perl to talk to Xerces-C++ we build a layer of C++
glue code that talks both to the Perl C API and to the Xerces C++ API:

XML::Xerces <=> Perl C API <=> Glue Code <=> Xerces C++ API

The XML::Xerces <=> Perl C API step is done automatically by Perl. The
glue code is generated automatically for us by SWIG. In order to make it
nice and simple to write applications using XML::Xerces, we allow (or force)
people to use generic Perl strings instead of the nasty brutish
DOMString's or the XMLCh*'s that the C++ API uses. For every method
that takes string arguments, SWIG inserts some code that converts
Perl's strings into the DOMString objects or the XMLCh*'s that Xerces
wants. And for every method that returns a DOMString or an XMLCh*,
SWIG converts it to a Perl string.

</technical-digression>

What I bollocksed was the conversion routines: I mashed everything
into plain old vanilla char*'s... Same old stupid American programmer
mistake. Until recently all the test files I had were either ASCII or
ISO-8859-1, so I didn't realize I had screwed up.
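
To make the mismatch concrete, here is a small standalone C++ sketch
(not code from XML::Xerces; the function names are made up for
illustration). It only shows that a UTF-8 Czech string has more bytes
than characters, which is the kind of discrepancy a char*-based
conversion can trip over. Whether this is exactly how the reported
truncation arises is an assumption on my part.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 byte string by skipping
// continuation bytes (those of the form 10xxxxxx).
// Assumes the input is well-formed UTF-8.
std::size_t utf8_code_points(const std::string& bytes) {
    std::size_t n = 0;
    for (unsigned char c : bytes) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

// Hypothetical usage: "Jiří" written as explicit UTF-8 bytes.
// U+0159 (r with caron) is C5 99, U+00ED (i with acute) is C3 AD.
void check_byte_vs_char_count() {
    const std::string name = "Ji\xC5\x99\xC3\xAD";
    assert(name.size() == 6);              // six bytes on the wire
    assert(utf8_code_points(name) == 4);   // but only four characters
}
```

Any code that measures the string in one unit and copies it in the
other will be off by one for each two-byte character.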

Here's an example of the typemap converter code I mentioned:

  %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
    if (  SvPOK( $source )  ) {
      char *ptr = (char *)SvPV( $source,PL_na);
  //  ^^^^^^^^^^^^^^^^^^^^      
  //  HERE THERE BE DRAGONS
      $target = temp_qualifiedName = XMLString::transcode(ptr);
    } else {
      croak("Type error in argument 2 of $name, Expected perl-string.");
      XSRETURN(1);
    }
  }

What I'll need to do is to put a test in there so the code looks like:

  %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
    if (  SvPOK( $source )  ) {
      if (SvUTF8($source)) {
        // turn it into a UTF8 XMLCh*
      } else {
        // turn it into an ISO-8859-1 XMLCh*
      }
    } else {
      croak("Type error in argument 2 of $name, Expected perl-string.");
      XSRETURN(1);
    }
  }

I'm not sure of the exact code for the first case, so I'll have to
check. Luckily, because I use SWIG, I only have to change the code in
a few places.
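
As a rough idea of what the two branches have to do, here is a
standalone C++ sketch. It does not use the Xerces or Perl APIs at all:
it assumes XMLCh is a 16-bit UTF-16 code unit and uses char16_t as a
stand-in, and both function names are hypothetical. The real typemap
would more likely call a Xerces transcoder than hand-roll the decoding.

```cpp
#include <cstddef>
#include <string>

// ISO-8859-1 branch: every byte value is already the code point,
// so widening each byte to 16 bits is enough.
std::u16string latin1_to_utf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char c : bytes) {
        out.push_back(static_cast<char16_t>(c));
    }
    return out;
}

// UTF-8 branch: decode 1 to 4 byte sequences into UTF-16.
// Malformed bytes become U+FFFD. This sketch does not reject
// overlong encodings or encoded surrogates.
std::u16string utf8_to_utf16(const std::string& bytes) {
    const char16_t replacement = static_cast<char16_t>(0xFFFD);
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(replacement); ++i; continue; }

        // Make sure the whole sequence is present and well-formed.
        bool ok = (i + extra < bytes.size());
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char cont =
                static_cast<unsigned char>(bytes[i + k]);
            if ((cont & 0xC0) != 0x80) {
                ok = false;
            } else {
                cp = static_cast<char32_t>((cp << 6) | (cont & 0x3F));
            }
        }
        if (!ok) { out.push_back(replacement); ++i; continue; }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            out.push_back(replacement);
        } else if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

Feeding the same UTF-8 bytes through the Latin-1 branch yields one
16-bit unit per byte instead of one per character, which is why the
SvUTF8 test has to pick the right branch.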

Thanks for figuring this out,
jas.

