"Jiri Fiser" <[EMAIL PROTECTED]> writes:
> I want to use your XML Perl module (1.5.1) for processing XML documents
> which are written in Czech (diploma theses of my students). These
> documents will be written in ISO Latin 2 (Linux) or CP 1250 (Windows)
> encodings and I want to transform them to UTF-8 encoding before the
> processing.
>
> I have tried your XML module and in standard conditions (Linux RH 7.2,
> locale = cs_CZ) it's all OK, all strings were converted from UTF-8 to
> ISO Latin 2 (8859-2) (without error). Unfortunately I need output in
> UTF-8. When I tried the locale cs_CZ.utf8 with the utf8 option in
> Perl 5.6.0, the output was in UTF-8, but strings with multibyte (=2 bytes
> for Czech) UTF-8 characters are reduced (shortened) at their end
> (probably by one character for each multibyte character in string
It seems that I never got around to answering your question. The
answer is that I bollocksed the Unicode support for Xerces.
<technical-digression>
In order to get Perl to talk to Xerces-C++ we build a layer of C++
glue code that talks both to the Perl C API and to the Xerces C++ API:

    XML::Xerces <=> Perl C API <=> Glue Code <=> Xerces C++ API

The XML::Xerces <=> Perl C API link is handled automatically by Perl. The
glue code is generated automatically for us by SWIG. In order to make it
nice and simple to write applications using XML::Xerces, we allow (or force)
people to use generic Perl strings instead of the nasty brutish
DOMString's or XMLCh*'s that the C++ API uses. For every method that
takes string arguments, SWIG inserts some code that converts Perl's
strings into the DOMString objects or the XMLCh*'s that Xerces wants. And
for every method that returns a DOMString or an XMLCh*, SWIG converts it
to a Perl string.
</technical-digression>
What I bollocksed was the conversion routines: I mashed everything
into plain old vanilla char*'s... Same old stupid American programmer
mistake. Until recently all the test files I had were either ASCII or
ISO-8859-1, so I didn't realize I had screwed up.
Here's an example of the typemap converter code I mentioned:
    %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
      if ( SvPOK( $source ) ) {
        char *ptr = (char *)SvPV( $source,PL_na);
        //          ^^^^^^^^^^^^^^^^^^^^
        //          HERE THERE BE DRAGONS
        $target = temp_qualifiedName = XMLString::transcode(ptr);
      } else {
        croak("Type error in argument 2 of $name, Expected perl-string.");
        XSRETURN(1);
      }
    }
What I'll need to do is to put a test in there so the code looks like:
    %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
      if ( SvPOK( $source ) ) {
        if (SvUTF8($source)) {
          // turn it into a UTF8 XMLCh*
        } else {
          // turn it into an ISO-8859-1 XMLCh*
        }
      } else {
        croak("Type error in argument 2 of $name, Expected perl-string.");
        XSRETURN(1);
      }
    }
I'm not sure of the exact code for the first case, so I'll have to
check. Luckily, because I use SWIG, I only have to change the code in
a few places.
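
To make the difference between the two branches concrete, here is a
minimal standalone C++ sketch of what each conversion has to do. It is
not the XML::Xerces code and it does not use the Xerces transcoding API:
the helper names latin1_to_utf16 and utf8_to_utf16 are hypothetical,
std::u16string stands in for an XMLCh* buffer (assuming 16-bit code
units), and the UTF-8 decoder only checks for truncated or malformed
sequences, not overlong encodings.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper for the non-UTF8 branch: every ISO-8859-1 byte
// value is its own Unicode code point, so widening each byte is enough.
std::u16string latin1_to_utf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    assert(out.size() == bytes.size());  // one code unit per input byte
    return out;
}

// Hypothetical helper for the SvUTF8 branch: decode multibyte UTF-8
// sequences into UTF-16 code units (surrogate pairs above U+FFFF).
std::u16string utf8_to_utf16(const std::string& bytes) {
    std::u16string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > bytes.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }
        if (cp >= 0x10000) {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}
```

Feeding the two UTF-8 bytes of a Czech letter such as U+010D through
latin1_to_utf16 yields two code units, while utf8_to_utf16 yields one.
A byte count and a character count that disagree in this way are the
kind of mismatch the SvUTF8 test is there to avoid.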
Thanks for figuring this out,
jas.