Re: Unicode compliance for Xerces.pm

Jason E. Stewart Sun, 28 Oct 2001 19:25:40 -0800

I've had a chance to play around with Perl's Unicode support and the
issues of transcoding between Perl's UTF-8 and Xerces' UTF-16.

"Jason E. Stewart" <[EMAIL PROTECTED]> writes:

> 1. When Xerces returns a string that has high-order UTF-8 characters
>    (i.e. when the chars are outside the ASCII range 0-127) I'll have to
>    transcode from UTF-16 into UTF-8.
>   
> 2. This will involve a lot of back and forth transcoding if the
>    document contains a significant amount of high-order information. I
>    don't know what effect this will have on the running time of the
>    code. 

OK. These two are clearly dumb, because I have to transcode no matter
what. UTF-16 can't deal with straight ASCII because it uses two-byte
code units. So there's no *more* transcoding with UTF-8 than there was
with just ASCII or ISO-8859-1.
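
A minimal sketch of that point, assuming the Encode module (core only
since Perl 5.8, so later than the 5.6 releases discussed here) and
little-endian UTF-16 with no byte-order mark. It is only an
illustration of the size change, not the Xerces glue code:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Pure ASCII still changes representation: one byte per character in
# the Perl string, two bytes per character once encoded as UTF-16.
my $ascii = 'foo';
my $utf16 = encode('UTF-16LE', $ascii);
printf "%d chars -> %d UTF-16 bytes\n", length($ascii), length($utf16);

# A non-ASCII character (e-acute, U+00E9) goes through the same path.
my $accented   = "caf\x{e9}";
my $round_trip = decode('UTF-16LE', encode('UTF-16LE', $accented));
print "round trip ok\n" if $round_trip eq $accented;
```

Either way the string is converted once on the way in and once on the
way out, whether or not it contains anything outside ASCII.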

> 3. To make the glue code simple that passes arguments from Perl and
>    hands them to Xerces, I will always have to call the XMLCh*
>    interfaces instead of the char* ones. 

This I think is a good thing. No one will notice.

> 4. Currently, if a Xerces API method returns a DOMString object or an
>    XMLCh*, there is no way to keep that object; the glue code converts
>    all of them into Perl strings for 'convenience'. I think this is a
>    feature, but it might turn out to have performance benefits to
>    allow users to keep them around. 
> 
> 5. All of this is going to require the use of Perl-5.6.0 or better. I
>    get a lot of notices from people still using 5.004 and 5.005, so
>    this is going to mean upgrading for a lot of people. It is possible
>    that I can make the code conditional and people with 5.005/4 could
>    compile XML::Xerces but just not get unicode support. 
> 
>      I WILL NOT ATTEMPT THIS WITHOUT ASSISTANCE => it's a lot of
>      work.

Seriously. This is a lot of work. I'm not even going to think about
this unless someone pipes up.

> The issue with 4 is tricky. Perl is great about giving lots of
> information about what context a method is being called in. For
> example the following all look different to perl:
> 
> I believe that it can be handled with a pragma similar to that of 
> 
>   use utf8;
> 
> maybe 
> 
>   use utf16;

I think this is dumb. It should just work. All methods that return a
DOMString object in C++ should do the same in Perl, except that in
Perl you should be able to use a DOMString object in *exactly* the
same way that you would use a Perl string:

  my $dom_string = $element->getAttribute('foo');
  $dom_string .= 'bar'; # concatenation
  print STDERR "The new value for foo is: $dom_string\n"; # stringify

etc...

The same for XMLCh*. There should just be a class, maybe
XML::Xerces::XMLCh or XML::Xerces::XMLString and any method that
returns an XMLCh* would just return an XML::Xerces::XMLString
instance. 
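
A rough sketch of how such a class could behave like a Perl string,
using the overload pragma. Everything here is an assumption made for
illustration: the constructor, the from_perl, as_string and concat
methods, and the choice to store raw UTF-16LE bytes are hypothetical
and are not part of XML::Xerces. It also relies on the Encode module,
which needs a newer Perl than 5.6:

```perl
package XML::Xerces::XMLString;   # name from the proposal; body is hypothetical
use strict;
use warnings;
use Encode qw(encode decode);
use Scalar::Util qw(blessed);

# Make instances usable wherever a plain Perl string is expected.
# With fallback => 1, '.=' and 'eq' are derived from these two handlers.
use overload
    '""'     => \&as_string,
    '.'      => \&concat,
    fallback => 1;

# Wrap raw UTF-16LE bytes, as they might arrive from the C++ side.
sub new {
    my ($class, $utf16) = @_;
    return bless { utf16 => $utf16 }, $class;
}

# Convenience constructor from an ordinary Perl string.
sub from_perl {
    my ($class, $str) = @_;
    return $class->new(encode('UTF-16LE', "$str"));
}

# Transcode only when a Perl string is actually needed.
sub as_string {
    my ($self) = @_;
    my $octets = $self->{utf16};          # copy, so the stored bytes stay intact
    return decode('UTF-16LE', $octets);
}

# Concatenate in UTF-16 and return a new object, so chained operations
# never leave the UTF-16 representation.
sub concat {
    my ($self, $other, $swapped) = @_;
    my $other_utf16 = (blessed($other) && $other->isa(__PACKAGE__))
        ? $other->{utf16}
        : encode('UTF-16LE', "$other");
    my $joined = $swapped ? $other_utf16 . $self->{utf16}
                          : $self->{utf16} . $other_utf16;
    return ref($self)->new($joined);
}

package main;

my $dom_string = XML::Xerces::XMLString->from_perl("caf\x{e9}");
$dom_string .= 'bar';                                   # concatenation
binmode(STDERR, ':encoding(UTF-8)');
print STDERR "The new value for foo is: $dom_string\n"; # stringify
```

In this sketch the only transcoding back to a Perl string happens at
the final print, which is the saving described in the next paragraph.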

This would mean a lot less transcoding, AND it would solve some of the
issues with Perl and Unicode. Both Perl 5.6.0 and 5.6.1 have
significant Unicode bugs. It's not until the 5.7.2 development series
that many really key patches come in. 

jas.
