Unicode compliance for Xerces.pm

Jason E. Stewart Sat, 27 Oct 2001 11:29:42 -0700

"Jason E. Stewart" <[EMAIL PROTECTED]> writes:

> What I'll need to do is to put a test in there so the code looks like:
> 
>   %typemap(perl5, in) const XMLCh* qualifiedName (XMLCh *temp_qualifiedName) {
>     if (  SvPOK( $source )  ) {
>       if (SvUTF8($source)) {
>         // transcode it from UTF-8 into an XMLCh*
>       } else {
>         // transcode it from ISO-8859-1 into an XMLCh*
>       }
>     } else {
>       croak("Type error in argument 2 of $name, Expected perl-string.");
>       XSRETURN(1);
>     }
>   }


There are a couple of noteworthy consequences to this:

1. When Xerces returns a string that contains characters outside the
   ASCII range (0-127), I'll have to transcode it from UTF-16 into
   UTF-8.
  
2. This will involve a lot of back-and-forth transcoding if the
   document contains a significant amount of non-ASCII data. I
   don't know what effect this will have on the running time of the
   code.
  
3. To keep the glue code that passes arguments from Perl to Xerces
   simple, I will always have to call the XMLCh* interfaces instead
   of the char* ones.
  
4. Currently, if a Xerces API method returns a DOMString object or an
   XMLCh*, there is no way to keep that object: the glue code converts
   all of them into Perl strings for 'convenience'. I think this is a
   feature, but it might turn out to have performance benefits to
   allow users to keep them around.

5. All of this is going to require Perl 5.6.0 or better. I
   get a lot of notices from people still using 5.004 and 5.005, so
   this is going to mean upgrading for a lot of people. It is possible
   that I can make the code conditional, so that people with 5.004 or
   5.005 could compile XML::Xerces but just not get Unicode support.

     I WILL NOT ATTEMPT THIS WITHOUT ASSISTANCE => it's a lot of work.
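
As a rough illustration of the transcoding in items 1 to 3, here is a
Perl-level sketch. It is not the glue code itself: it assumes the Encode
module (core only from Perl 5.8, so newer than the 5.6.0 baseline
above), the helper names xmlch_to_perl and perl_to_xmlch are
hypothetical, and a UTF-16LE byte string stands in for an XMLCh* buffer.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: what the return path would do with UTF-16 data
# from Xerces (a UTF-16LE byte string stands in for an XMLCh*).
sub xmlch_to_perl {
    my ($utf16_bytes) = @_;
    return decode('UTF-16LE', $utf16_bytes);   # Perl character string
}

# Hypothetical helper: the argument path. Perl treats a string without
# the UTF-8 flag as ISO-8859-1 characters, so encode() covers both
# branches of the SvUTF8 test.
sub perl_to_xmlch {
    my ($string) = @_;
    return encode('UTF-16LE', $string);
}

my $latin1  = "caf\xe9";         # not flagged as UTF-8
my $flagged = $latin1;
utf8::upgrade($flagged);         # same characters, UTF-8 flag set

my $from_latin1  = perl_to_xmlch($latin1);
my $from_flagged = perl_to_xmlch($flagged);
my $round_trip   = xmlch_to_perl($from_flagged);
```

Both inputs produce the same UTF-16 data, and every call pays for one
transcoding step in each direction, which is the cost item 2 worries
about.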

The issue with item 4 is tricky. Perl is great about giving lots of
information about the context a method is being called in. For
example, the following all look different to Perl:

$a = foo(); # scalar context
@a = foo(); # list context
foo();      # void context
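
A minimal sketch of how a sub can detect those three contexts with the
built-in wantarray. The sub foo and the variable $last_context are only
illustrative names; nothing here is taken from XML::Xerces.

```perl
use strict;
use warnings;

my $last_context;   # records how foo() was most recently called

sub foo {
    my $want = wantarray;   # true: list, false: scalar, undef: void
    $last_context = !defined $want ? 'void'
                  : $want          ? 'list'
                  :                  'scalar';
    return $want ? ('a', 'b') : 'a';
}

my $scalar = foo();   # $last_context is now 'scalar'
my @list   = foo();   # $last_context is now 'list'
foo();                # $last_context is now 'void'
```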

So within any method, I can figure out exactly what value to return to
best satisfy the user. Item 4 is tricky because these two calls look
identical to Perl:

my $dom_string  = $element->getAttribute('foo');
my $perl_string = $element->getAttribute('foo');

Even though the first should return a reference to an
XML::Xerces::DOMString object, and the second should return a vanilla
Perl string, there is no way to tell them apart. So I'll need more
help from the user.

I believe that it can be handled with a pragma similar to

  use utf8;

maybe 

  use utf16;

The user could then turn it on for different pieces of the
application by using code blocks:

  # now we get perl strings from all functions
  my $perl_string = $element->getAttribute('foo');
  $perl_string .= 'nothing up my sleeve';
  $element->setAttribute('foo', $perl_string);
  {
    use utf16;
    # now we get DOMStrings from all functions
    my $dom_string = $element->getAttribute('baz');
    $dom_string->appendData($perl_string);
    $element->setAttribute('baz', $dom_string);
  }
  # now we get perl strings from all functions
  my $perl_string2 = $element->getAttribute('bar');

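For what it is worth, a lexically scoped pragma of this kind could be
sketched as below. This is only an assumption about how it might be
done: it relies on the %^H hint hash being visible through caller(),
which needs Perl 5.10 or later (not the 5.6.0 baseline), the hint key
'utf16/enabled' is made up, and get_value is a hypothetical stand-in
for a method such as getAttribute.

```perl
use strict;
use warnings;

# Pretend utf16.pm is already loaded so "use utf16" works in one file.
BEGIN { $INC{'utf16.pm'} = __FILE__ }

package utf16;

# "use utf16" and "no utf16" set a hint for the enclosing lexical scope.
sub import   { $^H{'utf16/enabled'} = 1 }
sub unimport { $^H{'utf16/enabled'} = 0 }

# True when the code $level frames up was compiled under "use utf16".
sub in_effect {
    my $level = shift || 0;
    my $hints = (caller($level))[10];
    return $hints && $hints->{'utf16/enabled'} ? 1 : 0;
}

package main;

# Hypothetical accessor: picks its return type from the caller's scope.
sub get_value {
    return utf16::in_effect(1) ? 'DOMString object' : 'perl string';
}

my $outside = get_value();
my $inside;
{
    use utf16;
    $inside = get_value();
}
my $after = get_value();
```

Here $outside and $after see the default behaviour, while $inside is
computed inside the block where the pragma is active.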
jas.





