On Wednesday 15 August 2001 12:28, KAVANAGH, Michael wrote:

> Is there any way to handle entity references in attribute values?
>
> My problem is I have tags in my xml that contain entity references like
> this: <section name="text &reg;"></section>
> and XML::Parser generates the following error:
> undefined entity at line 6, column 33, byte 218:

> So obviously &reg isn't in the internal set of references, and it is
> generating an error. I'd like to set up a handler for this condition,  
> rather than having the program die!

Entity references in attribute values are always a problem in XML::Parser.
You can trick the parser into thinking they are defined somewhere in an
external DTD by using a fake one, but then the entity disappears from the
attribute value.

The only way I found was to use my own parsing method on the string that the 
parser recognized as the entire tag.

Here is a script that shows a document that parses (in the __DATA__
section), what XML::Parser gives you and what a custom parse can get. Note
that until Unicode support for regexps is complete (perl 5.8) the custom
parser will likely fail for documents that include multi-byte characters.

One last comment: if you are dealing with XML with Perl you might want to
subscribe to the perl-XML mailing list, info at
http://listserv.ActiveState.com/mailman/listinfo/perl-xml

#!/bin/perl -w
use strict;

use XML::Parser 2.27;

my $p= new XML::Parser( Handlers => { Start => \&display_start_tag },
                        NoExpand => 1 );
$p->parse( \*DATA);

sub display_start_tag
  { my( $p, $tag, %atts)= @_;

    # what XML::Parser gives you
    print "recognized: ", $p->recognized_string(), "\n\n";
    print "tag: $tag\n";
    foreach my $att (keys %atts)
      { print "  $att: $atts{$att}\n"; }

    # what you can get 
    ($tag, %atts)= parse_start_tag(  $p->recognized_string());
    print "\ntag: $tag\n";
    foreach my $att (keys %atts)
      { print "  $att: $atts{$att}\n"; }

  }


sub parse_start_tag
  { my $string= shift;
    my( $gi, %atts);

    # get the gi (between < and the first space, / or > character)
    if( $string=~ s{^<\s*([^\s>/]*)[\s>/]*}{}s)
      { $gi= $1; }
    else
      { die "internal error when parsing start tag $string"; }
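    # get the attributes one at a time: name="value" or name='value'
    # (the value is kept as is, entity references included)
    while( $string=~ s{^([^\s=]*)\s*=\s*(["'])(.*?)\2\s*}{}s)
      { $atts{$1}= $3; }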
    return $gi, %atts;
  }


__DATA__
<?xml version="1.0"?>
<!DOCTYPE section SYSTEM "toto.dtd">
<section name="text &reg;"></section>
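
Once you have the raw attribute value, with &reg; still in it, you can
resolve the references yourself instead of letting the parser die. What
follows is only a sketch of that idea and is not part of the script above:
%entity, %predefined, expand_entities(), expand_ref() and the optional
callback for unknown entities are made-up names for illustration, and the
table holds a few sample entries that you would replace with your own.

```perl
use strict;
use warnings;

# hypothetical table of the entities your documents use (sample entries)
my %entity= ( reg  => "\x{AE}",   # registered sign
              copy => "\x{A9}",   # copyright sign
              nbsp => "\x{A0}",   # no-break space
            );

# the 5 entities predefined by XML
my %predefined= ( lt => '<', gt => '>', amp => '&', quot => '"', apos => "'");

# expand the references found in an attribute value; $on_unknown is an
# optional code ref called with the name of each unknown entity
sub expand_entities
  { my( $value, $on_unknown)= @_;
    # default: leave an unknown reference untouched instead of dying
    $on_unknown ||= sub { return "&$_[0];" };
    # single pass: the text produced by an expansion is not scanned again
    $value=~ s{&(#x[0-9a-fA-F]+|#[0-9]+|[A-Za-z_:][\w.:-]*);}
              { expand_ref( $1, $on_unknown) }ge;
    return $value;
  }

# expand one reference (the text between & and ;)
sub expand_ref
  { my( $ref, $on_unknown)= @_;
    if( $ref=~ m{^#x([0-9a-fA-F]+)$}) { return chr( hex( $1));   }  # &#xAE;
    if( $ref=~ m{^#([0-9]+)$})        { return chr( $1);         }  # &#174;
    if( exists $predefined{$ref})     { return $predefined{$ref}; }
    if( exists $entity{$ref})         { return $entity{$ref};     }
    return $on_unknown->( $ref);
  }

print expand_entities( 'a &lt; b &amp; c'), "\n";   # prints: a < b & c
```

Passing a code ref as second argument, for instance
sub { warn "undefined entity $_[0]\n"; return '' }, gives you a handler for
the undefined entity condition: you decide whether to warn, substitute
something or die.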
