Hi Perlers,


I'm trying to do the following:

1- take an XML file
2- in one script, replace every character above Unicode #x7F (the end of ASCII) with a reference, which is either a "special" named entity reference like &auml; or a numeric reference based on the Unicode code point like &#x00AE;
3- then in another script, do some more transformations using XML::DOM and
4- print out resulting XML
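
To make step 2 concrete, here is a minimal sketch of the kind of replacement I mean. It is not my actual script: the helper name escape_non_ascii and the one-entry %NAMED table are made up for illustration, and it assumes the text has already been decoded into a Perl character string.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative, incomplete map from code point to "special" entity name.
# (Hypothetical: a real table would need one entry per named entity in the DTD.)
my %NAMED = (
    0xE4 => 'auml',    # LATIN SMALL LETTER A WITH DIAERESIS
);

# Replace every character above U+007F with a reference: a named one if
# the code point is in %NAMED, otherwise a hexadecimal numeric one.
sub escape_non_ascii {
    my ($text) = @_;    # expects decoded characters, not raw UTF-8 bytes
    $text =~ s{([^\x00-\x7F])}{
        my $cp = ord $1;
        exists $NAMED{$cp} ? "&$NAMED{$cp};" : sprintf('&#x%04X;', $cp);
    }gex;
    return $text;
}

# prints "&auml; NetMachanic&#x00AE;technical"
print escape_non_ascii("\x{E4} NetMachanic\x{AE}technical"), "\n";
```

The output of this step is what the second script (steps 3 and 4) then reads.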


My problem is that in the third step, while parsing its input, XML::Parser seems to resolve the references that contain the hexadecimal Unicode code point (such as &#x00AE;); the "special name" references (such as &auml;) are not resolved.


My input looks somewhat like this:


        <?xml version="1.0" encoding="utf-8"?>
        <!DOCTYPE TEI.2 SYSTEM "E:/TEI.dtd">
        <TEI.2>
                <w:t>
                &auml; NetMachanic&#x00AE;technical evaluation
                </w:t>
                <w:t>
                &#x00E2;and LinkPopularity are two tools for organisation.
                </w:t>
                <w:t> &#x00E2;&#x00E2;&#x00E2;&#x00E2; </w:t>
                <w:t> &#x00AE;&#x00AE;&#x00AE;&#x00AE; </w:t>
        </TEI.2>



I tried the option NoExpand and also implemented a default handler, which "will be called when an entity reference is seen in text" (http://www.socsci.umn.edu/ssrf/doc/xml/enno-xml-docs/users.erols.com/enno/xml/XML/Parser/Expat.html),
so I have:

--------------------

#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;

my $infile = "INFILE.xml";
my $dom_parser = XML::DOM::Parser->new(
                        NoExpand => 1,
                        Handlers => {
                                Default=>\&handle_default,
                                Char=>\&handle_char,
                        });

my $TREE = $dom_parser->parsefile($infile);

# here transform $TREE with XML::DOM

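# write the (transformed) tree back out; report the OS error on failure
open my $out, '>', 'OUTFILE.xml' or die "cannot write to OUTFILE.xml: $!";
print {$out} $TREE->toString();
close $out or die "cannot close OUTFILE.xml: $!";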



sub handle_char {

        my ($parser, $string) = @_;
        my $rec = $parser->recognized_string();
        my $esc = $parser->xml_escape($rec);

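        # log the resolved string, the raw recognized string, and its escaped form
        open my $log, '>>', 'log.txt' or die "cannot append to log.txt: $!";
        print {$log} "\n--\ncall of handle_char()\n";
        print {$log} "[$string||$rec//$esc]\n";
        close $log;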
}

sub handle_default {

        my ($parser, $string) = @_;
        my $rec = $parser->recognized_string();
        my $esc = $parser->xml_escape($rec);

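        # same logging as in handle_char(), to compare which handler fires
        open my $log, '>>', 'log.txt' or die "cannot append to log.txt: $!";
        print {$log} "\n--\ncall of handle_default()\n";
        print {$log} "[$string||$rec//$esc]\n";
        close $log;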
}


--------------------

Now, my problems:

First, handle_default() is not called for the references &#x00AE; and &#x00E2;, but only for &auml;.
&#x00AE; and &#x00E2; trigger handle_char() instead.

Second, the NoExpand option does not do what I thought it would, namely leave the references unexpanded.

Finally, in handle_char() the unresolved string is available in $rec and $esc, and the resolved one in $string. But how can I get the unresolved string into $TREE? All the textbook examples of handlers I have seen just print out some message.


Another strange thing occurs in the last two <w:t> elements:
the first contains four references to LATIN SMALL LETTER A WITH CIRCUMFLEX (&#x00E2;); the second contains four references to the REGISTERED SIGN (&#x00AE;). What I get (when I don't set the Default and Char handlers) is:
<t> &#14467;â </t> for the first and
<t> ®®®® </t> (four ® characters) for the second.
In the first case, resolving the reference &#x00E2; seems to "eat" some of the following characters (this also occurs when it is followed by normal character text).


Could anyone please give advice? Thanks,

Alois


