Hello,

I have a problem that appears to be an encoding issue, but I can't figure out where it comes from. Could someone please help?

I'm reading from an XML file that contains the line

[1]     ...Bergson referred as "durée"; the way...

Then I parse the file with XML::DOM::Parser and print it out again.
The line now becomes:

[2]     ...Bergson referred as "dur㩥; the way...


Where can this possibly come from? Does "standard" reading and printing not produce UTF-8? And does XML::DOM::Parser not read input as UTF-8? So, when I print it out, should it not be UTF-8 again?

The file containing the first line was written like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use encoding 'utf-8';

        my $infile = "file1.xml";
        open IN, "$infile" or die "\ncannot read specified infile\n";
        my $text = join "", <IN>;
        close IN;

        # some processing...

        my $outfile = "file2.xml";
        open OUT, ">:encoding(utf-8)", $outfile or die "cannot create out file";
        print OUT $text;
        close OUT;

        # alternatively I tried:
        # open IN, "<:encoding(utf-8)", "$infile"; # and
        # open OUT, ">$outfile" or die "cannot create out file";
        # respectively. It makes no difference.


The second script reads/writes like this:

        #!/usr/bin/perl
        use strict;
        use XML::DOM;
        use warnings;

        my $infile = "file2.xml";
        my $dom_parser = new XML::DOM::Parser();
        my $TREE = $dom_parser->parsefile($infile);

        open OUT, ">file3.xml" or die "could not open log file";
        print OUT $TREE->toString();
        close OUT;
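
For reference, here is a minimal sketch (not part of either script above) of how the bytes of "é" differ between Latin-1, UTF-8, and doubly encoded UTF-8. It assumes, without having confirmed it, that a mismatch between the read and write layers is involved; the helper name hex_bytes is made up for illustration. Comparing these byte patterns against a hex dump of file1.xml, file2.xml and file3.xml may show at which step the character changes.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the bytes of a string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

my $char = "\x{e9}";    # the character "é" (U+00E9)

my $latin1 = encode('ISO-8859-1', $char);
my $utf8   = encode('UTF-8', $char);

# Double encoding: UTF-8 bytes that were read without a decoding layer are
# treated as two Latin-1 characters and then encoded to UTF-8 a second time.
my $double = encode('UTF-8', $utf8);

print "latin-1:      ", hex_bytes($latin1), "\n";    # e9
print "utf-8:        ", hex_bytes($utf8),   "\n";    # c3 a9
print "double utf-8: ", hex_bytes($double), "\n";    # c3 83 c2 a9
```

If file2.xml shows "c3 83 c2 a9" where file1.xml shows "c3 a9", the first script would be the place to look; if file2.xml is still correct, the second script would be.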


Thanks for any comments!

Alois Heuboeck

