unwanted escaping of & with XML::DOM

2008-02-09 Thread Alois Heuboeck


Hi

I'm trying to feed text into an existing XML tree - the problem I'm
encountering is that the text may contain entity references (including
the 'forbidden' '&'), in which case the & is escaped as '&amp;'. I'm
using the module XML::DOM for this.



Here's an example of an empty tree (the file 0061a.xml):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 SYSTEM "tei_bawe.dtd">
<TEI.2>
<teiHeader>
<titleStmt>
<title/>
</titleStmt>
</teiHeader>
<text>
</text>
</TEI.2>


---

Here's my script:

#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;


my $titleText = "Die Br&uuml;cke.";
my $infile = "0061a.xml";

my $dom_parser = new XML::DOM::Parser;
my $TREE = $dom_parser->parsefile($infile) or die "\ncannot parse file input [$infile]\n";
$TREE->normalize();

my $root = $TREE->getDocumentElement();
my $title = ${$root->getElementsByTagName("title", 1)}[0];

$title->addText($titleText);
print "$titleText\n"; # for testing: Die Br&uuml;cke.
print $title->toString(); # for testing: <title>Die Br&amp;uuml;cke.</title>

open OUT, ">0061a.out.xml" or die "cannot write to OUT: $!";
print OUT $TREE->toString();


---

The output file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE TEI.2 SYSTEM "tei_bawe.dtd">
<TEI.2>
<teiHeader>
<titleStmt>
<title>Die Br&amp;uuml;cke.</title>
</titleStmt>
</teiHeader>
<text>
</text>
</TEI.2>


---

- whereas I'd like to get
<title>Die Br&uuml;cke.</title>


Thanks for any suggestions!

Alois


--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/




Re: unwanted escaping of & with XML::DOM

2008-02-09 Thread Alois Heuboeck



Gunnar Hjalmarsson wrote:
> Alois Heuboeck wrote:
>> I'm trying to feed text into an existing XML tree - the problem I'm
>> encountering is that the text may contain entity references (including
>> the 'forbidden' '&'), in which case the & is escaped as '&amp;'. I'm
>> using the module XML::DOM for this.
>
> <snip>
>
>> my $titleText = "Die Br&uuml;cke.";
>
> <snip>
>
>> $title->addText($titleText);
>> print "$titleText\n"; # for testing: Die Br&uuml;cke.
>> print $title->toString(); # for testing: <title>Die Br&amp;uuml;cke.</title>
>
> What if you simply say:
>
>     my $titleText = 'Die Brücke.';


you mean resolving the entities first?

It's a possibility, but at the moment I'd like to see the form of the 
input as a given. (The script I posted to the list is a simplified test 
version - in the 'real' script, the text comes from an external file.)


Is there no 'standard' way around this problem?
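
One possible middle way, sketched here under assumptions rather than offered as the standard answer: leave the input as it is, but split it into plain text and entity-reference names before touching the tree. The helper name split_entities and the simple name pattern are hypothetical; only core Perl is used.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into plain-text pieces and
# entity-reference names. Returns a list of [type, value] pairs,
# where type is 'text' or 'entity'.
sub split_entities {
    my ($string) = @_;
    my @pieces;
    # The capturing group makes split() return the entity names too,
    # so names always land at the odd indices of @parts.
    my @parts = split /&([A-Za-z][A-Za-z0-9]*);/, $string;
    for my $i (0 .. $#parts) {
        next if $parts[$i] eq '';    # empty text between adjacent references
        push @pieces, [ $i % 2 ? 'entity' : 'text', $parts[$i] ];
    }
    return @pieces;
}

for my $piece (split_entities('Die Br&uuml;cke.')) {
    print "$piece->[0]: $piece->[1]\n";
}
```

Each 'text' piece could then be appended as an ordinary text node and each 'entity' piece as an entity-reference node (XML::DOM documents offer createTextNode and createEntityReference for this, as far as I know), so that no bare '&' ever reaches addText(). Whether the references then serialize unchanged would need checking against the installed XML::DOM version.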

Alois


I can't help thinking of my latest message to this list: 
http://www.mail-archive.com/beginners%40perl.org/msg91979.html








Re: Replace only once.

2005-12-02 Thread Alois Heuboeck




> On Fri, Dec 02, 2005 at 10:22:47AM +, Mads N. Vestergaard wrote:
>
>> I have a script where I need to replace 45 in the beginning, with nothing
>> in a variable
>>
>> It looks like this:
>>
>> #!/usr/bin/perl
>>
>> $modtager = 45247;
>>
>> $modtager =~ s/45//;
>>
>> Then $modtager is 247, but if for instance the number is 4545247, it should
>> return 45247, how do I do this ?
>
> What is wrong with what you have?  If it is not doing what you want you
> will have to explain in more detail what you want and what you are
> getting.



If you want to make the replacement only right at the beginning of the
string, anchor the pattern with ^:


    $modtager =~ s/^45//;
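
A small sketch of the difference the anchor makes; the helper name strip_prefix and the test values other than 4545247 are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: remove one leading "45", and only if the
# string actually starts with it.
sub strip_prefix {
    my ($number) = @_;
    $number =~ s/^45//;    # ^ anchors the match at the start of the string
    return $number;
}

print strip_prefix('4545247'), "\n";    # only the first 45 is removed
print strip_prefix('12345'), "\n";      # unchanged: 45 is not at the start

# For comparison, the unanchored version removes the first 45 found anywhere.
(my $unanchored = '12345') =~ s/45//;
print "$unanchored\n";
```

Without the /g modifier, s/// replaces at most one occurrence in either form; the anchor only restricts where that one occurrence may be.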

alois





Re: Skipping blank lines while reading a flat file.

2005-10-07 Thread Alois Heuboeck

Dave,

(this is one I know :-) )



> I want to skip the blank lines and just print the lines with text, like this
> this
> is
> myfile
>
> This is my test case code.
> #!/usr/bin/perl -w
>
> use strict;
>
> my $opt_testfile="test-text.txt";
> open (TS, $opt_testfile) or die "can't open file";
> while (<TS>) {
> chomp;
> next if ($_ =~ /^\s+/);


    next if ($_ =~ /^\s+/);

This skips lines that BEGIN with whitespace, whatever follows it.
The regex you want is
    /^\s*$/
which matches only lines that are empty or consist entirely of whitespace.
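
Put together, the loop might look like the sketch below. It reads from an in-memory string instead of test-text.txt so that it runs on its own; the sample lines are made up.

```perl
use strict;
use warnings;

# In-memory stand-in for the input file (sample content is invented).
my $sample = "this\n\n   \nis\n\t\nmyfile\n";
open my $fh, '<', \$sample or die "can't open in-memory file: $!";

my @kept;
while (my $line = <$fh>) {
    chomp $line;
    next if $line =~ /^\s*$/;    # skip empty and whitespace-only lines
    push @kept, $line;
}
close $fh;

print "$_\n" for @kept;
```

Lines that merely start with whitespace but contain text are kept, which is the case the original /^\s+/ test got wrong.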

HTH
Alois





encoding - utf-8

2005-09-19 Thread Alois Heuboeck


Hello,

I have a problem, apparently on an encoding issue, but can't figure out 
where it comes from. Could someone please help?


I'm reading from an XML file that contains the line

[1] ...Bergson referred as durée; the way...

Then I parse the file with XML::DOM::Parser and print it out again.
The line now becomes:

[2] ...Bergson referred as dur&#14949;; the way...


Where can this possibly come from? Does standard reading and printing 
not produce UTF-8? And does XML::DOM::Parser not read input as UTF-8? 
So, when I print it out, should it not be UTF-8 again?


The file containing the first line was written like this:

#!/usr/bin/perl
use strict;
use warnings;
use encoding 'utf-8';

my $infile = "file1.xml";
open IN, $infile or die "\ncannot read specified infile\n";
my $text = join "", <IN>;
close IN;

# some processing...

my $outfile = "file2.xml";
open OUT, ">:encoding(utf-8)", $outfile or die "cannot create out file";
print OUT $text;
close OUT;

# alternatively I tried:
# open IN, "<:encoding(utf-8)", $infile; # and
# open OUT, ">$outfile" or die "cannot create out file";
# respectively. It makes no difference.


The second script reads/writes like this:

#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;

my $infile = "file2.xml";
my $dom_parser = new XML::DOM::Parser();
my $TREE = $dom_parser->parsefile($infile);

open OUT, ">file3.xml" or die "could not open log file";
print OUT $TREE->toString();
close OUT;


Thanks for any comments!

Alois Heuboeck






keep entity references while parsing with XML::Parser

2005-09-15 Thread Alois Heuboeck


Hi Perlers,


I'm trying to do the following:

1- take an XML file
2- in one script, replace everything above Unicode #x7F (end of ASCII)
with entity references (which can either have special names, like
&auml;, or be based on the Unicode number, like &#x00AE;)

3- then in another script, do some more transformations using XML::DOM and
4- print out resulting XML
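
Step 2 could be sketched roughly as follows. This is only an illustration, not the script actually used: the helper name to_char_refs is hypothetical, only the numeric form of the references is produced, and the input is assumed to be a decoded character string rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F with a
# four-digit (or longer) hexadecimal character reference.
sub to_char_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord $1)/ge;
    return $text;
}

# \x{AE} is the registered sign, \x{E2} is a with circumflex.
print to_char_refs("NetMachanic\x{AE} and \x{E2}"), "\n";
```

Named references such as &auml; would need an additional lookup table from code point to name, which is left out here.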


My problem is that in the third step, when parsing its input, the
XML::Parser seems to resolve those references that contain the hex
Unicode number; the references with special names are not resolved.



My input looks somewhat like this:


<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE TEI.2 SYSTEM "E:/TEI.dtd">
<TEI.2>
<w:t>
&auml; NetMachanic&#x00AE;technical evaluation
</w:t>
<w:t>
&#x00E2;and LinkPopularity are two tools for organisation.
</w:t>
<w:t> &#x00E2;&#x00E2;&#x00E2;&#x00E2; </w:t>
<w:t> &#x00AE;&#x00AE;&#x00AE;&#x00AE; </w:t>
</TEI.2>



I tried the option NoExpand and also implemented a default handler,
which will be called when an entity reference is seen in text
(http://www.socsci.umn.edu/ssrf/doc/xml/enno-xml-docs/users.erols.com/enno/xml/XML/Parser/Expat.html),
so I have:



#!/usr/bin/perl
use strict;
use XML::DOM;
use warnings;

my $infile = "INFILE.xml";
my $dom_parser = new XML::DOM::Parser(
    NoExpand => 1,
    Handlers => {
        Default => \&handle_default,
        Char    => \&handle_char,
    });

my $TREE = $dom_parser->parsefile($infile);

# here transform $TREE with XML::DOM

open OUT, ">OUTFILE.xml" or die "cannot write to OUT file";
print OUT $TREE->toString();
close OUT;



sub handle_char {

    my ($parser, $string) = @_;
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);

    open LOG, ">>log.txt";
    print LOG "\n--\ncall of handle_char()\n";
    print LOG "[$string||$rec//$esc]\n";
}

sub handle_default {

    my ($parser, $string) = @_;
    my $rec = $parser->recognized_string();
    my $esc = $parser->xml_escape($rec);

    open LOG, ">>log.txt";
    print LOG "\n--\ncall of handle_default()\n";
    print LOG "[$string||$rec//$esc]\n";
}




Now, my problems:

First, handle_default() is not called for the entity references &#x00AE;
and &#x00E2; but only for &auml;

&#x00AE; and &#x00E2; trigger handle_char() instead.

Second, the NoExpand option does not do what I thought it would, namely
leave the entity references unexpanded.


Finally, the unresolved string in handle_char() can be seen in $rec and
$esc; the resolved one is in $string.
But how can I get this out to $TREE? All the textbook examples of
handlers I saw just printed out some message.



Another strange thing occurs in the last two <w:t> elements:
the first contains four references to small letter a with circumflex; the
second one four references to the REGISTERED TRADEMARK SIGN. What I get
(when I don't set the Default and Char handlers) is:

<t> &#14467;â </t> for the first and
<t>  </t> four (R) for the second

In the first case, resolving the reference &#x00E2; seems to eat some
of the following characters (this also occurs when it is followed by normal
character text).



Could anyone please give advice? Thanks,

Alois


