unwanted escaping of & with XML::DOM
Hi

I'm trying to feed text into an existing XML tree - the problem I'm encountering is that the text may contain entity references (including the 'forbidden' '&'), in which case the & is escaped by '&amp;'. I'm using the module XML::DOM for this.

Here's an example of an empty tree (the file 0061a.xml):

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE TEI.2 SYSTEM "tei_bawe.dtd">
    <TEI.2>
    <teiHeader>
    <titleStmt>
    <title/>
    </titleStmt>
    </teiHeader>
    <text>
    </text>
    </TEI.2>

---
Here's my script:

    #!/usr/bin/perl
    use strict;
    use XML::DOM;
    use warnings;

    my $titleText = "Die Br&uuml;cke.";
    my $infile = "0061a.xml";
    my $dom_parser = new XML::DOM::Parser;
    my $TREE = $dom_parser->parsefile($infile) or die "\ncannot parse file input [$infile]\n";
    $TREE->normalize();
    my $root = $TREE->getDocumentElement();
    my $title = ${$root->getElementsByTagName("title", 1)}[0];
    $title->addText($titleText);
    print "$titleText\n";      # for testing: Die Br&uuml;cke.
    print $title->toString();  # for testing: <title>Die Br&amp;uuml;cke.</title>
    open OUT, ">0061a.out.xml" or die "cannot write to OUT: $!";
    print OUT $TREE->toString();

---
The output file looks like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE TEI.2 SYSTEM "tei_bawe.dtd">
    <TEI.2>
    <teiHeader>
    <titleStmt>
    <title>Die Br&amp;uuml;cke.</title>
    </titleStmt>
    </teiHeader>
    <text>
    </text>
    </TEI.2>

---
- whereas I'd like to get

    <title>Die Br&uuml;cke.</title>

Thanks for any suggestions!

Alois

--
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
http://learn.perl.org/
Re: unwanted escaping of & with XML::DOM
Gunnar Hjalmarsson wrote:
> Alois Heuboeck wrote:
>> I'm trying to feed text into an existing XML tree - the problem I'm
>> encountering is that the text may contain entity references (including
>> the 'forbidden' '&'), in which case the & is escaped by '&amp;'. I'm
>> using the module XML::DOM for this.
>
> <snip>
>
>>     my $titleText = "Die Br&uuml;cke.";
>
> <snip>
>
>>     $title->addText($titleText);
>>     print "$titleText\n";      # for testing: Die Br&uuml;cke.
>>     print $title->toString();  # for testing: <title>Die Br&amp;uuml;cke.</title>
>
> What if you simply say:
>
>     my $titleText = 'Die Brücke.';

you mean resolving the entities first? It's a possibility, but at the moment I'd like to see the form of the input as a given. (The script I posted to the list is a simplified test version - in the 'real' script, the text comes from an external file.)

Is there no 'standard' way around this problem?

Alois

I can't help thinking of my latest message to this list:
http://www.mail-archive.com/beginners%40perl.org/msg91979.html
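One possible direction for keeping the input form as given is to split the incoming text into literal text and entity references before it goes into the tree. The sketch below uses only core Perl; split_entities() is a hypothetical helper, not part of XML::DOM. The comment about createTextNode() and createEntityReference() names the DOM Level 1 document methods that XML::DOM is assumed to provide; whether toString() then writes the reference back unescaped was not tested here.

```perl
use strict;
use warnings;

# Hypothetical helper: split a string into literal text and entity references.
sub split_entities {
    my ($input) = @_;
    my @parts;
    # The capturing group makes split() return the separators as well.
    for my $piece (split /(&#?\w+;)/, $input) {
        next if $piece eq '';
        if ($piece =~ /^&(#?\w+);$/) {
            # Assumption: this name would go to $doc->createEntityReference($1)
            push @parts, { type => 'entity', value => $1 };
        }
        else {
            # Assumption: this text would go to $doc->createTextNode($piece)
            push @parts, { type => 'text', value => $piece };
        }
    }
    return @parts;
}

my @parts = split_entities('Die Br&uuml;cke.');
print "$_->{type}: $_->{value}\n" for @parts;
```

Each resulting node would then be appended to the title element in order, instead of passing the whole string to addText().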
Re: Replace only once.
> On Fri, Dec 02, 2005 at 10:22:47AM +, Mads N. Vestergaard wrote:
>> I have a script where I need to replace 45 in the beginning, with
>> nothing in a variable
>>
>> It looks like this:
>>
>>     #!/usr/bin/perl
>>     $modtager = 45247;
>>     $modtager =~s/45//;
>>
>> Then $modtager is 247, but if for instance the number is 4545247, it
>> should return 45247, how do I do this ?
>
> What is wrong with what you have? If it is not doing what you want you
> will have to explain in more detail what you want and what you are
> getting.

if you wanted to make the replacement only right at the beginning of the string, you would use:

    $modtager =~s/^45//;

alois
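A small sketch of the difference between the two substitutions discussed above. The helper names strip_first_45() and strip_leading_45() are made up for illustration. Without the /g modifier, s/45// already replaces only the first match, so the two forms differ only when the first 45 is not at the start of the string.

```perl
use strict;
use warnings;

# Unanchored: removes the first "45" found anywhere in the string.
sub strip_first_45 {
    my ($number) = @_;
    (my $copy = $number) =~ s/45//;
    return $copy;
}

# Anchored with ^: removes "45" only if the string starts with it.
sub strip_leading_45 {
    my ($number) = @_;
    (my $copy = $number) =~ s/^45//;
    return $copy;
}

for my $number ('45247', '4545247', '12345') {
    printf "%-8s first: %-6s leading: %s\n",
        $number, strip_first_45($number), strip_leading_45($number);
}
```

For 4545247 both forms give 45247; for 12345 the unanchored form gives 123 while the anchored form leaves the string unchanged.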
Re: Skipping blank lines while reading a flat file.
Dave,

(this is one I know :-) )

> I want to skip the blank lines and just print the lines with text,
> like this
>
> this is myfile
>
> This is my test case code.
>
>     #!/usr/bin/perl -w
>     use strict;
>     my $opt_testfile="test-text.txt";
>     open (TS, $opt_testfile) or die "can't open file";
>     while (<TS>) {
>         chomp;
>         next if ($_ =~ /^\s+/);

    next if ($_ =~ /^\s+/);

You skip lines that BEGIN with a space. The regex you want is

    /^\s*$/

HTH
Alois
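A minimal sketch of the suggested pattern, using a made-up in-memory list in place of test-text.txt so that it runs without any file. It shows that /^\s*$/ drops empty and whitespace-only lines but keeps an indented line that has text, which /^\s+/ would have skipped.

```perl
use strict;
use warnings;

# Sample input standing in for the lines of test-text.txt (made-up data).
my @lines = ("this\n", "\n", "   \n", "  is\n", "\t\n", "myfile\n");

my @kept;
for my $line (@lines) {
    chomp(my $text = $line);
    # /^\s*$/ matches only lines that are empty or entirely whitespace.
    next if $text =~ /^\s*$/;
    push @kept, $text;
}

print "$_\n" for @kept;
```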
encoding - utf-8
Hello,

I have a problem, apparently an encoding issue, but I can't figure out where it comes from. Could someone please help?

I'm reading from an XML file that contains the line

[1] ...Bergson referred as durée; the way...

Then I parse the file with XML::DOM::Parser and print it out again. The line now becomes:

[2] ...Bergson referred as dur&#14949;; the way...

Where can this possibly come from? Does standard reading and printing not produce UTF-8? And does XML::DOM::Parser not read input as UTF-8? So, when I print it out, should it not be UTF-8 again?

The file containing the first line was written like this:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use encoding 'utf-8';

    my $infile = "file1.xml";
    open IN, $infile or die "\ncannot read specified infile\n";
    my $text = join "", <IN>;
    close IN;

    # some processing...

    my $outfile = "file2.xml";
    open OUT, ">:encoding(utf-8)", $outfile or die "cannot create out file";
    print OUT $text;
    close OUT;

    # alternatively I tried:
    # open IN, "<:encoding(utf-8)", $infile;
    # and
    # open OUT, ">$outfile" or die "cannot create out file";
    # respectively. It makes no difference.

The second script reads/writes like this:

    #!/usr/bin/perl
    use strict;
    use XML::DOM;
    use warnings;

    my $infile = "file2.xml";
    my $dom_parser = new XML::DOM::Parser();
    my $TREE = $dom_parser->parsefile($infile);
    open OUT, ">file3.xml" or die "could not open log file";
    print OUT $TREE->toString();
    close OUT;

Thanks for any comments!

Alois Heuboeck
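For reference, a core-only sketch of a UTF-8 round trip with explicit decoding and an explicit output layer. An in-memory filehandle stands in for file2.xml. The last step illustrates what encoding already-encoded bytes a second time looks like in general; it is not a diagnosis of the scripts above, whose actual cause is not established in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "durée" as raw UTF-8 bytes, the way they would sit in a file on disk.
my $bytes = "dur\xC3\xA9e";

# Decode the bytes into a Perl character string (5 characters).
my $chars = decode('UTF-8', $bytes);

# Write the character string through an explicit UTF-8 output layer
# into an in-memory file (stands in for file2.xml).
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "cannot open in-memory file: $!";
print {$out} $chars;
close $out;

# Encoding the raw bytes again treats each byte as a character and
# produces a longer, double-encoded byte string.
my $double = encode('UTF-8', $bytes);

printf "characters: %d, bytes written: %d, double-encoded bytes: %d\n",
    length($chars), length($buffer), length($double);
```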
keep entity references while parsing with XML::Parser
Hi Perlers,

I'm trying to do the following:

1- take an XML file
2- in one script, replace everything above Unicode #x7F (end of ASCII) with entity references (which can either have special names, like &auml; or be based on the Unicode nb. like &#x00AE;)
3- then in another script, do some more transformations using XML::DOM and
4- print out resulting XML

My problem is that in the third step, when parsing its input, the XML::Parser seems to resolve those references that contain the HEX Unicode nb.; the special name references are not resolved.

My input looks somewhat like this:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE TEI.2 SYSTEM "E:/TEI.dtd">
    <TEI.2>
    <w:t> &auml; NetMachanic&#x00AE;technical evaluation </w:t>
    <w:t> &#x00E2;and LinkPopularity are two tools for organisation. </w:t>
    <w:t> &#x00E2;&#x00E2;&#x00E2;&#x00E2; </w:t>
    <w:t> &#x00AE;&#x00AE;&#x00AE;&#x00AE; </w:t>
    </TEI.2>

I tried the option NoExpand and also implemented a default handler, which will be called when an entity reference is seen in text (http://www.socsci.umn.edu/ssrf/doc/xml/enno-xml-docs/users.erols.com/enno/xml/XML/Parser/Expat.html), so I have:

    #!/usr/bin/perl
    use strict;
    use XML::DOM;
    use warnings;

    my $infile = "INFILE.xml";
    my $dom_parser = new XML::DOM::Parser(
        NoExpand => 1,
        Handlers => {
            Default => \&handle_default,
            Char    => \&handle_char,
        });
    my $TREE = $dom_parser->parsefile($infile);

    # here transform $TREE with XML::DOM

    open OUT, ">OUTFILE.xml" or die "cannot write to OUT file";
    print OUT $TREE->toString();
    close OUT;

    sub handle_char {
        my ($parser, $string) = @_;
        my $rec = $parser->recognized_string();
        my $esc = $parser->xml_escape($rec);
        open LOG, ">>log.txt";
        print LOG "\n--\ncall of handle_char()\n";
        print LOG "[$string||$rec//$esc]\n";
    }

    sub handle_default {
        my ($parser, $string) = @_;
        my $rec = $parser->recognized_string();
        my $esc = $parser->xml_escape($rec);
        open LOG, ">>log.txt";
        print LOG "\n--\ncall of handle_default()\n";
        print LOG "[$string||$rec//$esc]\n";
    }

Now, my problems:

First, handle_default() is not called for the entity references &#x00AE; and &#x00E2; but only for &auml;. &#x00AE; and &#x00E2; trigger handle_char() instead.

Second, the NoExpand option does not do what I thought it would, namely not expand the entity references.

Finally, the unresolved string in handle_char() can be seen in $rec and $esc; the resolved one is in $string. But how can I get this out to $TREE? All the textbook examples of handlers I saw just printed out some message.

Another strange thing occurs in the last two <w:t> elements: the first contains four references to small letter a with circumflex; the second one four references to the REGISTERED TRADEMARK SIGN. What I get (when I don't set the Default and Char handlers) is:

    <t> &#14467;â </t>

for the first and

    <t> </t>

four (R) for the second.

In the first case, resolving the reference &#x00E2; seems to eat some of the following characters (this also occurs when it is followed by normal character text).

Could anyone please give advice?

Thanks,
Alois
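Step 2 of the list above (replacing everything above #x7F with numeric references) could be sketched roughly as follows. This is not the poster's script: to_numeric_refs() is a hypothetical helper, the sample string is made up, and the sketch assumes the input has already been decoded into a Perl character string. It produces only hexadecimal references of the &#x00AE; form, not named ones such as &auml;.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F with a
# hexadecimal numeric character reference, padded to at least 4 digits.
sub to_numeric_refs {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#x%04X;', ord($1))/ge;
    return $text;
}

# Made-up sample containing U+00AE and U+00E9.
my $sample    = "NetMechanic\x{AE} caf\x{E9}";
my $converted = to_numeric_refs($sample);

print "$converted\n";   # NetMechanic&#x00AE; caf&#x00E9;
```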