From: Ben Siders <[EMAIL PROTECTED]>
> I've got a real easy one here (in theory). I have some XML files that
> were generated by a program, but generated imperfectly. There's some
> naked ampersands that need to be converted to &amp;. I need a regexp
> that will detect them and change them. Sounds easy enough.
>
> The pattern I want to match is an ampersand that is NOT immediately
> followed by a few characters and then a semicolon. Any ideas?
>
> This is the best I've come up with so far. It should match an
> ampersand whose following characters, up to five, are not semicolons.
> I don't feel that this is a great solution. I'm hoping the community
> can think of a better one.
>
> $line =~ s/\&[^;]{,5}/\&amp;/g;
>
> I'm hoping that'll match something like: "<tag>Blah data &</tag>",
> but NOT match "<tag>Blah &amp;</tag>".
>
> I'm not sure if I'm on the right track here. I also can't match other
> escaped characters such as: "<tag>Copyright &copy; 2003</tag>".

For something similar I use this (I have it inside a module):

use HTML::Entities;

# Escape &, < and > everywhere except inside existing entities and tags.
# $AllowXHTML is assumed to be defined elsewhere in the module.
sub PolishHTML {
    my $str = shift;
    if ($AllowXHTML) {
        # Also accept self-closing tags such as <br />.
        $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^"'><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*/?>|</\w[\w\d]*>|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
    } else {
        $str =~ s{(.*?)(&\w+;|&#\d+;|<\w[\w\d]*(?:\s+\w[\w\d]*(?:\s*=\s*(?:[^"'><\s]+|(?:'[^']*')+|(?:"[^"]*")+))?)*\s*>|</\w[\w\d]*>|$)}
                 {HTML::Entities::encode($1, '^\r\n\t !\#\$%\"\'-;=?-~').$2}gem;
    }
    return $str;
}

It escapes the &, < and > characters that don't seem to belong to
HTML entities or tags.
If you use this on XML, you will want to set $AllowXHTML (or just
use the first branch).
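
The second argument to the encode call is the least obvious part, so here is a minimal sketch of what it does in isolation. It assumes HTML::Entities is installed; the variable names and the sample string are made up for illustration. The argument is a character-class body: with the leading ^ it lists the characters to leave alone, and everything else is encoded.

```perl
use strict;
use warnings;
use HTML::Entities ();   # CPAN module, not part of core Perl

# Characters listed after the leading ^ are kept as they are; anything
# else is encoded. The set covers tab, CR, LF and printable ASCII
# except & < and >.
my $keep = '^\r\n\t !\#\$%\"\'-;=?-~';

my $text = 'Fish & chips <b>';   # hypothetical sample input
print HTML::Entities::encode_entities($text, $keep), "\n";
# prints: Fish &amp; chips &lt;b&gt;
```

In PolishHTML this is applied only to $1, the text between recognised entities and tags, which is why existing markup survives untouched.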
If all you want is to process the ampersand you may want something
like this:
$line =~ s/&(?!\w+;|#\d+;)/&amp;/g;
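
A small sketch to try the lookahead on the strings from the question. The helper name escape_bare_amp and the last sample string are made up for illustration. The extra #x[0-9a-fA-F]+; alternative is an assumption on my part: XML also allows hexadecimal character references such as &#xA9;, which the pattern above would otherwise treat as a bare ampersand.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every & that does not start an entity
# or a character reference.
sub escape_bare_amp {
    my ($line) = @_;
    # (?!...) is a negative lookahead: the & matches only when it is NOT
    # followed by  name;  or  #digits;  or  #xhex;
    $line =~ s/&(?!\w+;|#\d+;|#x[0-9a-fA-F]+;)/&amp;/g;
    return $line;
}

for my $in ('<tag>Blah data &</tag>',
            '<tag>Blah &amp;</tag>',
            '<tag>Copyright &copy; 2003</tag>',
            '<tag>A &#169; B &#xA9; C & D</tag>') {
    print escape_bare_amp($in), "\n";
}
# prints:
# <tag>Blah data &amp;</tag>
# <tag>Blah &amp;</tag>
# <tag>Copyright &copy; 2003</tag>
# <tag>A &#169; B &#xA9; C &amp; D</tag>
```

Note that \w+; accepts any name, so text like "AT&T;" would be left alone even though it is not a real entity.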
Jenda
===== [EMAIL PROTECTED] === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery