On Thursday, Mar 20, 2003, at 11:26 US/Pacific, Kipp, James wrote:
I'm saying it could be bgcolor="COLOR" or bgcolor=COLOR
Yes I realize. I believe drieux's solution, or an adaptation of it, is what you need
note: I do subs because it is easier for me to 'loop on them' and if they are worth it, they get stuffed in a perl module somewhere...
[..]
#------------------------
#
sub un_colour {
    my ($line) = @_;
    $line =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi ;
    $line;
} # end of un_colour
the usage would be
my $new_html_text = un_colour($html_text);
Or you could just use the line itself.
If it helps to break out the sequence
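s/\s*         # zero or more white space before
  bgcolor=    # the specific text
  ("?)        # first conditional group - an optional "
  ([^">\s]+)  # middle group - one or more non-delimiter characters
  ("?)        # third conditional group - an optional "
//gix
(the x flag is only there so that perl accepts the layout and the comments)
since the middle element needs to guard against
a. "
b. >
c. white space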
Note that we are looking for one or more characters of the 'class' [^">\s] - or, in English:
not " :: let the 3rd group grab this
not > :: the end of tag token
not white space :: the end of attribute delimiter
since we are looking for the set of characters that are 'not delimiters' - perchance the bass-end-awkward way of making a set....
since <COLOR> in this context is either:
a. a sequence of alpha characters
b. a # followed by a sequence of hex digits
I figured it would be easier to NOT go with the more complex regex that would need to note that 'if preceded by a #, then must be hex digits...' Yeech, way too much work on that side of the trail.
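For comparison, here is a rough sketch of what that stricter regex might look like. It is not from the original code: the name un_colour_strict is made up for illustration, and it assumes a colour is either letters only or a # followed by exactly six hex digits. The lookahead at the end is needed so that the match cannot stop halfway through a value.

```perl
use strict;
use warnings;

# Hypothetical stricter variant (an assumption, not the original sub):
# the value must be a colour name (letters only) or '#' plus six hex digits.
sub un_colour_strict {
    my ($line) = @_;
    # (?=[\s>]|$) insists that the value is followed by a delimiter,
    # so something like bgcolor=red2 is not half-matched.
    $line =~ s/\s*bgcolor="?(?:\#[0-9a-f]{6}|[a-z]+)"?(?=[\s>]|$)//gi;
    return $line;
}

print un_colour_strict('<body bgcolor="#CCCCFF" other="fred">'), "\n";
print un_colour_strict('<body bgcolor="#zz">'), "\n";   # not a valid colour, left alone
```

Even this version does not check that the quotes are balanced, which is part of why the 'anything that is not a delimiter' approach is less work.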
The test case code had to include BOTH the ">" and the white space components so that it would correctly parse not merely the specific cases we are concerned about - but those cases in their 'natural environment' eg
<body bgcolor=red other="fred">
<body bgcolor=red>
<body bgcolor="red" other="fred">
<body bgcolor="#CCCCFF" other="fred">
<body bgcolor="#ccccff">
....remember that bgcolor is an attribute in a tag.
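A minimal sketch of what such a test loop could look like, running the sub over the sample tags above. The @cases array and the output layout are illustrative assumptions, not the original test code.

```perl
use strict;
use warnings;

# Same substitution as un_colour above.
sub un_colour {
    my ($line) = @_;
    $line =~ s/\s*bgcolor=("?)([^">\s]+)("?)//gi;
    return $line;
}

# The sample tags from the list above.
my @cases = (
    '<body bgcolor=red other="fred">',
    '<body bgcolor=red>',
    '<body bgcolor="red" other="fred">',
    '<body bgcolor="#CCCCFF" other="fred">',
    '<body bgcolor="#ccccff">',
);

# Print each tag next to its stripped form.
for my $tag (@cases) {
    printf "%-40s => %s\n", $tag, un_colour($tag);
}
```

Each tag should come out as either <body other="fred"> or <body>, depending on whether it carried a second attribute.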
Or allow me to argue the defect in the initial idea
$line =~ s/ *bgcolor=("?)(.*)("?)//gi ;
the problem is that middle group - the "match zero or more of anything" pattern, (.*). A very GREEDY GRAB - since it would take say
<body bgcolor="red" other="fred">
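and make that
<body
since the sequence - with the round braces delimiting the groups - matches:
/ bgcolor=(")(red" other="fred">)()/
which is the most greedy grab possible: the .* runs all the way to the end of the line, and the optional third group is left matching nothing. Which may have been what you were noticing in the output.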
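If you want to see the grab for yourself, a small sketch like this prints what each group captured and what is left of the line. The sub name greedy_strip is made up for illustration.

```perl
use strict;
use warnings;

# The defective substitution, wrapped in a sub so it can be compared.
sub greedy_strip {
    my ($line) = @_;
    $line =~ s/ *bgcolor=("?)(.*)("?)//gi;
    return $line;
}

my $line = '<body bgcolor="red" other="fred">';

# Show what each group captured on a plain match.
if ( $line =~ / *bgcolor=("?)(.*)("?)/i ) {
    print "group 1: [$1]\n";
    print "group 2: [$2]\n";
    print "group 3: [$3]\n";
}

# Show what survives the substitution.
print "result:  [", greedy_strip($line), "]\n";
```

On a single-line input the middle group takes everything after the opening quote, because .* only stops at a newline.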
So the simplest solution appeared to be to work out the list of things that were 'delimiters' and then allow anything in the middle group that was not a delimiter...
HTH...
ciao drieux
