Matching XML Tag Contents with Regex

Chris Tue, 11 Dec 2007 08:16:41 -0800

I'm trying to find the contents of an XML tag. Nothing fancy. I don't
care about parsing child tags or anything. I just want to get the raw
text. Here's my script:


import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text&#33;
</div>
<div class='default'>
here&apos;s some text&#33;
</div>
<div class='default'>
here&apos;s some text&#33;
</div>
</body>
"""

tagName = 'div'
pattern = re.compile(
    r'<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*'
    % dict(tagName=tagName))

matches = pattern.finditer(data)
for m in matches:
    contents = data[m.start():m.end()]
    print repr(contents)
    assert tagName not in contents

The problem I'm running into is that the [^(%(tagName)s)]* portion of
my regex seems to be ignored, so only one match is returned, starting
at the first <div> and running to the end of the text, when it should
end at the first </div>. For this example, it should return three
matches, one for each div.

Is what I'm trying to do possible with Python's Regex library? Is
there an error in my Regex?
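
For reference, the three-match result described above can be sketched
as follows. This is only a sketch under stated assumptions, not a
verified fix: it swaps in a non-greedy (.*?) with re.DOTALL in place
of the character classes, and it assumes the tag is never nested
inside itself. The helper name tag_contents is hypothetical. Two
points about the original pattern that the sketch avoids: a class such
as [\d\D] matches any character, so a greedy * on it runs to the end
of the data, and [^(div)] excludes the single characters "(", "d",
"i", "v" and ")", not the word "div".

```python
import re

data = """
<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text&#33;
</div>
<div class='default'>
here&apos;s some text&#33;
</div>
<div class='default'>
here&apos;s some text&#33;
</div>
</body>
"""


def tag_contents(tag_name, text):
    """Return the raw text between each <tag_name ...> and </tag_name>.

    Hypothetical helper: assumes the tag is not nested inside itself.
    """
    pattern = re.compile(
        # \b keeps 'div' from matching a longer tag name such as 'divx';
        # (.*?) is non-greedy, so it stops at the first closing tag.
        r'<%(tag)s\b[^>]*>(.*?)</%(tag)s\s*>' % dict(tag=re.escape(tag_name)),
        re.DOTALL,  # let '.' match newlines as well
    )
    return [m.group(1) for m in pattern.finditer(text)]


contents = tag_contents('div', data)
for c in contents:
    print(repr(c))
```

The capture group holds only the text between the tags, so the tag
name itself should not appear in any of the returned strings.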

Thanks,
Chris
-- 
http://mail.python.org/mailman/listinfo/python-list
