On Dec 11, 1:08 pm, "Diez B. Roggisch" <[EMAIL PROTECTED]> wrote:
> Chris wrote:
> > On Dec 11, 11:41 am, garage <[EMAIL PROTECTED]> wrote:
> >> > Is what I'm trying to do possible with Python's Regex library? Is
> >> > there an error in my Regex?
> >
> >> Search for '*?' on http://docs.python.org/lib/re-syntax.html.
> >
> >> To get around the greedy single match, you can add a question mark
> >> after the asterisk in the 'content' portion of the markup. This
> >> causes it to take the shortest match, instead of the longest. eg
> >
> >> <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?[^(%(tagName)s)]*
> >
> >> There's still some funkiness in the regex and logic, but this gives
> >> you the three matches
> >
> > Thanks, that's pretty close to what I was looking for. How would I
> > filter out tags that don't have certain text in the contents? I'm
> > running into the same issue again. For instance, if I use the regex:
> >
> > <%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*?(targettext)+[^(%(tagName)s)]*
> >
> > each match will include "targettext". However, some matches will still
> > include </%(tagName)s)>, presumably from the tags which didn't contain
> > targettext.
>
> Stop using the wrong tool for the job. Use lxml or BeautifulSoup to parse &
> access HTML.
>
> Diez

I was hoping a simple pattern like <tag>.*text.*</tag> wouldn't be too complicated for a regex, but now I'm starting to agree with you. Parsing the entire XML DOM would probably be a lot easier.
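
For what it's worth, here is a rough sketch of what the parser-based approach might look like. It uses the standard library's xml.etree.ElementTree rather than lxml or BeautifulSoup, purely so that it runs without installing anything. The sample markup, the "item" tag name, and the elements_containing() helper are all made up for illustration, since the real document never appears in this thread. ElementTree only accepts well-formed XML, so messy real-world HTML would probably still call for one of the libraries Diez mentioned.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the thread never shows the real markup.
SAMPLE = """<root>
  <item id="1">first targettext here</item>
  <item id="2">nothing of interest</item>
  <item id="3">nested <b>targettext</b> inside</item>
  <other id="4">targettext in a different tag</other>
</root>"""


def elements_containing(xml_source, tag_name, target_text):
    """Return every <tag_name> element whose text content contains target_text."""
    root = ET.fromstring(xml_source)
    matches = []
    # iter() walks the whole tree (root included) and yields only tag_name elements.
    for element in root.iter(tag_name):
        # itertext() yields the element's own text plus that of all its descendants.
        if target_text in "".join(element.itertext()):
            matches.append(element)
    return matches


found = elements_containing(SAMPLE, "item", "targettext")
print([el.get("id") for el in found])  # prints ['1', '3']
```

Because the parser hands back whole elements, a match can never run past one closing tag into the next element, which is the problem the regex versions above kept hitting.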