Hi Kent, I apologise for the not overly helpful initial post.
I had six possible uris to deal with -

/thread/28742/
/thread/28742/?s=1291819247219837219837129
/thread/28742/5/
/thread/28742/5/?s=1291819247219837219837129
/thread/28742/?goto=lastpost
/thread/28742/?s=1291819247219837219837129&goto=lastpost

The only ones I wanted to match were the first two.

My initial pattern /thread/[0-9]*?/(\?s\=.*)?(?!lastpost)$ matched the first two and the last in redemo.py (which I've got stashed as a py2exe bundle, should I ever find myself sans Python but having to use regexes).

I managed to sort it by using /thread/[0-9]*?/(\?s\=\w*)?$

The s avoids the fourth possibility, and the \w precludes the & in the last uri.

But circumventing the problem irks me no end, as I haven't fixed what I was doing wrong, which means I'll probably do it again, and avoiding problems instead of resolving them feels too much like programming for the Win32 api to me. (Where removing a service from the service database doesn't actually remove the service from the service database until you open and close a handle to the service database a second time...)

So yes, any advice on how to use negative lookaheads would be great. I get the feeling it was the .* before it.

As for my problem with BeautifulSoup, I'm not sure what was happening there. It was happening in the interactive console only, and I can't replicate it today, which suggests to me that I've engaged email before brain again.

I do like BeautifulSoup, however. Although people keep telling me about some XPath programme that's apparently better, I like BeautifulSoup, it works.

Regards,

Liam Clarke

On 12/18/05, Kent Johnson <[EMAIL PROTECTED]> wrote:
> Liam Clarke wrote:
> > Hi all,
> >
> > Using Beautiful Soup and regexes.. I've noticed that all the examples
> > used regexes like so - anchors = parseTree.fetch("a",
> > {"href":re.compile("pattern")} ) instead of precompiling the pattern.
> >
> > Myself, I have the following code -
> >
> > >>> z = []
> > >>> x = q.findNext("a", {"href":re.compile(".*?thread/[0-9]*?/.*",
> > ...     re.IGNORECASE)})
> > >>> while x:
> > ...     num = x.findNext("td", "tableColA")
> > ...     h = (x.contents[0],x.attrMap["href"],num.contents[0])
> > ...     z.append(h)
> > ...     x = x.findNext("a",{"href":re.compile(".*?thread/[0-9]*?/.*",
> > ...         re.IGNORECASE)})
> > ...
> >
> > This gives me a correct set of results. However, using the following -
> >
> > >>> z = []
> > >>> pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
> > >>> x = q.findNext("a", {"href":pattern)})
> > >>> while x:
> > ...     num = x.findNext("td", "tableColA")
> > ...     h = (x.contents[0],x.attrMap["href"],num.contents[0])
> > ...     z.append(h)
> > ...     x = x.findNext("a",{"href":pattern} )
> >
> > will only return the first found tag.
> >
> > Is the regex only evaluated once or similar?
>
> I don't know why there should be any difference unless BS modifies the
> compiled regex object and for some reason needs a fresh one each time.
> That would be odd and I don't see it in the source code.
>
> The code above has a syntax error (extra paren in the first findNext()
> call) - can you post the exact non-working code?
>
> > (Also any pointers on how to get negative lookahead matching working
> > would be great.
> > the regex (/thread/[0-9]*)(?!\/) still matches "/thread/28606/" and
> > I'd assumed it wouldn't.
>
> Putting these expressions into Regex Demo is enlightening - the regex
> matches against "/thread/2860" - in other words the "not /" is matching
> against the 6.
>
> You don't give an example of what you do want to match so it's hard to
> know what a better solution is.
> Some possibilities
> - match anything except a digit or a slash - [^0-9/]
> - match the end of the string - $
> - both of the above - ([^0-9/]|$)
>
> Kent
>
> > Regards,
> >
> > Liam Clarke
> > _______________________________________________
> > Tutor maillist - Tutor@python.org
> > http://mail.python.org/mailman/listinfo/tutor
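
For anyone following the lookahead question, here is a minimal sketch of what the patterns in this thread are doing, assuming Python 3's re module. The names URIS, original, fixed, matching() and backtracked are made up for illustration, and the "fixed" pattern is only one possible approach, not something proposed in the thread. The lookahead in (?!lastpost)$ is only tested at the position right before the end of the string, where nothing can follow, so it always succeeds. Moving it ahead of the query string lets it inspect the rest of the uri.

```python
import re

# The six uris from the post above.
URIS = [
    "/thread/28742/",
    "/thread/28742/?s=1291819247219837219837129",
    "/thread/28742/5/",
    "/thread/28742/5/?s=1291819247219837219837129",
    "/thread/28742/?goto=lastpost",
    "/thread/28742/?s=1291819247219837219837129&goto=lastpost",
]

# The initial pattern: .* swallows "&goto=lastpost", and the lookahead is
# then checked at the very end of the string, where it cannot fail.
original = re.compile(r"/thread/[0-9]*?/(\?s\=.*)?(?!lastpost)$")

# Hypothetical fix: check the lookahead before the query string is
# consumed, and let it scan ahead with its own .* for "lastpost".
fixed = re.compile(r"/thread/[0-9]+/(?!.*lastpost)(\?s=.*)?$")


def matching(pattern):
    """Return the uris in URIS that the compiled pattern finds a match in."""
    return [uri for uri in URIS if pattern.search(uri)]


print(matching(original))  # first two uris plus the last one
print(matching(fixed))     # first two uris only

# The quoted example: [0-9]* backtracks by one digit so that the
# character after the match is "6" rather than "/".
backtracked = re.search(r"(/thread/[0-9]*)(?!\/)", "/thread/28606/")
print(backtracked.group(0))  # /thread/2860
```

A negative lookahead only constrains what follows the exact position where it is written, and the regex engine is free to backtrack until that position satisfies it, which is why anchoring with $ or an explicit character class is often the simpler choice.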