
Re: high performance hyperlink extraction

2005-09-13 Thread Bryan Olson

Adam Monsen wrote:
> The following script is a high-performance link (<a href="...">...</a>) extractor.
[...]
> * extract links from text (most likely valid HTML)
[...]
> import re
> import urllib
>
> whiteout = re.compile(r'\s+')
>
> # grabs hyperlinks from text
> href_re = re.compile(r'''

Re: high performance hyperlink extraction

2005-09-13 Thread felipevaldez

Pretty nice. However, you won't capture the increasingly common JavaScript-driven redirections, like click me nor http://www.yahoo.com";> nor http://www.yahoo.com"; name=x> . I'm guessing it also won't correctly handle things like: click but you probably already knew all this stuff, didnt
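
The markup in the examples above did not survive archiving, so the exact cases cannot be reproduced here. As a rough illustration of the general point only, the sketch below uses Python's standard `html.parser` to separate ordinary `href` values from script-driven ones. The class name `LinkCollector`, the helper `collect`, and the sample markup are all hypothetical and do not come from the thread.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect plain hrefs and script-driven links from <a> tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.links = []     # ordinary href values
        self.scripted = []  # javascript: hrefs and onclick handlers

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        href = attrs.get('href')
        if href is not None:
            if href.strip().lower().startswith('javascript:'):
                self.scripted.append(href)
            else:
                self.links.append(href)
        # An onclick handler may navigate regardless of what href says.
        if attrs.get('onclick'):
            self.scripted.append(attrs['onclick'])


def collect(html):
    """Return (links, scripted) for the given HTML text."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links, parser.scripted
```

A parser of this kind also copes with unquoted attribute values and extra attributes after `href`, which a strict pattern may reject. It still cannot tell where a script-driven link actually leads; it can only flag it.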

high performance hyperlink extraction

2005-09-13 Thread Adam Monsen

The following script is a high-performance link (<a href="...">...</a>) extractor. I'm posting to this list in hopes that anyone interested will offer constructive criticism/suggestions/comments/etc. Mainly I'm curious what comments folks have on my regular expressions. Hopefully someone finds this kind of thing as
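
The script itself is cut off in this archive view, so what follows is not Adam's code. It is a minimal sketch of the general technique the post describes: a compiled regular expression that pulls `(url, text)` pairs out of anchor tags, plus a whitespace-collapsing pattern like the `whiteout` one quoted in the reply. The pattern `HREF_RE` and the function `extract_links` are assumptions made for illustration, written for Python 3.

```python
import re

# Hypothetical pattern, not the one from the original post: matches
# <a ... href="URL" ...>TEXT</a> with a single- or double-quoted URL.
HREF_RE = re.compile(
    r'''<a\b[^>]*?\bhref\s*=\s*(["'])(?P<url>.*?)\1[^>]*>(?P<text>.*?)</a>''',
    re.IGNORECASE | re.DOTALL,
)

# Collapses any run of whitespace, as in the quoted `whiteout` pattern.
WHITEOUT = re.compile(r'\s+')


def extract_links(html):
    """Return a list of (url, link_text) pairs found in the given text."""
    return [
        (m.group('url'), WHITEOUT.sub(' ', m.group('text')).strip())
        for m in HREF_RE.finditer(html)
    ]
```

For example, `extract_links('<a href="/x">X</a>')` yields one pair with URL `/x` and text `X`. A pattern like this is fast but strict: unquoted `href` values and links built by scripts are skipped.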