Re: Extracting links. - without modules

Chris Devers Mon, 17 Jan 2005 04:44:54 -0800

On Mon, 17 Jan 2005, Alexander Blüm wrote:

> this is also possible _without_ any modules, except maybe "strict".
> 
> # this will replace the contents of each match in @get
> foreach(@array){
>   my @get = $_ =~ /<a href="(.*?)">/g;
> }


What happens if the URL contains a double quote followed by an angle 
bracket?

It's not likely, but it can happen, and browsers may still cope with it.

If such a URL turns up, this regex breaks: the non-greedy match stops at 
the first "> it sees, so the captured link is cut short.

What happens if the URL isn't wrapped in quotes at all?

This is much more likely, and again it will work fine in browsers.

But this regex won't find it at all.
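
To make that concrete, here is a small sketch that runs the quoted pattern 
over a few anchors. The sample markup and the example.com URLs are made up 
for illustration, and the third sample (an extra attribute before href) is 
an additional case of the same kind, not one raised in the original message.

```perl
use strict;
use warnings;

# Hypothetical sample markup; browsers accept all three anchors.
my @samples = (
    '<a href="http://example.com/a">quoted</a>',
    '<a href=http://example.com/b>unquoted</a>',
    '<a class="x" href="http://example.com/c">extra attribute</a>',
);

my @found;
for my $html (@samples) {
    # The pattern from the quoted message, unchanged.
    push @found, $html =~ /<a href="(.*?)">/g;
}

# Only the links the pattern actually captured are printed.
print "$_\n" for @found;
```

Only the first anchor is captured. The unquoted href never matches, and 
neither does the anchor whose href is not the first attribute.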

This kind of problem is why HTML (and XML) is really best processed 
using pre-written parser modules, such as HTML::SimpleLinkExtor. A 
parser has a much better shot at getting a proper view of the document 
than a simple regex pattern match.

Yes, you can approach such problems with simple regular expressions 
like the one here. In many cases they'll work, and they may even run 
faster than the parser version would. On the other hand, this approach 
is much less robust: minor changes that don't break the HTML may break 
the regex, so you end up constantly adjusting it to handle the special 
cases that come up over time.

If you just parse it at the outset, such as with HTML::SimpleLinkExtor, 
then the code should be simple, robust, and useful for a long time.
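
As a rough sketch of the parser route, assuming HTML::SimpleLinkExtor is 
installed from CPAN (it is not a core module), something like the following 
should pull the href values out of quoted, unquoted, and single-quoted 
anchors alike. The sample document is hypothetical, and the run-time 
require is only there so the script says something useful when the module 
is missing.

```perl
use strict;
use warnings;

# Hypothetical sample document with three differently written anchors.
my $html = <<'HTML';
<p><a href="http://example.com/a">quoted</a>
<a href=http://example.com/b>unquoted</a>
<a class="x" href='http://example.com/c'>single-quoted, extra attribute</a></p>
HTML

my @links;

# Load the module at run time so a missing install gives a clear message.
if ( eval { require HTML::SimpleLinkExtor; 1 } ) {
    my $extor = HTML::SimpleLinkExtor->new;
    $extor->parse($html);

    # a() returns the href values of the <a> tags only.
    @links = $extor->a;
    print "$_\n" for @links;
}
else {
    print "HTML::SimpleLinkExtor is not installed; try: cpan HTML::SimpleLinkExtor\n";
}
```

Calling links() instead of a() would also return the targets of other 
link-carrying tags, such as img.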


-- 
Chris Devers

