John Von Essen [mailto:[EMAIL PROTECTED]] wrote:

>       ## Email me for explanation of Regex
>       if($_ =~ m/http:\/\/([\w\d]+(-+[\w\d]+)?\.)+[\w]{2,3}(\/.*)?/)
                                                        ^^^^^
> This will only print out internet urls like:
> 
> http://www.h-p.com.au/
> http://links.com/
> http://w.w.w.w.w.com/
> 
> NOT intranet urls like:
> 
> http://host/

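Your regex is not up to date.
The {2,3} assumes that a top-level domain has at most three letters, which
no longer holds for the new top-level domains like *.info. (Because nothing
anchors the end of the TLD, such URLs still match, but only by accident.)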

Furthermore, \w is not only letters but includes at least [a-zA-Z0-9_].
(It may include additional letters depending on your locale setting.)
Thus \d is a subset of \w, and [\w\d] can be written as plain \w.

And you don't check that 'http:' is at the beginning of the URL.
Thus "http://localhost/script?http://www.inter.net/foo/bar" would pass,
though it is an intranet URL.
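
A minimal sketch of the anchoring problem, using the URL above. The helper
name matches() and the variable names are made up for illustration; the
second pattern is simply the original one with ^ added in front.

```perl
use strict;
use warnings;

# The URL from the example: the host is "localhost" (intranet),
# but the query string contains an internet-style URL.
my $url = 'http://localhost/script?http://www.inter.net/foo/bar';

# The original pattern, unanchored: it may match anywhere in the string.
my $unanchored = qr{http://([\w\d]+(-+[\w\d]+)?\.)+[\w]{2,3}(/.*)?};

# The same pattern with ^ in front: the match must start at position 0.
my $anchored = qr{^http://([\w\d]+(-+[\w\d]+)?\.)+[\w]{2,3}(/.*)?};

# Hypothetical helper: returns 1 if the string matches, else 0.
sub matches {
    my ($re, $string) = @_;
    return $string =~ $re ? 1 : 0;
}

printf "unanchored: %d\n", matches($unanchored, $url);
printf "anchored:   %d\n", matches($anchored,   $url);
```

The unanchored pattern finds the second "http://" inside the query string,
so it reports a match; the anchored one does not.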

I suggest the following regex:

m#^https?\://(\w+(-+\w+)?\.)+[a-z]{2,}(\:\d+)?(/|$)#i

(I've used # as the delimiter for readability.)
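
To try the suggested regex, something like the following sketch could be
used. The sub name is_internet_url() is hypothetical, and the .info and
port URLs are made-up test cases; the other URLs come from this thread.

```perl
use strict;
use warnings;

# The suggested pattern, copied unchanged from above.
my $internet_url = qr#^https?\://(\w+(-+\w+)?\.)+[a-z]{2,}(\:\d+)?(/|$)#i;

# Hypothetical helper: returns 1 if the URL looks like an internet URL.
sub is_internet_url {
    my ($url) = @_;
    return $url =~ $internet_url ? 1 : 0;
}

for my $url (
    'http://www.h-p.com.au/',
    'http://links.com/',
    'http://w.w.w.w.w.com/',
    'http://www.example.info/',             # made-up: TLD longer than 3 letters
    'https://www.example.com:8080/path',    # made-up: https with a port
    'http://host/',
    'http://localhost/script?http://www.inter.net/foo/bar',
) {
    printf "%-55s %s\n", $url, is_internet_url($url) ? 'internet' : 'rejected';
}
```

The first five URLs should be reported as internet URLs and the last two
as rejected.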

Notes:  
 - URLs with a '_' in the hostname pass, though they are invalid IIRC.
 - URLs with raw IP numbers like http://127.0.0.0/ are rejected.
   It is probably best to check them by a second regex and then handle them
   depending on the IP number.
 - It only covers http and https URLs.
   Is it safe to replace "^https?" by "^[a-z]+"?
   Of course some internet URLs would still be rejected, e.g. "mailto:".
   But could some intranet URLs match the regex?
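
For the note about raw IP numbers, a second check could look like the
sketch below. The sub name classify_ip_url() is hypothetical, and treating
the loopback range plus the private ranges 10.x.x.x, 172.16-31.x.x and
192.168.x.x as "intranet" is my assumption; your network may need other
ranges.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 'intranet' or 'internet' for URLs whose
# host is a raw dotted-quad IP number, and undef for everything else.
sub classify_ip_url {
    my ($url) = @_;

    # Capture the four octets; an optional port may follow the address.
    my @octet = $url =~
        m#^https?://(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})(?::\d+)?(?:/|$)#i
        or return;

    # Reject numbers that cannot be part of an IPv4 address.
    return if grep { $_ > 255 } @octet;

    my ($first, $second) = @octet;

    # Assumption: loopback and private address ranges count as intranet.
    return 'intranet' if $first == 10 || $first == 127;
    return 'intranet' if $first == 172 && $second >= 16 && $second <= 31;
    return 'intranet' if $first == 192 && $second == 168;

    return 'internet';
}

for my $url ('http://127.0.0.0/', 'http://192.168.1.10:8080/', 'http://8.8.8.8/') {
    printf "%-30s %s\n", $url, classify_ip_url($url);
}
```

URLs with a hostname instead of an IP number return undef here, so they
can be passed on to the hostname regex.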

Ciao, Claus
