I'll offer the following code to get you started on the task, and invite critiques/improvements! (Cribbed from various sources and 'tuned'. Note that either apostrophes or double quotes can be used to delimit the URL.)

$bValidity = $iFound = preg_match_all( "/(href *= *['\"]?)([^'\" >]*)(['\" >])/i", $HTML, $aRegExOut );
if ( 0 < $iFound )
{
    // $aRegExOut[2] holds the second capture group: the bare URLs
    $aA = $aRegExOut[2];
    // DEBUG and ShowList() are local debugging helpers, not shown here
    if ( DEBUG ) { ShowList( "Located", $aA ); }
}

BTW I'm covering the case of finding multiple links in one piece of HTML. This can be dialed back for single cases. The rest I'll leave to you.

Regards,
=dn

----- Original Message -----
From: "SpamSucks86" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: 20 February 2002 21:16
Subject: RE: [PHP] regexp on user supplied link

> I absolutely hate regular expressions because I suck at writing
> them...but I can help you with the logic. I was thinking search for a
> pattern which matches HREF=" + any number of characters + ". Your match
> would be HREF="blahblahblah". Then, you could go and chop off the HREF="
> and the trailing ", and then you are left with just a URL. Then, you can
> use that built-in URL parser function (I forget its name, I think it
> might be parse_url()). Then, if there is no host, it's obviously a
> relative link; otherwise, you can just see if the host matches or not.
> This should work well. Good luck
>
> -----Original Message-----
> From: Martin Towell [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, February 19, 2002 6:59 PM
> To: '[EMAIL PROTECTED]'; php
> Subject: RE: [PHP] regexp on user supplied link
>
> reg.ex. something like (not tested):
> "<a[^>]*>"
> this would give you the entire anchor tag, then go from there?
>
> or what about using the XML parsing routines, get it to find the anchors
> and give you its attributes, then go from there?
>
> Martin
>
> -----Original Message-----
> From: Justin French [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, February 20, 2002 10:46 AM
> To: php
> Subject: [PHP] regexp on user supplied link
>
> Hi,
>
> I have a website which is based purely on user-added content.
> The problem with this is that some areas allow users to use links in the
> text, and it's difficult to ensure that they all have a decent knowledge
> of attributes such as target="_new", etc.
>
> So, I'd like a script that...
>
> 1. looks at $text for any link tags, and for each tag, does the
>    following:
>
> 2. throws out everything except the HREF, e.g.:
>    <A HREF="http://www.somesite.com" target="_new">click</a> becomes
>    http://www.somesite.com
>    <A HREF="javascript:something();"> becomes javascript:something();
>
> 3. prefixes the URL with <A HREF="
>
> 4. establishes if it's an internal or external link: so how do we
>    establish if it's an external link? Well, it'd be easy if we just say
>    "anything beginning with http:// is not relative", but because this
>    content is user-driven, I'd like to be a little safer, and say
>    "anything that begins with http://www.mysite.com OR http://mysite.com"
>    is an internal link.
>
> 5. if it's an external link, suffixes the URL with " TARGET="_new">, or
>    if it's internal, suffixes it with ">
>
> Anyway, that'd be a great start. From there, I might like to prefix each
> external link so that it goes through a program called out.php to log
> affiliate activity, and I might like to retain onmouseover, onclick,
> onmouseout etc. properties in the tag, I might like to ensure a session
> ID is found within each internal link, and stripped from each external
> link, ensure that the <A> has a matching </A>, etc., but the above would
> be a great start.
>
> Any help, especially with steps 1, 2 & 4, would be much appreciated.
>
> Thanks in advance,
>
> Justin French
> http://indent.com.au
> http://soundpimps.com
>
> --
> PHP General Mailing List (http://www.php.net/)
> To unsubscribe, visit: http://www.php.net/unsub.php
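
PS: on steps 3 to 5 of Justin's list, here is a rough sketch of how each URL pulled out by the regex at the top of this message might be classified and wrapped again. It leans on PHP's parse_url(), which is the URL parser the earlier reply was reaching for. Everything named here is my own invention for illustration: MY_HOSTS, ClassifyLink() and BuildOpenTag() are hypothetical, the host names are the placeholders from Justin's message, and the 'other' bucket for javascript: and mailto: links is an assumption about what you'd want. It also assumes a PHP version recent enough for constant arrays. Treat it as a starting point, not a finished answer.

```php
<?php
// Sketch only. MY_HOSTS, ClassifyLink() and BuildOpenTag() are hypothetical
// names; the hosts are the placeholders used in Justin's message.
const MY_HOSTS = [ 'www.mysite.com', 'mysite.com' ];

// Returns 'internal', 'external' or 'other' (javascript:, mailto:, unparseable).
function ClassifyLink( $sURL )
{
    $aParts = parse_url( $sURL );
    if ( false === $aParts ) { return 'other'; }   // seriously malformed URL

    // Anything with a scheme other than http/https is neither internal nor external.
    if ( isset( $aParts['scheme'] )
      && ! in_array( strtolower( $aParts['scheme'] ), [ 'http', 'https' ], true ) )
    {
        return 'other';
    }

    // No host component means a relative link, hence internal.
    if ( ! isset( $aParts['host'] ) ) { return 'internal'; }

    return in_array( strtolower( $aParts['host'] ), MY_HOSTS, true )
        ? 'internal'
        : 'external';
}

// Steps 3 and 5: rebuild the opening tag, adding TARGET only for external links.
function BuildOpenTag( $sURL )
{
    // The URL came out of existing HTML, so don't double-encode entities.
    $sTag = '<A HREF="' . htmlspecialchars( $sURL, ENT_QUOTES, 'UTF-8', false ) . '"';
    if ( 'external' === ClassifyLink( $sURL ) ) { $sTag .= ' TARGET="_new"'; }
    return $sTag . '>';
}
```

Feed it each entry of $aA from the snippet at the top, e.g. BuildOpenTag( $aA[0] ). Comparing the parsed host against a list, rather than testing for a leading "http://", is what keeps a link such as http://mysite.com.evil.example/ from passing as internal.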