Re: [PHP] Regex pattern for preg_match_all
On 19/02/2011 at 0:23, Tommy Pham wrote:
> @Simon, Thanks for explaining about the [^href]. I need to read up more
> about greediness. I thought I understood it but guess not.
>
> @Peter, I tried your pattern but it didn't capture all of my new test
> cases. Also, it captures the single/double quotes in addition to the
> fragments inside the href. I couldn't figure out how to modify your
> pattern to exclude the ', ", and URL fragment from group 1 matches.
>
> Below is the new pattern with the new sample test cases that I got it to
> work. The new pattern failed only 1 of the non-compliant.
>
> $html = <<<HTML
> [...]
> HTML;
>
> $pattern = '%<a[\s]+[^>]*?href\s*=\s*[\'"]?([^\'">#]*)[\'"]?\s?[^>]*>(.*?)</a>%ims';
> preg_match_all($pattern, $html, $matches);
>
> Thanks for your time,
> Tommy

Hi Tommy,

This is why you shouldn't mix regexes and HTML/XML, especially when you are not sure that you are working with valid/consistent HTML.

A great/fun answer has been posted on StackOverflow about this at
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

You could easily break any regular-expression solution by adding some valid comments, see the example here:
http://stackoverflow.com/questions/1357357/regexp-to-add-attribute-in-any-xml-tags/1357393#1357393

You really should consider using an XML parser instead for this kind of job.
Here is a simple sample that matches your example:

<?php
$oTidy = new tidy();
$html = $oTidy->repairString($html, array("clean" => true, "drop-proprietary-attributes" => true));
unset($oTidy);

$matches = get_links($html);

function get_links($html) {
    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the HTML string into the DOM
    $xml->loadHTML($html);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> tag in the DOM and add it to the link array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $links[] = array('url' => $link->getAttribute('href'), 'text' => $link->nodeValue);
    }

    // Return the links
    return $links;
}
?>

Regards,
Yann

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
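To illustrate the point about comments, here is a minimal sketch. It assumes the dom extension is available and skips the tidy step. The sample markup and the regex are hypothetical stand-ins, not the data or patterns posted in this thread.

```php
<?php
// Hypothetical sample: one commented-out link and one real link.
$html = <<<HTML
<html><body>
<!-- <a href="/old/link">commented out</a> -->
<a class='link' href='/sample/link_single_quote#fragment' title='sample link'>content</a>
</body></html>
HTML;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// The DOM only contains elements that are really part of the document.
$domLinks = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    $domLinks[] = $a->getAttribute('href');
}

// A simple regex (illustrative only) cannot tell that the first link is inside a comment.
preg_match_all('%<a\s[^>]*?href\s*=\s*[\'"]([^\'"]*)[\'"]%i', $html, $m);
$regexLinks = $m[1];

print_r($domLinks);   // one entry
print_r($regexLinks); // two entries, including the commented-out link
```

The parser reports one link while the regex reports two, which is the kind of breakage the second StackOverflow answer describes.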
[PHP] Regex pattern for preg_match_all
Hi folks,

This is not directly related to PHP but it's Friday so I'm gonna give it a shot :). Would someone please help me figure out why my regex pattern doesn't work? Below is the code and sample data:

$html = <<<HTML
<li class="small tab"> <a class="y-mast-link images" href="http://images.search.yahoo.com/images" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small tab"> <a class="y-mast-link video" href="http://video.search.yahoo.com/video" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Video</span></a></li>
<li class="small tab"> <a class="y-mast-link local" href="http://local.yahoo.com/results" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Local</span></a></li>
<li class="small tab"> <a class="y-mast-link shopping" href="http://shopping.yahoo.com/search" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
<li class="small lasttab more-tab"> <a class="y-mast-link more" href="http://tools.search.yahoo.com/about/forsearchers.html" ><span class="tab-cover y-mast-bg-hide">More</span><span class="y-fp-pg-controls arrow"></span></a></li>
HTML;

$pattern = '%<a\s[^href]*href\s*=\s*[\'|"]?([^\'|"|#]+)[\'|"]?\s*[^>]*>(.*)?</a>%im';
preg_match_all($pattern, $html, $matches);

The only match I got is:

Match 1 of 1: <a class="y-mast-link local" href="http://local.yahoo.com/results" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Local</span></a>
Group 1: http://local.yahoo.com/results
Group 2: <span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Local</span>

The pattern I made was meant to work in cases where the page does not comply with any W3C standard.

Thanks,
Tommy
Re: [PHP] Regex pattern for preg_match_all
As far as I can tell, your problem lies in [^href]*. That will match any characters other than h, r, e or f, not anything other than the string href.

Consider replacing it with [^>]*?. The ? makes it non-greedy so it will stop as soon as it can (when it matches the first href) rather than as late as it can (when it matches a >).

---
Simon Welsh
Sent from my phone, excuse the brevity

On 19/02/2011, at 10:36, Tommy Pham <tommy...@gmail.com> wrote:
> Would someone please help me figure out why my regex pattern doesn't
> work. Below is the code and sample data:
> [...]
> $pattern = '%<a\s[^href]*href\s*=\s*[\'|"]?([^\'|"|#]+)[\'|"]?\s*[^>]*>(.*)?</a>%im';
> preg_match_all($pattern, $html, $matches);
> [...]
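The character-class explanation above can be checked with a small sketch. The two sample lines are trimmed-down, hypothetical versions of the Yahoo links, and both patterns are simplified for illustration (double-quoted href only), so this is not the original pattern from the question.

```php
<?php
// "images" contains an 'e'; "local" contains none of h, r, e, f.
$html = <<<HTML
<li><a class="y-mast-link images" href="http://images.search.yahoo.com/images">Images</a></li>
<li><a class="y-mast-link local" href="http://local.yahoo.com/results">Local</a></li>
HTML;

// [^href]* is a character class: it stops at any single h, r, e or f.
$charClass = '%<a\s[^href]*href\s*=\s*"([^"]+)"%i';

// [^>]*? lazily skips anything inside the tag until "href" can match.
$lazy = '%<a\s[^>]*?href\s*=\s*"([^"]+)"%i';

preg_match_all($charClass, $html, $m1);
preg_match_all($lazy, $html, $m2);

print_r($m1[1]); // only the "local" URL
print_r($m2[1]); // both URLs
```

This also suggests why "local" was the only hit in the original run: its class name is the only one in the sample without an h, r, e or f before the href attribute.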
Re: [PHP] Regex pattern for preg_match_all
On 18 February 2011 22:36, Tommy Pham <tommy...@gmail.com> wrote:
> Would someone please help me figure out why my regex pattern doesn't
> work. Below is the code and sample data:
> [...]
> $pattern = '%<a\s[^href]*href\s*=\s*[\'|"]?([^\'|"|#]+)[\'|"]?\s*[^>]*>(.*)?</a>%im';
> preg_match_all($pattern, $html, $matches);
>
> The only matches I got is:
> [...]
> The pattern I made was to work in cases where the page is non-compliant
> to any of standard W3.

Not entirely sure what your input data is, as I'm guessing one or more mail programs may have added line breaks. When I run the code I get no matches at all - so I'm guessing you might have different input on your end. More specifically, I'm also guessing you have line breaks on your end, but not equally distributed - which would explain the one hit.

Apart from that, there are a couple of things I'd rework in your regex:

%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\']+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims

* added a quantifier to the first whitespace match
* allowing for any character not followed by href (non-greedy)
* match the href
* use proper alternation
* capture anything inside the <a> tag, non-greedy
* match with a closing </a> tag

Results:

array(3) {
  [0]=>
  array(5) {
    [0]=>
    string(205) "<a class="y-mast-link images" href="http://images.search.yahoo.com/images" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Images</span></a>"
    [1]=>
    string(201) "<a class="y-mast-link video" href="http://video.search.yahoo.com/video" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Video</span></a>"
    [2]=>
    string(196) "<a class="y-mast-link local" href="http://local.yahoo.com/results" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Local</span></a>"
    [3]=>
    string(204) "<a class="y-mast-link shopping" href="http://shopping.yahoo.com/search" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Shopping</span></a>"
    [4]=>
    string(188) "<a class="y-mast-link more" href="http://tools.search.yahoo.com/about/forsearchers.html" ><span class="tab-cover y-mast-bg-hide">More</span><span class="y-fp-pg-controls arrow"></span></a>"
  }
  [1]=>
  array(5) {
    [0]=>
    string(39) ""http://images.search.yahoo.com/images""
    [1]=>
    string(37) ""http://video.search.yahoo.com/video""
    [2]=>
    string(32) ""http://local.yahoo.com/results""
    [3]=>
    string(34) ""http://shopping.yahoo.com/search""
    [4]=>
    string(55) ""http://tools.search.yahoo.com/about/forsearchers.html""
  }
  [2]=>
  array(5) {
    [0]=>
    string(96) "<span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Images</span>"
    [1]=>
    string(95) "<span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Video</span>"
    [2]=>
    string(95) "<span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Local</span>"
    [3]=>
    string(98) "<span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Shopping</span>"
    [4]=>
    string(94) "<span class="tab-cover y-mast-bg-hide">More</span><span class="y-fp-pg-controls arrow"></span>"
  }
}

--
<hype>
WWW: plphp.dk / plind.dk
LinkedIn: plind
BeWelcome/Couchsurfing: Fake51
Twitter: kafe15
</hype>
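Two properties of the pattern above are easy to check in isolation: group 1 keeps the surrounding quotes (as the results show), and the `\s+.*?(?!href)\s+href` prefix needs something between `<a` and `href`. The sketch below assumes two hypothetical one-line inputs that are not part of the posted sample data.

```php
<?php
// Pattern as given in the message above.
$pattern = '%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\']+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims';

// Hypothetical inputs: one with an attribute before href, one with href first.
$withClass = '<a class="link" href="/sample/link">content</a>';
$hrefFirst = '<a href="/sample/link">content</a>';

$n1 = preg_match($pattern, $withClass, $m1);
$n2 = preg_match($pattern, $hrefFirst, $m2);

var_dump($n1, $m1[1], $m1[2]); // matched; group 1 still includes the double quotes
var_dump($n2);                 // no match when href is the first attribute
```

In the second input the only whitespace is consumed by the first `\s+`, so the later `\s+href` has nothing left to match.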
Re: [PHP] Regex pattern for preg_match_all
@Simon, Thanks for explaining about the [^href]. I need to read up more about greediness. I thought I understood it but guess not.

@Peter, I tried your pattern but it didn't capture all of my new test cases. Also, it captures the single/double quotes in addition to the fragments inside the href. I couldn't figure out how to modify your pattern to exclude the ', ", and URL fragment from group 1 matches.

Below is the new pattern with the new sample test cases that I got it to work. The new pattern failed only 1 of the non-compliant.

$html = <<<HTML
<a href=/sample/link>content</a>
<a class=link href=/sample/link_extra_attribs title=sample link>content link_extra_attribs</a>
<a href='/sample/link_single_quote'>content link_single_quote</a>
<a class='link' href='/sample/link_single_quote_pre_attribs'>content link_single_quote_pre_attribs</a>
<a class='link' href='/sample/link_single_quote_extra_attribs' title='sample link'>content link_single_quote_extra_attribs</a>
<a class='link' href='/sample/link_single_quote_extra_attribs_frag#fragment' title='sample link'>content link_single_quote_extra_attribs_frag#fragment</a>
<a class='link' href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment' title='sample link'>content link_single_quote_extra_attribs_query_frag?par=val#fragment</a>
<a href="/sample/link_double_quote">content link_double_quote</a>
<a class="link" href="/sample/link_double_quote_pre_attribs">content link_double_quote_pre_attribs</a>
<a class="link" href="/sample/link_double_quote_extra_attribs_frag#fragment" title="sample link">content link_double_quote_extra_attribs_frag#fragment</a>
<a class="link" href="/sample/link_double_quote_extra_attribs_nested_tag" title="sample link"><img class="image" src="/images/content.jpg" alt="content" title="content"> link_double_quote_extra_attribs_nested_tag</a>
<a href=#fragment>content fragment</a>
<a class=link href=#fragment title=sample link>content fragment</a>
<li class="small tab"> <a class="y-mast-link images" href="http://images.search.yahoo.com/images" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small tab"> <a class="y-mast-link video" href="http://video.search.yahoo.com/video" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Video</span></a></li>
<li class="small tab"> <a class="y-mast-link local" href="http://local.yahoo.com/results" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Local</span></a></li>
<li class="small tab"> <a class="y-mast-link shopping" href="http://shopping.yahoo.com/search" data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide" style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
<li class="small lasttab more-tab"> <a class="y-mast-link more" href="http://tools.search.yahoo.com/about/forsearchers.html" ><span class="tab-cover y-mast-bg-hide">More</span><span class="y-fp-pg-controls arrow"></span></a></li>
HTML;

$pattern = '%<a[\s]+[^>]*?href\s*=\s*[\'"]?([^\'">#]*)[\'"]?\s?[^>]*>(.*?)</a>%ims';
preg_match_all($pattern, $html, $matches);

Thanks for your time,
Tommy
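One possible way to keep the quotes and the fragment out of group 1 is a PCRE branch-reset group, `(?|...)`, where each alternative reuses the same group number. This is a sketch, not the pattern from the thread: the three sample lines are shortened, hypothetical variants of the test cases, and a regex like this still inherits the limits of regex-based HTML matching (comments, attribute names that end in "href", and so on).

```php
<?php
// Hypothetical inputs: unquoted, single-quoted and double-quoted href values.
$html = <<<HTML
<a href=/sample/link>content</a>
<a class='link' href='/sample/link_single_quote?par=val#fragment' title='sample link'>content</a>
<a class="link" href="/sample/link_double_quote#fragment">content</a>
HTML;

// (?| ... ) resets group numbering per alternative, so the URL is always group 1:
//   "..."  : capture up to " or #, then skip the rest of the quoted value
//   '...'  : same for single quotes
//   bare   : capture up to whitespace, >, a quote or #
$pattern = '%<a\s[^>]*?\bhref\s*=\s*(?|"([^"#]*)[^"]*"|\'([^\'#]*)[^\']*\'|([^\s>\'"#]*)[^\s>]*)[^>]*>(.*?)</a>%is';

preg_match_all($pattern, $html, $m);

print_r($m[1]); // URLs without quotes or fragments
print_r($m[2]); // link contents
```

With these inputs, group 1 holds the bare path (plus query string, if any) and group 2 holds the link content.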