Re: [PHP] Regex pattern for preg_match_all

2011-02-22 Thread Yann Milin

Hi Tommy,

This is why you shouldn't mix regexes and HTML/XML, especially when you 
are not sure that you are working with valid, consistent HTML.
A great (and fun) answer about this has been posted on Stack Overflow at 
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454


You could easily break any regular-expression solution by adding some 
valid comments; see the example here: 
http://stackoverflow.com/questions/1357357/regexp-to-add-attribute-in-any-xml-tags/1357393#1357393


You really should consider using an XML parser instead for this kind of job.

Here is a simple sample that handles your example:

<?php
// Let Tidy repair the markup before handing it to the DOM parser
$oTidy = new tidy();
$html = $oTidy->repairString($html, array("clean" => true,
    "drop-proprietary-attributes" => true));

unset($oTidy);

$matches = get_links($html);

function get_links($html) {

    // Create a new DOM Document to hold our webpage structure
    $xml = new DOMDocument();

    // Load the HTML string into the DOM
    $xml->loadHTML($html);

    // Empty array to hold all links to return
    $links = array();

    // Loop through each <a> tag in the DOM and add it to the link array
    foreach ($xml->getElementsByTagName('a') as $link) {
        $links[] = array('url' => $link->getAttribute('href'),
            'text' => $link->nodeValue);
    }

    // Return the links
    return $links;
}
?>
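
For readers who want to see the comment problem in action, here is a minimal sketch. It is not part of the original message: the sample string and the variable names are made up for illustration, and it skips Tidy so that it only needs PHP's bundled DOM extension. It compares a simple href regex with DOMDocument on markup that contains a commented-out link, and drops the URL fragment with explode().

```php
<?php
// Hypothetical sample: one commented-out link and one live link with a fragment.
$html = '<p><!-- <a href="/old">old</a> --><a href="/new#top">new</a></p>';

// A regex sees the commented-out link as well as the live one.
preg_match_all('%<a\s[^>]*?href="([^"]*)"%i', $html, $byRegex);

// DOMDocument parses the comment as a comment node, not as an <a> element.
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$byDom = array();
foreach ($doc->getElementsByTagName('a') as $a) {
    // Keep only the part of the href before the first "#"
    $byDom[] = explode('#', $a->getAttribute('href'), 2)[0];
}

print_r($byRegex[1]); // both /old and /new#top
print_r($byDom);      // only /new
```

The regex could be extended to skip comments, but every such special case has to be added by hand, which is the point of using a parser.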

Regards,
Yann

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



[PHP] Regex pattern for preg_match_all

2011-02-18 Thread Tommy Pham
Hi folks,

This is not directly related to PHP, but it's Friday so I'm gonna give
it a shot :).  Would someone please help me figure out why my regex
pattern doesn't work?  Below are the code and sample data:

$html = <<<HTML
<li class="small  tab "><a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small  tab "><a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a></li>
<li class="small  tab "><a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a></li>
<li class="small  tab "><a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
<li class="small lasttab more-tab "><a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a></li>
HTML;

$pattern =
'%<a\s[^href]*href\s*=\s*[\'|"]?([^\'|"|#]+)[\'|"]?\s*[^>]*>(.*)?</a>%im';
preg_match_all($pattern, $html, $matches);

The only match I got is:

Match 1 of 1:   <a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a>

Group 1:        http://local.yahoo.com/results

Group 2:        <span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span>

I wrote the pattern to work even when the page does not comply with
any W3C standard.

Thanks,
Tommy




Re: [PHP] Regex pattern for preg_match_all

2011-02-18 Thread Simon J Welsh
As far as I can tell, your problem lies in [^href]*. That will match any 
characters other than h, r, e or f, not anything other than the string href. 
Consider replacing it with [^>]*?. The ? makes it non-greedy so it will stop as 
soon as it can (when it matches the first href) rather than as late as it can 
(when it matches a >).
---
Simon Welsh
Sent from my phone, excuse the brevity
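
A minimal sketch of the point above, for anyone who wants to try it. The two-link sample string and the variable names are assumptions made for illustration, not data from the thread. The class name "local" contains none of the letters h, r, e, f, while "images" contains an e, so the character-class version finds only the first link.

```php
<?php
// Hypothetical sample: "local" has no h/r/e/f in it, "images" has an "e".
$html = '<a class="local" href="/a">A</a> <a class="images" href="/b">B</a>';

// [^href]* is a character class: it stops at any single h, r, e or f.
$charClass = '%<a\s[^href]*href="([^"]*)"%i';

// [^>]*? lazily skips anything inside the tag until href can match.
$lazy = '%<a\s[^>]*?href="([^"]*)"%i';

preg_match_all($charClass, $html, $m1);
preg_match_all($lazy, $html, $m2);

print_r($m1[1]); // only /a
print_r($m2[1]); // /a and /b
```

This mirrors the single hit on the "local" link reported in the original question.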



Re: [PHP] Regex pattern for preg_match_all

2011-02-18 Thread Peter Lind
Not entirely sure what your input data is, as I'm guessing one or more
mail programs may have added line breaks. When I run the code I get no
matches at all - so I'm guessing you might have different input on
your end. More specifically, I'm also guessing you have line breaks on
your end, but not equally distributed - which would explain the one
hit.
Apart from that, there are a couple of things I'd rework in your regex:

%<a\s+.*?(?!href)\s+href\s*=\s*([^\s\']+|\'[^\']+\'|\"[^\"]+\")[^>]*>(.*?)</a>%ims

* added a quantifier to the first whitespace
* allow any characters not followed by href (non-greedy)
* match the href
* use proper alternation
* capture anything inside the <a> tag, non-greedy
* match the closing </a> tag

Results:
array(3) {
  [0]=>
  array(5) {
    [0]=>
    string(205) "<a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a>"
    [1]=>
    string(201) "<a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a>"
    [2]=>
    string(196) "<a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a>"
    [3]=>
    string(204) "<a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a>"
    [4]=>
    string(188) "<a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a>"
  }
  [1]=>
  array(5) {
    [0]=>
    string(39) ""http://images.search.yahoo.com/images""
    [1]=>
    string(37) ""http://video.search.yahoo.com/video""
    [2]=>
    string(32) ""http://local.yahoo.com/results""
    [3]=>
    string(34) ""http://shopping.yahoo.com/search""
    [4]=>
    string(55) ""http://tools.search.yahoo.com/about/forsearchers.html""
  }
  [2]=>
  array(5) {
    [0]=>
    string(96) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span>"
    [1]=>
    string(95) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span>"
    [2]=>
    string(95) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span>"
    [3]=>
    string(98) "<span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span>"
    [4]=>
    string(94) "<span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span>"
  }
}
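
On the "use proper alternation" point, a small sketch of why it matters. The test characters and variable names are made up for illustration. Inside a character class the pipe is a literal character, so a class written as quote, pipe, double quote also accepts a stray "|"; a group with | accepts only the two quote characters.

```php
<?php
// Inside [...] the pipe is an ordinary character, so this class accepts "|" too.
$class = '%^[\'|"]$%';

// Inside a group, | is real alternation: only the two quote characters match.
$alternation = '%^(?:\'|")$%';

$results = array();
foreach (array("'", '"', '|') as $ch) {
    // Each entry holds: [matched by the class, matched by the alternation]
    $results[$ch] = array(preg_match($class, $ch), preg_match($alternation, $ch));
}

print_r($results);
```

Only the "|" entry differs between the two patterns: the character class matches it, the alternation does not.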


-- 
<hype>
WWW: plphp.dk / plind.dk
LinkedIn: plind
BeWelcome/Couchsurfing: Fake51
Twitter: kafe15
</hype>




Re: [PHP] Regex pattern for preg_match_all

2011-02-18 Thread Tommy Pham
@Simon,

Thanks for explaining about the [^href].  I need to read up more about
greediness.  I thought I understood it but guess not.

@Peter,

I tried your pattern but it didn't capture all of my new test cases.
Also, it captures the single/double quotes in addition to the
fragments inside the href.  I couldn't figure out how to modify your
pattern to exclude the ', ", and URL fragment from group 1 matches.

Below is the new pattern, along with the new sample test cases I used
to get it working.  The new pattern fails only one of the
non-compliant cases.

$html = <<<HTML
<a href=/sample/link>content</a>
<a class=link href=/sample/link_extra_attribs title=sample
link>content link_extra_attribs</a>
<a href='/sample/link_single_quote'>content link_single_quote</a>
<a class='link' href='/sample/link_single_quote_pre_attribs'>content
link_single_quote_pre_attribs</a>
<a class='link' href='/sample/link_single_quote_extra_attribs'
title='sample link'>content link_single_quote_extra_attribs</a>
<a class='link'
href='/sample/link_single_quote_extra_attribs_frag#fragment'
title='sample link'>content
link_single_quote_extra_attribs_frag#fragment</a>
<a class='link'
href='/sample/link_single_quote_extra_attribs_query_frag?par=val#fragment'
title='sample link'>content
link_single_quote_extra_attribs_query_frag?par=val#fragment</a>
<a href="/sample/link_double_quote">content link_double_quote</a>
<a class="link" href="/sample/link_double_quote_pre_attribs">content
link_double_quote_pre_attribs</a>
<a class="link"
href="/sample/link_double_quote_extra_attribs_frag#fragment"
title="sample link">content
link_double_quote_extra_attribs_frag#fragment</a>
<a class="link"
href="/sample/link_double_quote_extra_attribs_nested_tag"
title="sample link"><img class="image" src="/images/content.jpg"
alt="content" title="content">
link_double_quote_extra_attribs_nested_tag</a>
<a href=#fragment>content fragment</a>
<a class=link href=#fragment title=sample link>content fragment</a>
<li class="small  tab "><a class="y-mast-link images"
href="http://images.search.yahoo.com/images"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Images</span></a></li>
<li class="small  tab "><a class="y-mast-link video"
href="http://video.search.yahoo.com/video"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Video</span></a></li>
<li class="small  tab "><a class="y-mast-link local"
href="http://local.yahoo.com/results"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Local</span></a></li>
<li class="small  tab "><a class="y-mast-link shopping"
href="http://shopping.yahoo.com/search"
data-b="http://www.yahoo.com"><span class="tab-cover y-mast-bg-hide"
style="padding-left:0em;padding-right:0em;">Shopping</span></a></li>
<li class="small lasttab more-tab "><a class="y-mast-link more"
href="http://tools.search.yahoo.com/about/forsearchers.html" ><span
class="tab-cover y-mast-bg-hide">More</span><span
class="y-fp-pg-controls arrow"></span></a></li>
HTML;

$pattern =
'%<a[\s]+[^>]*?href\s*=\s*[\'"]?([^\'"#]*)[\'"]?\s?[^>]*>(.*?)</a>%ims';

preg_match_all($pattern, $html, $matches);

Thanks for your time,
Tommy
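
One possible way to keep the quotes and the fragment out of group 1, sketched under assumptions: this is not the pattern from the message above, and the three sample links and variable names are made up for illustration. It relies on PCRE's branch-reset group (?|...), in which every alternative numbers its capture group from 1, so group 1 holds the bare URL whichever quoting style matched.

```php
<?php
// Hypothetical sample covering unquoted, single-quoted and double-quoted hrefs.
$html = <<<HTML
<a href=/plain#frag>unquoted</a>
<a class='link' href='/single?par=val#frag' title='t'>single</a>
<a class="link" href="/double" title="sample link"><img src="/i.jpg">double</a>
HTML;

$pattern = '%<a\s[^>]*?href\s*=\s*'
    . '(?|"([^"#]*)[^"]*"'      // double-quoted value; any fragment stays outside group 1
    . "|'([^'#]*)[^']*'"        // single-quoted value
    . '|([^\s>#\'"]*)[^\s>]*)'  // unquoted value, ended by whitespace or >
    . '[^>]*>(.*?)</a>%is';     // rest of the tag, then the link content as group 2

preg_match_all($pattern, $html, $matches);

print_r($matches[1]); // URLs without quotes or fragments
print_r($matches[2]); // link content, nested tags included
```

It still shares the general weakness of regex-based HTML handling (comments, unquoted values containing spaces), so treat it as a starting point rather than a robust solution.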
