[phpug] Re: regex and scraping html page

Matias Gertel Thu, 16 Jul 2009 19:11:16 -0700

I use the built-in PHP DOM library to make URLs "emailable": I fetch a
URL, load it into a DOM structure, and then do some magic with all the
href, src, etc. attributes.
DOM is slow, but manipulating a page through it is much easier than
with regular expressions. Still, I agree there are lots of cases where
a regexp will be much more suitable for the job than DOM manipulation.
Regarding valid HTML, you can set $xmldoc->strictErrorChecking = false;
($xmldoc is your DOM object) and it will parse most HTML, although I
wouldn't trust it 100%.
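
A minimal sketch of the DOM approach described above, under some loud
assumptions: the function name absolutize_links() is hypothetical, and
the "magic" is taken to mean turning relative href/src values into
absolute ones, which may not be what the original code did. It also
calls libxml_use_internal_errors(), which is what keeps loadHTML() from
emitting warnings on sloppy markup; strictErrorChecking is set as well,
as suggested above.

```php
<?php
// Hypothetical helper for illustration only; not the code referred to above.
function absolutize_links(string $html, string $baseUrl): string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->strictErrorChecking = false;
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Visit every element that carries an href or src attribute.
    foreach ($xpath->query('//*[@href or @src]') as $element) {
        foreach (['href', 'src'] as $name) {
            if (!$element->hasAttribute($name)) {
                continue;
            }
            $value = $element->getAttribute($name);
            // Skip values with a scheme (http:, mailto:, ...),
            // protocol-relative URLs, and in-page anchors.
            if (preg_match('~^([a-z][a-z0-9+.-]*:|//|#)~i', $value)) {
                continue;
            }
            // Simplification: assumes $baseUrl is the site root, so this
            // does not resolve "../" or paths relative to a subdirectory.
            $absolute = rtrim($baseUrl, '/') . '/' . ltrim($value, '/');
            $element->setAttribute($name, $absolute);
        }
    }

    return $doc->saveHTML();
}
```

In practice $html would come from something like
file_get_contents($url). Note that saveHTML() returns a full document,
so a fragment comes back wrapped in html and body tags.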


Matias Gertel
Freelance Web Development & Coding
e: [email protected]
m: +64 21 288 8840
p: +64 9 838 3367

On 16/07/2009, at 4:38 PM, Boyd wrote:


DOM (http://nz.php.net/manual/en/book.dom.php) and everything else that
comes under XML Manipulation (http://nz.php.net/manual/en/refs.xml.php)

On Jul 16, 4:32 pm, ctx2002 <[email protected]> wrote:
> Jochen posted a question about using regex to extract information
> from an HTML page.
>
> As everyone can see, the regex is not easy to read or understand.
>
> I was thinking: why not use XSLT to process the HTML file? PHP 5 has
> good support for an XSLT processor.
>
> The only extra step we need is to run the page through HTML Tidy to
> make it well-formed XML.
>
> For me, an XSL file is easier to understand than a regex.
>
> Are there other ways/tools to extract information from HTML without
> using regex?



--~--~---------~--~----~------------~-------~--~----~
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to
[email protected]
-~----------~----~----~----~------~----~------~--~---
