I use the built-in PHP DOM library to convert the page at a URL into "emailable" HTML. I fetch the URL, load it into a DOM structure and then do some magic with all the href, src, etc. attributes. DOM is slow, but it's much easier to manipulate than regular expressions. Still, I agree there are lots of cases where a regexp will be much more suitable for the job than DOM manipulation. Regarding valid HTML, you can set $xmldoc->strictErrorChecking = false; ($xmldoc is your DOM object) and it will parse most HTML, although I wouldn't trust it 100%.
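Here is a rough sketch of that idea, not my actual code. It assumes PHP 7 or later, and the function names makeEmailable() and absolutize() are made up for illustration. The URL resolution is deliberately naive (it ignores ports, query strings and ../ segments), and I've added libxml_use_internal_errors() on top of strictErrorChecking to keep the parser warnings quiet on messy markup.

```php
<?php
// Hypothetical helper: turn a relative link into an absolute one.
// Deliberately simplified: ignores ports, "../" segments and <base> tags.
function absolutize(string $url, string $base): string
{
    // Leave absolute URLs, protocol-relative URLs, mailto: links and anchors alone.
    if ($url === '' || preg_match('~^([a-z][a-z0-9+.-]*:|//|#)~i', $url)) {
        return $url;
    }
    $parts = parse_url($base);
    $root  = $parts['scheme'] . '://' . $parts['host'];
    if ($url[0] === '/') {
        return $root . $url;                      // root-relative link
    }
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $root . $dir . $url;                   // path-relative link
}

// Hypothetical function: rewrite href/src attributes so the HTML works in an email.
function makeEmailable(string $html, string $base): string
{
    $xmldoc = new DOMDocument();
    $xmldoc->strictErrorChecking = false;

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $xmldoc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Tag => attribute pairs to rewrite; extend as needed.
    foreach (['a' => 'href', 'img' => 'src', 'link' => 'href'] as $tag => $attr) {
        foreach ($xmldoc->getElementsByTagName($tag) as $node) {
            if ($node->hasAttribute($attr)) {
                $node->setAttribute($attr, absolutize($node->getAttribute($attr), $base));
            }
        }
    }
    return $xmldoc->saveHTML();
}
```

You would call it with the fetched markup and the URL it came from, e.g. makeEmailable($html, 'http://example.com/news/index.html'). The same loop is where you would handle other attributes (form actions, background images and so on).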
Matias Gertel
Freelance Web Development & Coding
e: [email protected]
m: +64 21 288 8840
p: +64 9 838 3367

On 16/07/2009, at 4:38 PM, Boyd wrote:

DOM (http://nz.php.net/manual/en/book.dom.php) and everything else that comes under XML Manipulation (http://nz.php.net/manual/en/refs.xml.php)

On Jul 16, 4:32 pm, ctx2002 <[email protected]> wrote:
> Jochen was posted a question about use regex to extract information
> from HTML page.
>
> as every one can see, the regex is not easy to read and understand.
>
> I was thinking why not use xslt to process HTML file? PHP 5 has good
> support for xslt processor.
>
> only extra step we need is to use HTML tidy program to make HTML page
> "xml well form".
>
> for me, xsl file is easier to understand then regex expression.
>
> are there other way/tools to extra information from HTML without use
> regex?

NZ PHP Users Group: http://groups.google.com/group/nzphpug
