Well, if there is a lot of scraping involved, you are better off using specialised software like screen-scraper.com. It cuts development time at least in half on average, though it costs money and uses a lot of resources.
The problem with screen scraping using SimpleXML or XSLT is that the contents of a tag do not always correspond to the data field you are after, and these tools are not really designed for dealing with plain text. Consider this: <h1>Product: Dell Vostro 1510</h1>. It would be rather ungainly to extract the manufacturer name, product name and model using XSLT.

Ken

----- Original Message -----
From: "ctx2002" <[email protected]>
To: "NZ PHP Users Group" <[email protected]>
Sent: Thursday, July 16, 2009 4:32 PM
Subject: [phpug] regex and scraping html page

> Jochen posted a question about using regex to extract information
> from an HTML page.
>
> As everyone can see, the regex is not easy to read and understand.
>
> I was thinking, why not use XSLT to process the HTML file? PHP 5 has good
> support for an XSLT processor.
>
> The only extra step we need is to use the HTML Tidy program to make the HTML page
> well-formed XML.
>
> For me, an XSL file is easier to understand than a regex expression.
>
> Are there other ways/tools to extract information from HTML without using
> regex?

NZ PHP Users Group: http://groups.google.com/group/nzphpug
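To make the point about the <h1> heading concrete, here is a minimal sketch in PHP. It assumes the heading always has the layout "Product: <manufacturer> <name> <model>" with single-word fields, which is my assumption and will not hold for real product pages. The function name extractProductFields is hypothetical, and the DOM extension must be enabled. The structural query only gets you as far as the heading text; splitting that text into fields still needs a plain-text step such as a regex.

```php
<?php
// Sketch only: the heading layout and the function name are assumptions
// made for illustration, not part of any library.

/**
 * Returns ['manufacturer' => ..., 'name' => ..., 'model' => ...]
 * or null if the heading does not match the assumed layout.
 */
function extractProductFields($html)
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Step 1: the structural query. This finds the tag, nothing more.
    $xpath = new DOMXPath($doc);
    $heading = trim($xpath->evaluate('string(//h1)'));

    // Step 2: the plain-text step that XPath/XSLT handles poorly.
    $pattern = '/^Product:\s+(?<manufacturer>\S+)\s+(?<name>.+?)\s+(?<model>\S+)$/';
    if (!preg_match($pattern, $heading, $m)) {
        return null;
    }

    return [
        'manufacturer' => $m['manufacturer'],
        'name'         => $m['name'],
        'model'        => $m['model'],
    ];
}

// The heading from the example, with the closing tag written as </h1>.
$html = '<html><body><h1>Product: Dell Vostro 1510</h1></body></html>';
$fields = extractProductFields($html);
print_r($fields);
```

So the DOM or XSLT side stays simple, and the awkward part is isolated in one small pattern that you can adjust per site.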
