Hi,

> Well, if there is a lot of scraping involved, you are better off using
> specialised software like screen-scraper.com. It cuts development time
> at least in half on average, though it costs $ and a lot of resources.
>
> The problem with screen scraping using SimpleXML or XSLT is that the
> contents of a tag do not always correspond to the data field you are
> after, and these tools are not really designed for dealing with plain
> text. Consider this: <h1>Product: Dell Vostro 1510</h1>. It would be
> rather ungainly to extract the manufacturer name, product name and
> model using XSLT.
>
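
To make the <h1> example above concrete, here is a minimal sketch of
splitting the tag's text into fields with a regular expression. It is
written in Python for brevity; PCRE, which PHP's preg_match() uses, also
accepts the (?P<name>...) named-group syntax. The function name, the group
names and the assumption that each field is a single whitespace-free token
are illustrative only and do not come from the thread.

```python
import re

# Sample markup taken from the example in the quoted message.
html = "<h1>Product: Dell Vostro 1510</h1>"

# Assumed layout (hypothetical): "Product: <manufacturer> <product> <model>",
# where each field is one token without whitespace.
PRODUCT_HEADING = re.compile(
    r"<h1>\s*Product:\s+"
    r"(?P<manufacturer>\S+)\s+"
    r"(?P<product>\S+)\s+"
    r"(?P<model>\S+)\s*</h1>",
    re.IGNORECASE,
)


def parse_product_heading(markup):
    """Return a dict with the three fields, or None if nothing matches."""
    match = PRODUCT_HEADING.search(markup)
    return match.groupdict() if match else None


fields = parse_product_heading(html)
print(fields)
```

A multi-word product name would break the one-token assumption, so real
pages would need a pattern tuned to the site in question.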
>> Jochen posted a question about using regex to extract information
>> from an HTML page.
>>
>> As everyone can see, the regex is not easy to read and understand.
>>
>> I was thinking, why not use XSLT to process the HTML file? PHP 5 has
>> good support for XSLT processing.
>>
>> The only extra step we need is to run the HTML Tidy program to turn
>> the HTML page into well-formed XML.
>>
>> For me, an XSL file is easier to understand than a regular expression.
>>
>> Are there other ways/tools to extract information from HTML without
>> using regex?
>>

I agree and would like to add:

- I'm using regular expressions a lot to extend the Joomla backend
  interface, mainly because they have not yet reorganised it in a way
  where this can be done without parsing the code. These modules
  (so-called plugins) are called quite a number of times, and I'm not
  sure I would like to invoke an XSL processor every time.

- In my work with Astute/TelstraClear we had to parse HTML where even
  some <td> tags weren't properly closed. Sure, it rendered OK in IE6,
  and that was all that TelstraClear cared about. Nevertheless, the job
  needed to be done.

Generally, if you're using regex to parse HTML you are already on a
compromise path; after all, if the data publishers cared, they would
have an XML API.

HTH,
Jochen

NZ PHP Users Group: http://groups.google.com/group/nzphpug
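
On the unclosed <td> case mentioned above: one regex-free option is an
event-based parser that tolerates tag soup. The sketch below uses
HTMLParser from Python's standard library and treats a new <td>, or the
end of a row or table, as closing the previous cell. The class name, the
helper function and the sample markup are made up for illustration, and
nested tables are not handled.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of each <td>, even when the closing tag is missing."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None  # text pieces of the open cell, or None

    def _close_cell(self):
        # Finish the open cell, if any, and store its stripped text.
        if self._current is not None:
            self.cells.append("".join(self._current).strip())
            self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._close_cell()  # a new <td> implicitly closes the last one
            self._current = []
        elif tag == "tr":
            self._close_cell()

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._close_cell()

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)


def extract_cells(markup):
    """Return the text of every table cell found in the markup."""
    parser = CellCollector()
    parser.feed(markup)
    parser.close()
    parser._close_cell()  # flush a cell left open at end of input
    return parser.cells


# Hypothetical tag soup: the first and third cells are never closed.
broken = "<table><tr><td>Dell<td>Vostro</td><td>1510</tr></table>"
cells = extract_cells(broken)
print(cells)
```

This keeps the extraction logic readable, though for pages that are
called very often the cost of a full parse versus a single regex would
still need measuring.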
