[phpug] Re: regex and scraping html page

Ken Golovin Wed, 15 Jul 2009 21:53:09 -0700

Well, if there is a lot of scraping involved, you are better off using 
specialised software like screen-scraper.com. It cuts development time at 
least in half on average, though it costs $ and a lot of resources.


The problem with screen scraping using SimpleXML or XSLT is that the 
contents of a tag do not always correspond to the data field you are after, 
and these tools are not really designed for dealing with plain text. Consider 
this: <h1>Product: Dell Vostro 1510</h1>. It would be rather ungainly to 
extract the manufacturer name, product name and model using XSLT.
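
To illustrate, here is a rough sketch of a hybrid approach: let the DOM 
parser find the tag, then use a small regex only on the plain text inside it. 
The function name parseProductHeading and the assumed heading layout 
("Product: <manufacturer> <name> <model>", with a one-word manufacturer and 
the model as the last word) are my own guesses from the one example above, 
not something a real site guarantees.

```php
<?php
// Hypothetical helper. The "Product: <manufacturer> <name> <model>" layout
// is an assumption based on the single example heading above.
function parseProductHeading($html)
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally so sloppy real-world markup
    // does not flood the output, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Step 1: the DOM locates the tag we care about.
    $heading = $doc->getElementsByTagName('h1')->item(0);
    if ($heading === null) {
        return null;
    }

    // Step 2: a short regex splits the tag's plain text into fields.
    $pattern = '/^Product:\s+(?<manufacturer>\S+)\s+(?<name>.+?)\s+(?<model>\S+)$/';
    if (!preg_match($pattern, trim($heading->textContent), $m)) {
        return null;
    }

    return array(
        'manufacturer' => $m['manufacturer'],
        'name'         => $m['name'],
        'model'        => $m['model'],
    );
}

$fields = parseProductHeading(
    '<html><body><h1>Product: Dell Vostro 1510</h1></body></html>'
);
```

The regex stays short and readable because it only has to deal with one 
line of text, not with the markup around it. A manufacturer name of more 
than one word would break the pattern, so treat it as a starting point only.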

Ken
----- Original Message ----- 
From: "ctx2002" <[email protected]>
To: "NZ PHP Users Group" <[email protected]>
Sent: Thursday, July 16, 2009 4:32 PM
Subject: [phpug] regex and scraping html page


>
> Jochen posted a question about using regex to extract information
> from an HTML page.
>
> As everyone can see, the regex is not easy to read and understand.
>
> I was thinking, why not use XSLT to process the HTML file? PHP 5 has good
> support for an XSLT processor.
>
> The only extra step we need is to run an HTML tidy program to make the
> HTML page well-formed XML.
>
> For me, an XSL file is easier to understand than a regex.
>
> Are there other ways/tools to extract information from HTML without
> using regex?
>
>



--~--~---------~--~----~------------~-------~--~----~
NZ PHP Users Group: http://groups.google.com/group/nzphpug
To post, send email to [email protected]
To unsubscribe, send email to
[email protected]
-~----------~----~----~----~------~----~------~--~---
