Sergio Fernández wrote:
>
> There are many technologies (TagSoup in Java, pyquery in Python,
> XSLT, and many others) that can be deployed by adapting any existing
> crawler. But I don't know of any packaged open-source product that
> fulfills your requirements.
>
>   
A general strategy I like is to run the HTML through HTML Tidy,
converting it to XHTML. Then you can use all kinds of XML tools, such
as XQuery, XSLT, or the DOM, to do your parsing. I've done this in
both Java and PHP with good results. In one project (parsing all of
Slashdot), bad HTML caused structural instability in the XHTML
generated by Tidy, but most of the time this approach works like a charm.
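
To make the second half of that concrete, here is a minimal sketch in Python (not the Java or PHP code I actually used) of querying Tidy's output with a stock XML parser. It assumes the page has already been converted, for example with something like `tidy -asxhtml -numeric page.html`, so that the result is well-formed and uses numeric rather than named entities. The `TIDIED` sample and the `extract_links` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in the XHTML namespace,
# so queries have to be namespace-aware.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical stand-in for what Tidy might emit for a small page.
TIDIED = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example post</title></head>
  <body>
    <p>See <a href="http://example.org/a">first</a> and
       <a href="http://example.org/b">second</a>.</p>
  </body>
</html>"""


def extract_links(xhtml_text):
    """Return (href, link text) pairs for every <a href=...> in the document."""
    root = ET.fromstring(xhtml_text)
    ns = {"x": XHTML_NS}
    return [
        (a.get("href"), "".join(a.itertext()).strip())
        for a in root.iterfind(".//x:a[@href]", ns)
    ]


links = extract_links(TIDIED)
for href, text in links:
    print(href, text)
```

The same idea carries over to XSLT or XQuery: once the input is well-formed XHTML, the only thing to watch for is the namespace on every element.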

