I found Ruby's Hpricot package quite handy for its ability to query an HTML document via XPath expressions. I'm not sure how well it handles bad markup, though.
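
To make the idea concrete, here is a rough sketch of the same workflow (parse sloppy HTML leniently, then query it with XPath-style expressions). It is not Hpricot and not HTML Tidy: it is a stand-in written against Python's standard library only, and the names TolerantTreeParser, parse_html, VOID_TAGS and SLOPPY_HTML are made up for illustration. Note that ElementTree supports only a subset of XPath, and the repair rule used here (close whatever is still open) is much cruder than what a real tidying tool does.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TolerantTreeParser(HTMLParser):
    """Hypothetical helper: builds an ElementTree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.builder = ET.TreeBuilder()
        # Synthetic root so stray top-level text or several top-level
        # elements still end up in a single well-formed tree.
        self.builder.start("document", {})
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. <input disabled>) arrive as None.
        clean = {k: (v if v is not None else "") for k, v in attrs}
        self.builder.start(tag, clean)
        if tag in VOID_TAGS:
            self.builder.end(tag)
        else:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: ignore it
        # Close every element left open inside the one being closed.
        while self.open_tags:
            top = self.open_tags.pop()
            self.builder.end(top)
            if top == tag:
                break

    def handle_data(self, data):
        self.builder.data(data)

    def tree(self):
        self.close()  # flush any text HTMLParser is still buffering
        while self.open_tags:
            self.builder.end(self.open_tags.pop())
        self.builder.end("document")
        return self.builder.close()


def parse_html(text):
    parser = TolerantTreeParser()
    parser.feed(text)
    return parser.tree()


# Assumed sample input: unclosed <p> tags and one unclosed <div>.
SLOPPY_HTML = """
<html><body>
<div class="post"><p>First post<br>
<a href="/user/seb">seb</a>
<div class="post"><p>Second post
<a href="/user/paul">paul</a></div>
</body></html>
"""

root = parse_html(SLOPPY_HTML)
posts = root.findall(".//div[@class='post']")
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]
print(len(posts), hrefs)
```

With this crude repair the second div ends up nested inside the first one's paragraph, which is the kind of structural surprise bad markup tends to cause, so queries that depend on exact nesting are fragile; descendant queries like the two above are more forgiving.
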
/seb

On 9 Oct, 16:03, Paul A Houle wrote:
> Sergio Fernández wrote:
>
> > There are many many technologies (TagSoup in Java, pyquery in python,
> > XSLT or many others...) that can be deployed adapting any current
> > crawler. But I don't know any packaged open-source product that fulfills
> > your requirements.
>
> A general strategy I like is to run HTML through HTML Tidy,
> converting it to XHTML. Then you can use all kinds of XML tools, such
> as XQuery, XSLT, or the DOM to do your parsing. I've done this in
> both Java and PHP and I've had good results. In one project (parsing
> all of Slashdot) bad HTML caused structural instability in the XHTML
> generated by Tidy, but most of the time this approach works like a charm.
