Sergio Fernández wrote:
>
> There are many technologies (TagSoup in Java, pyquery in Python,
> XSLT, and many others) that can be deployed by adapting any current
> crawler. But I don't know of any packaged open-source product that
> fulfills your requirements.
>
>
A general strategy I like is to run the HTML through HTML Tidy to
convert it to XHTML. After that you can use any XML tool, such as
XQuery, XSLT, or the DOM, to do the parsing. I've done this in both
Java and PHP with good results. In one project (parsing all of
Slashdot), bad HTML made the structure of the XHTML generated by Tidy
unstable, but most of the time this approach works like a charm.
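
For what it's worth, here is a rough Python sketch of that pipeline (not
the Java or PHP code I actually used). It assumes the `tidy` command-line
tool is installed and on the PATH. The function names tidy_to_xhtml and
extract_links are made up for illustration, and pulling out links is just
one example of what you can do once the input is well-formed XML.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in the XHTML namespace.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}


def tidy_to_xhtml(html):
    """Convert messy HTML to XHTML with the `tidy` CLI (assumed on PATH)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    proc = subprocess.run(
        # -asxhtml: emit XHTML; -numeric: numeric entities instead of
        # named ones like &nbsp;, which a plain XML parser would reject.
        ["tidy", "-q", "-asxhtml", "-numeric", "-utf8",
         "--show-warnings", "no"],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 1 when it only had warnings and 2 on errors.
    if proc.returncode >= 2:
        raise RuntimeError(proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")


def extract_links(xhtml):
    """Return (href, link text) pairs from an XHTML document string."""
    root = ET.fromstring(xhtml)
    links = []
    for a in root.iterfind(".//x:a[@href]", XHTML_NS):
        text = "".join(a.itertext()).strip()
        links.append((a.get("href"), text))
    return links
```

Typical use would be extract_links(tidy_to_xhtml(raw_html)). If Tidy's
output shifts structurally from page to page, as it did for me with
Slashdot, queries that match on element name and attributes like the one
above tend to hold up better than ones that depend on exact nesting.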