Re: Crawling existing bulletin boards, representing contents as SIOC?

Paul A Houle Fri, 09 Oct 2009 07:04:03 -0700

Sergio Fernández wrote:
>
> There are many, many technologies (TagSoup in Java, pyquery in Python,
> XSLT, and many others...) that can be deployed by adapting any current
> crawler. But I don't know of any packaged open-source product that fulfills
> your requirements.
>
>   
    A general strategy I like is to run the HTML through HTML Tidy,
converting it to XHTML. Then you can use all kinds of XML tools, such
as XQuery, XSLT, or the DOM, to do your parsing. I've done this in
both Java and PHP and I've had good results. In one project (parsing
all of Slashdot), bad HTML caused structural instability in the XHTML
generated by Tidy, but most of the time this approach works like a charm.



