Re: Crawling existing bulletin boards, representing contents as SIOC?

seb Fri, 09 Oct 2009 07:31:18 -0700

I found Ruby's Hpricot package quite handy, since it can query an HTML
document via XPath expressions.
Not sure how well it handles bad markup, though.


/seb

On 9 Oct, 16:03, Paul A Houle <[email protected]> wrote:
> Sergio Fernández wrote:
>
> > There are many technologies (TagSoup in Java, pyquery in Python,
> > XSLT, and many others) that can be deployed by adapting any existing
> > crawler. But I don't know of any packaged open-source product that
> > fulfills your requirements.
>
> A general strategy I like is to run HTML through HTML Tidy,
> converting it to XHTML. Then you can use all kinds of XML tools, such
> as XQuery, XSLT, or the DOM to do your parsing. I've done this in
> both Java and PHP and I've had good results. In one project (parsing
> all of Slashdot), bad HTML caused structural instability in the XHTML
> generated by Tidy, but most of the time this approach works like a charm.
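
For illustration, here is a minimal sketch of the second half of that strategy: querying already-tidied XHTML with XPath-style expressions. It assumes HTML Tidy has already turned the page into well-formed XHTML. The sample page, the class names (post, author, body) and the extract_posts helper are all made up for the example and do not come from any real bulletin board. It uses only Python's standard xml.etree.ElementTree, which supports a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in the XHTML namespace,
# so the queries below need a prefix mapped to it.
XHTML_NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical Tidy output for one forum page (markup invented for this sketch).
TIDIED_PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Example board</title></head>
<body>
<div class="post"><span class="author">alice</span><p class="body">First post</p></div>
<div class="post"><span class="author">bob</span><p class="body">A reply</p></div>
</body>
</html>"""


def extract_posts(xhtml_text):
    """Return one dict per post found in a tidied XHTML page."""
    root = ET.fromstring(xhtml_text)
    posts = []
    # ElementTree understands simple attribute predicates like [@class='post'].
    for div in root.findall(".//x:div[@class='post']", XHTML_NS):
        author = div.find("x:span[@class='author']", XHTML_NS)
        body = div.find("x:p[@class='body']", XHTML_NS)
        posts.append({
            "author": author.text if author is not None else None,
            "body": body.text if body is not None else None,
        })
    return posts


posts = extract_posts(TIDIED_PAGE)
for post in posts:
    print(post["author"], "wrote:", post["body"])
```

Each extracted dict could then be mapped onto SIOC terms (for example, one sioc:Post per entry). The parse step will raise an error on markup that is not well-formed, which is why the Tidy pass has to come first.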