On Mon, Dec 29, 2014 at 03:28:15AM +0100, mfv wrote:
> So far, I have been getting the site with http-client, the raw html to sxml
> with html-parser, and trying to process the resulting list with
> matchable/srfi-13.
I would recommend avoiding that, as it can get really messy.  sxpath is
meant for this sort of thing, but unfortunately it's really difficult to
use IMO.  I somehow always manage to get it working with sxpath when I
need to do some web scraping, but it's somewhat painful.

> I am not sure how much good it will do to use regex on those
> lists.

You can't, in general.  Neither would I recommend this, except perhaps
when parsing the text content (and even then it might fail due to inline
markup).

> Are there any packages like Python's Beautifulsoup in the Chicken
> arsenal?

That sort of thing is sorely lacking.  There's a promising "zipper"
library written by Moritz Heidkamp, but so far it's unreleased and
undocumented.  If you're feeling very adventurous you could have a look
at it: https://bitbucket.org/DerGuteMoritz/zipper

There also used to be an sxml-match egg for CHICKEN 3, but nobody's
bothered to port it to CHICKEN 4 so far.  AFAIK its main advantage was
that it was exactly like "matchable", but document order-insensitive for
attribute nodes.

> ; grab a website
> (define lnk
>   "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
> (define raw (with-input-from-request lnk #f read-string))
>
> ;; convert site crawl data from html to sxml
> (define sxml (html->sxml raw))

This can be done directly, without creating an intermediate large
string, by using html->sxml on a port:

(define sxml (call-with-input-request lnk #f html->sxml))

In fact, I didn't even know you could use html->sxml on a string.  This
seems to be an undocumented feature of html-parser :)

Cheers,
Peter
-- 
http://www.more-magic.net

_______________________________________________
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users
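P.S. To make the sxpath suggestion above more concrete, here is a small,
self-contained sketch of querying SXML with sxpath.  The example document
and variable names are made up for illustration, and the egg name is an
assumption (on CHICKEN 4 sxpath was shipped around the sxml-transforms
family of eggs):

```scheme
(use sxpath)  ; CHICKEN 4; on CHICKEN 5 this would be (import sxpath)

;; A toy SXML document, standing in for the output of html->sxml:
(define doc
  '(*TOP*
    (html
     (body
      (h1 "Early View")
      (a (@ (href "/doi/1")) "First article")
      (a (@ (href "/doi/2")) "Second article")))))

;; sxpath compiles a path expression into a procedure over SXML nodes;
;; // means "any descendant", @ steps into the attribute list, and
;; *text* extracts the text content of the matched nodes.
(define all-links (sxpath '(// a)))
(define link-urls (sxpath '(// a @ href *text*)))

(all-links doc)
;; => ((a (@ (href "/doi/1")) "First article")
;;     (a (@ (href "/doi/2")) "Second article"))

(link-urls doc)
;; => ("/doi/1" "/doi/2")
```

Combined with the port-based fetch above, the whole pipeline would then
be something like (link-urls (call-with-input-request lnk #f html->sxml)).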