On Mon, Dec 29, 2014 at 03:28:15AM +0100, mfv wrote:
> So far, I have been getting the site with http-client, the raw html to sxml
> with html-parser, and trying to process the resulting list with
> matchable/srfi-13.
I would recommend avoiding that, as it can get really messy.  sxpath is
meant for this sort of thing, but unfortunately it's really difficult to
use IMO.  I somehow always manage to get it working with sxpath when I
need to do some web scraping, but it's somewhat painful.

> I am not sure how much good it will do to use regex on those
> lists.

You can't, in general.  Neither would I recommend this, except perhaps
when parsing the text content (and even then it might fail due to inline
markup).

> Are there any packages like Python's Beautifulsoup in the Chicken
> arsenal?

That sort of thing is sorely lacking.  There's a promising "zipper"
library written by Moritz Heidkamp, but so far it's unreleased and
undocumented.  If you're feeling very adventurous you could have a look
at it: https://bitbucket.org/DerGuteMoritz/zipper

There also used to be an sxml-match egg for CHICKEN 3, but nobody's
bothered to port it to CHICKEN 4 so far.  AFAIK its main advantage was
that it was exactly like "matchable", but document order-insensitive for
attribute nodes.

> ; grab a website
> (define lnk
>   "http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%291521-3773")
> (define raw (with-input-from-request lnk #f read-string))
>
> ;; convert site crawl data from html to sxml
> (define sxml (html->sxml raw))

This can be done directly, without creating an intermediate large
string, by using html->sxml on a port:

(define sxml (call-with-input-request lnk #f html->sxml))

In fact, I didn't even know you could use html->sxml on a string.  This
seems to be an undocumented feature of html-parser :)

Cheers,
Peter
-- 
http://www.more-magic.net

_______________________________________________
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users
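P.S. To make the sxpath suggestion above more concrete, here is a small,
self-contained sketch of querying SXML with sxpath.  The example document
and variable names are made up for illustration, and the egg name is an
assumption (on CHICKEN 4 sxpath was shipped around the sxml-transforms
family of eggs):

```scheme
(use sxpath)  ; CHICKEN 4; on CHICKEN 5 this would be (import sxpath)

;; A toy SXML document, standing in for the output of html->sxml:
(define doc
  '(*TOP*
    (html
     (body
      (h1 "Early View")
      (a (@ (href "/doi/1")) "First article")
      (a (@ (href "/doi/2")) "Second article")))))

;; sxpath compiles a path expression into a procedure over SXML nodes;
;; // means "any descendant", @ steps into the attribute list, and
;; *text* extracts the text content of the matched nodes.
(define all-links (sxpath '(// a)))
(define link-urls (sxpath '(// a @ href *text*)))

(all-links doc)
;; => ((a (@ (href "/doi/1")) "First article")
;;     (a (@ (href "/doi/2")) "Second article"))

(link-urls doc)
;; => ("/doi/1" "/doi/2")
```

Combined with the port-based fetch above, the whole pipeline would then
be something like (link-urls (call-with-input-request lnk #f html->sxml)).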