Hello again :)

I'm looking at the existing options for scraping web pages. I came across
this:

http://re-factor.blogspot.nl/2014/04/scraping-re-factor.html

but that works with JSON output, and I'm looking at pages that only have
HTML. I see there's parse-html and scrape-html to parse a URL into a
vector, which seems to be an HTML tree flattened into an (event) stream.
I'm left wondering about that choice, as it looks unusual to me, but I
found a bunch of words that work with the output in
html.parser.analyzer. I've fiddled around with them and managed to
extract some of the components I was looking for.

So now I'm wondering: is there anything else I've missed? Is there
something that parses HTML into a tree structure? Is there some simpler
DSL to extract data? The common ones I run into are XPath and CSS
selectors, which are short and to the point, but I'm fine with whatever
is easy enough and has the same power. So basically I'm just looking for
more tips or options in case I missed something. You guys have a lot of
vocabs :)

-- 
------------
   Peter Nagy
------------

------------------------------------------------------------------------------
_______________________________________________
Factor-talk mailing list
Factor-talk@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/factor-talk
