Chris Angus <[EMAIL PROTECTED]> writes:
> I was wondering if anyone had thought of making a Sax-like
> interface based on lazy evaluation. where tokens are
> processed and taken from a (potentially) infinite stream
Sure. While barely able to follow discussions about monadic parser
combinators, and with a background in numerical analysis, I went ahead
and rolled my own non-validating XML parser nonetheless. (I needed to
rip some information from web pages, and this seemed like the most fun
way to do it :-)
Basically, I abandoned doing it the traditional way (returning a list
of parses along with the rest of the stream), since I didn't get
the result lazy enough, and it's probably inefficient for other
reasons (see e.g. John Hughes's paper "Generalising Monads to Arrows").
What I ended up with was three layers: a tokenizer that takes a
character stream and turns it into tokens (i.e. STAGO, TAGC [SGML
lingo], and so on, or plain characters); then a "tagizer" that
recognizes the different types of tags (i.e. STAG, ETAG, comments,
CDATA sections, what have you) - and that's basically where SAX
sits, IIRC; and then a top layer, providing elements. Here's my parser:
readXML = xmlize . tagize . tokenize
:-) All the steps are as lazy as I can get them, so that
readXML "<xml><foo>bar"
will return Element "xml" [Element "foo" [PCDATA "bar" - i.e. as much
of the tree as could be built - before giving an error. (Again IIRC,
it's been shelved for a while now.)
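For the curious, here's a minimal sketch of the three-layer idea - not
the shelved code itself; the token and tag names follow the SGML terms
above, and attributes, comments, CDATA, and error reporting are all
omitted:

```haskell
import Data.Char (isSpace)

-- Layer 1: tokens. Only STAGO ('<') and TAGC ('>') are recognized;
-- everything else passes through as a plain character.
data Token = STAGO | TAGC | Ch Char deriving (Eq, Show)

tokenize :: String -> [Token]
tokenize = map tok
  where tok '<' = STAGO
        tok '>' = TAGC
        tok c   = Ch c

-- Layer 2: tags. Start tags, end tags, and character data;
-- roughly the level at which SAX reports events.
data Tag = STag String | ETag String | Text Char deriving (Eq, Show)

tagize :: [Token] -> [Tag]
tagize (STAGO : Ch '/' : ts) = let (n, r) = tagName ts in ETag n : tagize r
tagize (STAGO : ts)          = let (n, r) = tagName ts in STag n : tagize r
tagize (Ch c  : ts)          = Text c : tagize ts
tagize _                     = []

-- Read a tag name up to TAGC (attributes are not handled here).
tagName :: [Token] -> (String, [Token])
tagName (Ch c : ts) | not (isSpace c) = let (n, r) = tagName ts in (c:n, r)
tagName (TAGC : ts) = ("", ts)
tagName ts          = ("", ts)

-- Layer 3: elements.
data XML = Element String [XML] | PCDATA String deriving (Eq, Show)

xmlize :: [Tag] -> [XML]
xmlize = fst . elems
  where
    -- elems returns the elements up to the enclosing ETag,
    -- plus the unconsumed tags; both are produced lazily.
    elems (STag n : ts) = let (kids, rest)  = elems ts
                              (sibs, rest') = elems rest
                          in (Element n kids : sibs, rest')
    elems (ETag _ : ts) = ([], ts)
    elems ts@(Text _ : _) =
        let (cs, ts')    = span isText ts
            (sibs, rest) = elems ts'
        in (PCDATA [c | Text c <- cs] : sibs, rest)
    elems []            = ([], [])
    isText (Text _) = True
    isText _        = False

readXML :: String -> [XML]
readXML = xmlize . tagize . tokenize
```

With this sketch, readXML "<xml><foo>bar" yields the whole partial tree
(the sketch quietly stops at end of input instead of erroring), and
because every layer is lazy, take 2 (readXML (cycle "<a>b</a>")) returns
two elements from an infinite stream without diverging.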
Anyway, for HTML, which is invariably produced with hair-raisingly
broken tools, I had to substitute and interleave other functions, so
it's not quite as clean. But it worked out all right for my purposes.
(It's probably not general/beautiful/complete enough to be part of a
real library, but if you (or anybody) wants the code, drop me a mail)
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants