Chris Angus <[EMAIL PROTECTED]> writes:

> I was wondering if anyone had thought of making a SAX-like
> interface based on lazy evaluation, where tokens are
> processed and taken from a (potentially) infinite stream

Sure.  While barely able to follow discussions about monadic parser
combinators, and with a background in numerical analysis, I went ahead 
and rolled my own non-validating XML parser nonetheless. (I needed to
rip some information from web pages, and this seemed like the most fun
way to do it :-)

Basically, I abandoned the traditional approach (returning a list
of parses along with the rest of the stream), since I couldn't get
the result lazy enough, and it's probably inefficient for other
reasons as well (see, e.g., the monads-vs-arrows paper by John Hughes).
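The "traditional way" above refers to the classic list-of-successes
combinator style; here is a generic sketch of it (my reconstruction, not
the original code, and `item`/`andThen` are illustrative names).  Because a
parser can only hand back its result paired with the *rest of the input*,
the result for a whole element becomes available only after the matching
end tag has been consumed:

```haskell
-- List-of-successes parser: every way of splitting the input into a
-- result and the remaining stream.
type Parser a = String -> [(a, String)]

-- Consume a single character.
item :: Parser Char
item []       = []
item (c : cs) = [(c, cs)]

-- Sequencing: q can only run once p has produced its remainder, so a
-- consumer sees nothing of a large construct until it is fully parsed.
andThen :: Parser a -> Parser b -> Parser (a, b)
andThen p q inp =
  [ ((x, y), rest') | (x, rest) <- p inp, (y, rest') <- q rest ]
```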

What I ended up with was three layers: a tokenizer that takes a
character stream and turns it into tokens (i.e. STAGO, TAGC [SGML
lingo], and so on, or plain characters); then a "tagizer" that
recognizes the different types of tags (i.e. STAG, ETAG, comments,
CDATA sections, what have you) - which is basically the level SAX
operates at, IIRC; and then a top layer, providing elements.  Here's
my parser:

        readXML = xmlize . tagize . tokenize

:-)  All the steps are as lazy as I can get them, so that

        readXML "<xml><foo>bar"

will return

        Element "xml" [Element "foo" [PCDATA "bar"

before giving an error.  (Again IIRC, it's been shelved for a while now.)
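
To make the idea concrete, here is a minimal reconstruction of such a
three-layer lazy pipeline - my own sketch, not the original code, so the
names (`Token`, `Tag`, `breakETag`, etc.) and the simplifications
(no attributes, comments, or CDATA; '/' and '>' are re-joined into text
outside tags) are mine.  The point is that every stage produces its
output constructors before it has seen the input that follows them:

```haskell
-- Layer 1 output: markup delimiters or plain characters.
data Token = Open | Close | Slash | Ch Char        -- STAGO, TAGC, ...

-- Layer 2 output: start tags, end tags, runs of character data.
data Tag = STag String | ETag String | Chars String

-- Layer 3 output: the element tree.
data XML = Element String [XML] | PCDATA String
  deriving Show

tokenize :: String -> [Token]
tokenize []         = []
tokenize ('<' : cs) = Open  : tokenize cs
tokenize ('>' : cs) = Close : tokenize cs
tokenize ('/' : cs) = Slash : tokenize cs
tokenize (c   : cs) = Ch c  : tokenize cs

-- Each cons cell is emitted before the rest of the input is examined.
tagize :: [Token] -> [Tag]
tagize []                  = []
tagize (Open : Slash : ts) = let (n, r) = tagName ts in ETag n : tagize r
tagize (Open : ts)         = let (n, r) = tagName ts in STag n : tagize r
tagize ts                  = let (s, r) = charRun ts in Chars s : tagize r

tagName :: [Token] -> (String, [Token])
tagName (Close : ts) = ("", ts)
tagName (Ch c : ts)  = let (n, r) = tagName ts in (c : n, r)
tagName _            = error "unterminated or malformed tag"

-- Collect text up to the next '<' (or end of input), turning stray
-- Close/Slash tokens back into ordinary characters.
charRun :: [Token] -> (String, [Token])
charRun (Ch c  : ts) = let (s, r) = charRun ts in (c   : s, r)
charRun (Close : ts) = let (s, r) = charRun ts in ('>' : s, r)
charRun (Slash : ts) = let (s, r) = charRun ts in ('/' : s, r)
charRun ts           = ("", ts)

-- Build the tree; the Element constructor appears as soon as the start
-- tag is seen, with the children computed lazily behind it.
xmlize :: [Tag] -> [XML]
xmlize []             = []
xmlize (STag n : ts)  = Element n (xmlize inside) : xmlize after
  where (inside, after) = breakETag n 0 ts
xmlize (Chars s : ts) = PCDATA s : xmlize ts
xmlize (ETag n : _)   = error ("unexpected </" ++ n ++ ">")

-- Lazily split the tag stream at the matching end tag, tracking
-- nesting depth; the error only fires when forced past available input.
breakETag :: String -> Int -> [Tag] -> ([Tag], [Tag])
breakETag n _ [] = error ("missing </" ++ n ++ ">")
breakETag n d (t : ts) =
  case t of
    ETag m | m == n, d == 0 -> ([], ts)
           | m == n         -> keep (d - 1)
    STag m | m == n         -> keep (d + 1)
    _                       -> keep d
  where
    keep d' = let (i, a) = breakETag n d' ts in (t : i, a)

readXML :: String -> [XML]
readXML = xmlize . tagize . tokenize
```

With this sketch, `readXML "<xml><foo>bar"` really does yield
`Element "xml"`, then `Element "foo"`, then `PCDATA "bar"` on demand,
and the "missing end tag" error surfaces only if you force the tree
past that point.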

Anyway, for HTML, which is invariably produced with hair-raisingly
broken tools, I had to substitute and interleave other functions, so
it's not quite as clean.  But it worked out all right for my purposes.

(It's probably not general/beautiful/complete enough to be part of a
real library, but if you (or anybody else) want the code, drop me a mail.)

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants
