Hi Neil,

> Please send a patch with whatever you come up with, so others can make use
> of it. I've already added Data.HTML.TagSoup.Tree to the latest darcs
> version, which does as well as it can with tag matching, but is
> entirely strict. Having a lazy version would be great.

It's too early for a new release: testing, especially performance testing,
is not yet done. But a first version is in the darcs repository
"http://darcs.fh-wedel.de/hxt/" (the version number is still 7.4).
Those who urgently need a lazier XML parser may try that one.

Usage: call readDocument as usual, but with an extra option:

    readDocument [..., (a_tagsoup, "1")]

> I've been talking to the Java tagsoup author (http://tagsoup.info),
> which does very clever processing of HTML to make it as structured and
> normalised as possible. He said:
>
> > The schema that describes HTML can be found at
> > src/definitions/html.tssl in the source archive; I'll be glad to explain
> > any obscurities in it.
>
> There are also some slides on his website (at the bottom) which detail
> the Java TagSoup approach to reconstructing HTML, and have obviously
> had a lot of thought put into them!

I will have a look into that. Currently the strategy for repairing lousy
HTML is the same as in the Parsec HTML parser, which is equivalent to what
is done in HaXml.

Uwe

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe