Hi Joachim,

I have been playing around with this idea myself in TagSoup
(http://community.haskell.org/~ndm/tagsoup). The largest conceptual problem I
came across was that TagSoup decodes entities (e.g. &gt; becomes >), so the
decoded text can no longer be a slice of the input. However, I think that's a
minor issue, and entity resolution can be turned off in TagSoup to guarantee
copy-free parsing.
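To make the entity point concrete, here is a toy example (parseTags and the
Tag type are the real Text.HTML.TagSoup API, and this assumes a tagsoup
version that provides the StringLike instance for strict ByteString; the rest
is just illustration):

-- Entity decoding forces fresh text: the "&gt;" in the input is rewritten
-- to ">", so the resulting TagText cannot be a zero-copy slice of the
-- original buffer.
import Text.HTML.TagSoup (Tag(..), parseTags)
import qualified Data.ByteString.Char8 as B

main :: IO ()
main = mapM_ print (parseTags input :: [Tag B.ByteString])
  where
    input = B.pack "<p>1 &gt; 0</p>"
    -- prints TagOpen "p" [], TagText "1 > 0", TagClose "p";
    -- "1 > 0" is newly built text, not a substring of 'input'.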
The practical problem is that writing an HTML parser is hard, and that
writing it in a way that works across String, strict ByteString and lazy
ByteString, and gets the copy-free behaviour in the right places, is tricky.
I am still working on program optimisation techniques that I think will make
this feasible (based around the idea of supercompilation, to expose the
underlying structure of the parser without having to write it in too painful
a manner). So with any luck your copy-free XML parser will happen sooner or
later.

Thanks, Neil

On Fri, May 14, 2010 at 8:20 PM, Joachim Breitner <m...@joachim-breitner.de> wrote:
> Hi,
>
> On Friday, 14.05.2010, at 15:31 -0300, Felipe Lessa wrote:
>> On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
>> > Additionally, since the original bytestring is shared in your types,
>> > potentially very large buffers could be locked in memory due to
>> > references held by only a small portion of the document. Chopping a
>> > document up into events or nodes creates some overhead due to the
>> > extra pointers, but allows unneeded portions to be freed.
>>
>> However, if your bytestring comes from mmap'ed memory this
>> drawback wouldn't apply :D.
>
> Exactly. Of course such a library would not be a general-purpose tool,
> but in cases where you know that you need most of the document for most
> of the time, e.g. when doing statistics on it, this would be acceptable.
>
> Also note that even after chopping into nodes, if you don’t make sure
> you drop the reference to the root in a timely manner, the same thing
> would happen.
>
> On Friday, 14.05.2010, at 08:57 -0700, John Millikin wrote:
>> The primary problem I see with this is that XML content is
>> fundamentally text, not bytes. Using your types, two XML documents
>> with identical content but different encodings will have different
>> Haskell values (and thus be incorrect regarding Eq, Ord, etc.).
>
> The instances could be adapted... but this will be expensive, of course.
>
> One could also convert documents that are not utf-8 encoded as a whole
> and then work on that.
>
>> If you'd like memory-efficient text storage, using Bryan O'Sullivan's
>> "text" package[1] is probably the best option. It uses packed Word16
>> buffers to store text as UTF-16. Probably not as efficient as a type
>> backed by UTF-8, but it's much, much better than String.
>
> Right. For arbtt, I tried to switch from String to text, and it actually
> got slower. The reason (I think) was that besides passing strings
> around, it mainly runs pcre-light on them – which wants utf8-encoded
> bytestrings.
>
> I ended up creating a newtype¹ around utf8-encoded ByteStrings and the
> result was quite satisfying, both memory- and runtime-wise. I wish we
> had a package providing a standard type for this that would become
> similarly popular.
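Interjecting on that last point: the core of such a wrapper really is only a
few lines. The sketch below is purely illustrative (it is not Joachim's
actual Data.MyText, and it leans on the utf8-string package for the String
conversions):

-- A newtype over a ByteString that is, by convention, always valid UTF-8.
module MyText (MyText, pack, unpack, toByteString, fromByteString) where

import qualified Data.ByteString as B
import qualified Data.ByteString.UTF8 as UTF8   -- from the utf8-string package

-- Deriving Eq/Ord on the bytes is sound: byte-wise comparison of valid
-- UTF-8 agrees with comparison of the decoded code points.
newtype MyText = MyText { toByteString :: B.ByteString }
    deriving (Eq, Ord)

instance Show MyText where
    show = show . unpack

pack :: String -> MyText
pack = MyText . UTF8.fromString

unpack :: MyText -> String
unpack = UTF8.toString . toByteString

-- Handy when the bytes are already known to be UTF-8, e.g. read straight
-- from a file, or when handing the text to pcre-light without re-encoding.
fromByteString :: B.ByteString -> MyText
fromByteString = MyText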
> There is at least one more package on hackage that
> defines this type:
> http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Text-Regex-TDFA-UTF8.html
>
> Greetings,
> Joachim
>
> ¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs
>
> --
> Joachim Breitner
> e-Mail: m...@joachim-breitner.de
> Homepage: http://www.joachim-breitner.de
> ICQ#: 74513189
> Jabber-ID: nome...@joachim-breitner.de

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe