Hi,

On Friday, 14 May 2010, at 15:31 -0300, Felipe Lessa wrote:
> On Fri, May 14, 2010 at 08:57:42AM -0700, John Millikin wrote:
> > Additionally, since the original bytestring is shared in your types,
> > potentially very large buffers could be locked in memory due to
> > references held by only a small portion of the document. Chopping a
> > document up into events or nodes creates some overhead due to the
> > extra pointers, but allows unneeded portions to be freed.
>
> However, if your bytestring comes from mmap'ed memory this
> drawback wouldn't apply :D.

Exactly. Of course such a library would not be a general-purpose tool,
but in cases where you know that you need most of the document most of
the time, e.g. when doing statistics on it, this would be acceptable.

Also note that even after chopping into nodes, if you don’t make sure
to drop the reference to the root in a timely manner, the same thing
will happen.

On Friday, 14 May 2010, at 08:57 -0700, John Millikin wrote:
> The primary problem I see with this is that XML content is
> fundamentally text, not bytes. Using your types, two XML documents
> with identical content but different encodings will have different
> Haskell values (and thus be incorrect regarding Eq, Ord, etc).

The instances could be adapted... but this will be expensive, of
course. One could also convert documents that are not UTF-8 encoded as
a whole up front and then work on the result.

> If you'd like memory-efficient text storage, using Bryan O'Sullivan's
> "text" package[1] is probably the best option. It uses packed Word16
> buffers to store text as UTF-16. Probably not as efficient as a type
> backed by UTF-8, but it's much much better than String.

Right. For arbtt, I tried to switch from String to text, and it
actually got slower. The reason (I think) was that besides passing
strings around, it mainly runs pcre-light on them, which wants
UTF-8-encoded bytestrings.

I ended up creating a newtype¹ around UTF-8-encoded ByteStrings, and
the result was quite satisfying, both memory- and runtime-wise. I wish
we had a package providing a standard type for this that would become
similarly popular. There is at least one more package on Hackage that
defines this type:
http://hackage.haskell.org/packages/archive/regex-tdfa-utf8/1.0/doc/html/Text-Regex-TDFA-UTF8.html

Greetings,
Joachim

¹ http://darcs.nomeata.de/arbtt/src/Data/MyText.hs

-- 
Joachim Breitner
  e-Mail: m...@joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  ICQ#: 74513189
  Jabber-ID: nome...@joachim-breitner.de
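
For illustration, here is a minimal sketch of the kind of newtype described above: text kept as a UTF-8 encoded strict ByteString, so the raw bytes can be handed to a library that expects UTF-8 without re-encoding. This is not the code from arbtt's Data/MyText.hs; the names Utf8Text, fromUtf8Bytes, toUtf8Bytes, packUtf8 and unpackUtf8 are hypothetical, and the sketch assumes only the bytestring and text packages.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical type: text stored as a UTF-8 encoded strict ByteString.
-- Invariant (maintained by the constructors below): the bytes are valid UTF-8.
-- For valid UTF-8, byte-wise Eq and Ord agree with comparison by code point,
-- so the derived instances are meaningful for text.
newtype Utf8Text = Utf8Text B.ByteString
  deriving (Eq, Ord)

-- | Wrap existing bytes, rejecting input that is not valid UTF-8.
-- The bytes themselves are not copied; decoding is used only to validate.
fromUtf8Bytes :: B.ByteString -> Maybe Utf8Text
fromUtf8Bytes bs = case TE.decodeUtf8' bs of
  Left _  -> Nothing
  Right _ -> Just (Utf8Text bs)

-- | Access the underlying bytes without conversion, e.g. to pass them to a
-- regex binding that wants UTF-8 encoded ByteStrings.
toUtf8Bytes :: Utf8Text -> B.ByteString
toUtf8Bytes (Utf8Text bs) = bs

-- | Build from a String by encoding it as UTF-8.
packUtf8 :: String -> Utf8Text
packUtf8 = Utf8Text . TE.encodeUtf8 . T.pack

-- | Decode back to a String; safe because of the UTF-8 invariant.
unpackUtf8 :: Utf8Text -> String
unpackUtf8 = T.unpack . TE.decodeUtf8 . toUtf8Bytes
```

The point of the design is that toUtf8Bytes is free, whereas a String or a UTF-16 backed Text would have to be re-encoded on every call into a UTF-8 based library.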
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe