Re: [Haskell-cafe] Re: XML parser recommendation?
apfelmus wrote: > Ah! I got struck by a trick: it's possible to store every tag name in > full only once but still present a simple tree with full tag names to > the user. You simply share all the tag names. For instance, in > >let x = "mytagname" in Tree (Tag x) [Tree (Tag x) [Text "foobar"]] > > the string "mytagname" is stored in memory only once although there are > two tags with this name. > > When parsing an XML-file, you look up every parsed tag name in a finite > map (i.e. the trie). If it's already in, you don't store the parsed tag > name in the XML tree but the one stored at the leaf in the trie. Of > course, these two strings are equal. But they're not (pointer-) > identical! After parsing, all equal tag names now are pointers to one > and the same leaf in the finite map. You can drop the finite map > structure afterwards, the pointers to the leaves will remain. > > That would be quite the memory reduction for large XML trees which are > likely to have many many identical tag names. I've also thought about this approach. It sounds a bit weired, to built a map (or a trie) for the identity function. But it would solve a part of the space problem, at least when building XML documents by parsing. So I guess, there is a new project: A simple, small and lazy parser (no parsec), at least for the content part of XML. Uwe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Re: XML parser recommendation?
Uwe Schmidt wrote: The hashtable approach would of course reduce memory usage, Note that hashtables are evil :) I'm all for tries instead. but this would require a global change of the processing model: A document then does not longer consist of a single tree, it alway consists of a pair of a tree and a map. Ah! I got struck by a trick: it's possible to store every tag name in full only once but still present a simple tree with full tag names to the user. You simply share all the tag names. For instance, in let x = "mytagname" in Tree (Tag x) [Tree (Tag x) [Text "foobar"]] the string "mytagname" is stored in memory only once although there are two tags with this name. When parsing an XML-file, you look up every parsed tag name in a finite map (i.e. the trie). If it's already in, you don't store the parsed tag name in the XML tree but the one stored at the leaf in the trie. Of course, these two strings are equal. But they're not (pointer-) identical! After parsing, all equal tag names now are pointers to one and the same leaf in the finite map. You can drop the finite map structure afterwards, the pointers to the leaves will remain. That would be quite the memory reduction for large XML trees which are likely to have many many identical tag names. Regards, apfelmus ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: XML parser recommendation?
"Rene de Visser" <[EMAIL PROTECTED]> writes: > If I undertand the coding correctly every tag is stored as a seperate > Haskell string. As each byte of a string under GHC takes 12 bytes this alone > leads to high memory usage. Not that it detracts from your point, but I guess that is 24 bytes per character on 64 bit machinery? :-) -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Re: XML parser recommendation?
"Uwe Schmidt" <[EMAIL PROTECTED]> schrieb im Newsbeitrag news:[EMAIL PROTECTED] it into HXT. > > This still does not solve the processing of "very very large" > XML document. I doubt, whether we can do this with a DOM > like approach, as in HXT or HaXml. Lazy input does not solve all problems. > A SAX like parser could be a more useful choice for very large documents. > > Uwe I think a step towards support medium size documents in HXT would be to store the tags and content more efficiently. If I undertand the coding correctly every tag is stored as a seperate Haskell string. As each byte of a string under GHC takes 12 bytes this alone leads to high memory usage. Tags tend to repeat. You could store them uniquely using a hash table. Content could be stored in compressed byte strings. As I mentioned in an earlier post 2GB memory is not enough to process a 35MB XML document in HXT as we have 30 x 2 x 12 = 720 MB for starters to just store the string data (once in the parser and once in the DOM). (Well a machine with 2GB memory). I guess I had somewhere around 1GB free for the program. Other overheads most likely used up the ramaining 300 MB. Rene. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] Re: XML parser recommendation?
On 10/22/07, Rene de Visser <[EMAIL PROTECTED]> wrote: > > I had a look at using HXT awhile ago. Parsec is the least of the problems. > HXT stores the XML as an explicit tree in memory, where the head has > explict > references to the children. What did you end up using? I've started building an app based on HXT and that worries me. The fact that it produces a 10MB library file is also worrisome. Justin ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] Re: XML parser recommendation?
"Yitzchak Gale" <[EMAIL PROTECTED]> schrieb im Newsbeitrag news:[EMAIL PROTECTED] > Henning Thielemann wrote: >> HXT uses Parsec, which is strict. I had a look at using HXT awhile ago. Parsec is the least of the problems. HXT stores the XML as an explicit tree in memory, where the head has explict references to the children. This means that the whole XML tree is stored in memory until the last child is processed. Also this tree is stored ineffeciently. Everything as non shared Haskell strings. My experience is that a 30MB file (which is quite small for an XML file) can NOT be processed with 2GB memory. Rene. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe