Re: [Haskell-cafe] hxt memory useage
Hi

> Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.

Remember, Haskell has lazy evaluation. TagSoup is basically a SAX approach, without the massive pain of the Java version and the API being back to front compared to what you want. If you assume that your XML is well formed, then I think TagSoup is already StAX as well. TagSoup has been carefully engineered to run in constant memory.

> I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist.

TagSoup has both no namespace support, and full namespace support - it will happily read tags with namespaces in them, and it's a trivial break in your application to deal with them as you wish.

Thanks

Neil

___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
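One way to read the "trivial break" in the message above is as the Prelude function `break` applied to a tag name. The following is only a sketch under that assumption: it treats tag names as plain Strings such as "soap:Envelope", and `splitQName` is a hypothetical helper, not part of TagSoup.

```haskell
-- Split a qualified tag name into an optional namespace prefix and a
-- local name. Uses only the Prelude; this is an illustration, not
-- TagSoup's API.
splitQName :: String -> (Maybe String, String)
splitQName name =
  case break (== ':') name of
    (prefix, ':' : local) -> (Just prefix, local)  -- "soap:Envelope"
    (local, _)            -> (Nothing, local)      -- no colon present
```

Note that this only separates the prefix from the local name. Mapping a prefix to its namespace URI would additionally require tracking the xmlns attributes in scope, which this sketch does not attempt.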
Re: [Haskell-cafe] hxt memory useage
Suggestion: a binding to Expat, like Perl and Python did.

> So this is a request for an xml-light based on lazy bytestrings, designed for speed at all costs?
>
> -- Don
Re: [Haskell-cafe] hxt memory useage
That's a great little job actually. I'll have a go!

stevelihn:
> Suggestion: a binding to Expat, like Perl and Python did.
>
> > So this is a request for an xml-light based on lazy bytestrings, designed for speed at all costs?

-- Don
Re: [Haskell-cafe] hxt memory useage
matthew.pocock:
> On Saturday 26 January 2008, Keith Fahlgren wrote:
> > Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.
> >
> > I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist.
> >
> > Keith
>
> StAX is fine for a very wide range of applications, including web services. In the web-service domain, namespaces, entity expansion and XSD are not optional extras, but these can be layered on top of StAX rather than a more DOM-like structure.
>
> Just as a reality check, we regularly stream XML messages between web services in Java where the message bodies are many gigabytes in size, using StAX, and neither the client nor the server needs anything other than a constant memory overhead, as the portions of the message are generated and consumed in a streaming manner.
>
> It would be very nice if we could do similar things in Haskell.

Lazy evaluation FTW. :)

-- Don
Re: [Haskell-cafe] hxt memory useage
Don Stewart [EMAIL PROTECTED] writes:

> So this is a request for an xml-light based on lazy bytestrings, designed for speed at all costs?

Yes, I suppose it is. (For certain values of "all costs".)

For industrial use, I think it is important to have better performance, ideally approaching disk and network speeds, and support large documents without excessive memory consumption. This probably means that strict, whole-document trees (DOM-like) must be optional.

I think a good approach would be a TagSoup-like (SAX-like) lazy ByteString parser, with more advanced features (checking for well-formedness, building a tree structure, validation, namespace support..) layered on top.

These days, there is a lot of XML around, so solid and performant XML processing could be another step in missing our stated mission goal of avoiding success at all costs.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants
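The layering suggested in the message above might look like this in miniature: a well-formedness check written as an ordinary function over a stream of tags. This is a sketch under stated assumptions, not an existing library. The `Tag` type here is a simplified, hypothetical stand-in for whatever the underlying tokenizer emits, and `checkNesting` is an illustrative name.

```haskell
-- Hypothetical, simplified token type (attributes omitted).
data Tag = TagOpen String | TagClose String | TagText String
  deriving (Eq, Show)

-- Check element nesting with a stack of currently open element names.
-- Returns Nothing when the stream is balanced, or Just a message
-- describing the first problem found.
checkNesting :: [Tag] -> Maybe String
checkNesting = go []
  where
    go [] [] = Nothing
    go open [] = Just ("unclosed: " ++ unwords open)
    go open (TagOpen n : ts) = go (n : open) ts
    go (o : open) (TagClose n : ts)
      | o == n    = go open ts
      | otherwise = Just ("expected </" ++ o ++ "> but found </" ++ n ++ ">")
    go [] (TagClose n : _) = Just ("unexpected </" ++ n ++ ">")
    go open (TagText _ : ts) = go open ts
```

The checker walks the list once and keeps only the stack of open names, so its memory use grows with nesting depth rather than with document size, provided the caller does not hold on to the head of the tag list.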
Re: [Haskell-cafe] hxt memory useage
On 1/26/08 3:43 AM, Ketil Malde wrote:
> I think a good approach would be a TagSoup-like (SAX-like) lazy ByteString parser, with more advanced features (checking for well-formedness, building a tree structure, validation, namespace support..) layered on top.

Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.

I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist.

> These days, there is a lot of XML around, so solid and performant XML processing could be another step in missing our stated mission goal of avoiding success at all costs.

Agreed.

Keith

1. http://stax.codehaus.org/
   http://www.xml.com/pub/a/2003/09/17/stax.html
Re: [Haskell-cafe] hxt memory useage
On Saturday 26 January 2008, Keith Fahlgren wrote:
> Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In either case, streaming, non-DOM.
>
> I am concerned by the number of people expressing willingness to abandon namespace support, but perhaps I'm being too much of a purist.
>
> Keith

StAX is fine for a very wide range of applications, including web services. In the web-service domain, namespaces, entity expansion and XSD are not optional extras, but these can be layered on top of StAX rather than a more DOM-like structure.

Just as a reality check, we regularly stream XML messages between web services in Java where the message bodies are many gigabytes in size, using StAX, and neither the client nor the server needs anything other than a constant memory overhead, as the portions of the message are generated and consumed in a streaming manner.

It would be very nice if we could do similar things in Haskell.

Matthew
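As a rough illustration of generating and consuming a message in a streaming manner, here is a minimal Haskell sketch using a lazy list as the event stream. Everything in it is an assumption made for the example: the `Event` constructors merely echo StAX-style event names, and `items` and `countElements` are hypothetical.

```haskell
import Data.List (foldl')

-- Hypothetical event type, loosely named after StAX-style events.
data Event = StartElement String | Characters String | EndElement String
  deriving (Eq, Show)

-- Producer: a lazily generated message body containing n <item> elements.
-- Events are only built as the consumer demands them.
items :: Int -> [Event]
items n =
  StartElement "items"
    : concatMap item [1 .. n]
    ++ [EndElement "items"]
  where
    item i = [StartElement "item", Characters (show i), EndElement "item"]

-- Consumer: a strict left fold, so each event becomes garbage as soon
-- as it has been counted.
countElements :: String -> [Event] -> Int
countElements name = foldl' step 0
  where
    step acc (StartElement n) | n == name = acc + 1
    step acc _ = acc
```

For example, `countElements "item" (items 1000000)` walks a million elements without the whole event list needing to be resident at once, as long as nothing else retains a reference to the start of the list.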
Re: [Haskell-cafe] hxt memory useage
Matthew Pocock [EMAIL PROTECTED] writes:

> I've been using hxt to process xml files. Now that my files are getting a bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
> :
> Is this a known issue?

Yes. I parse what I suppose are rather large XML files (the largest so far is 26GB), and ended up replacing HXT code with TagSoup. I also needed to use concurrency[1]. XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap.

While I haven't tried HaXml, there is IMO a market opportunity for a fast and small XML library, and I'd happily trade away features like namespace support or arrows interfaces for that.

-k

[1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg31862.html
-- 
If I haven't seen further, it is by standing in the footprints of giants
Re: [Haskell-cafe] hxt memory useage
Hi

One of the problems with XML parsing is nesting. Consider this fragment:

    <foo>lots of text</foo>

The parser will naturally want to track all the way down to the closing </foo> in order to check the document is well formed, so it can put it in a tree. The problem is that means keeping lots of text in memory - often the entire document. TagSoup works in a lazy streaming manner, so would parse the above as:

    [TagOpen "foo" [], TagText "lots of text", TagClose "foo"]

i.e. it hasn't matched the foo's, and can return the TagOpen before even looking at the text.

> XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap.

I'd love TagSoup to go faster, while retaining its laziness. Basic profiling doesn't suggest anything obvious, but I may have missed something. It's more likely that it would be necessary to prod at the Core level, or move to supporting both (Lazy)ByteString and [Char].

Thanks

Neil
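To make the laziness argument in the message above concrete, here is a toy tokenizer written with the Prelude only. It is emphatically not TagSoup's implementation: the `Tag` type is a hypothetical copy of the three constructors mentioned in the message, and attributes, comments and entities are ignored. It only aims to show that the opening tag can be produced before the rest of the input has been examined.

```haskell
-- Hypothetical token type, modelled on the constructors named above.
data Tag
  = TagOpen String [(String, String)]
  | TagClose String
  | TagText String
  deriving (Eq, Show)

-- Tokenise lazily: each tag is emitted as soon as its own characters
-- have been read, without searching for a matching close tag.
tokens :: String -> [Tag]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in TagClose name : tokens (drop 1 rest')
tokens ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in TagOpen name [] : tokens (drop 1 rest')   -- attributes not parsed
tokens s =
  let (text, rest) = break (== '<') s
  in TagText text : tokens rest
```

Because the result list is built lazily, `take 1 (tokens ("<foo>" ++ cycle "lots of text "))` terminates even though the text after the opening tag never ends. A tree-building parser that waits for the closing tag could not return anything on that input.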
Re: [Haskell-cafe] hxt memory useage
ketil+haskell:
> Matthew Pocock [EMAIL PROTECTED] writes:
>
> > I've been using hxt to process xml files. Now that my files are getting a bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
> > :
> > Is this a known issue?
>
> Yes. I parse what I suppose are rather large XML files (the largest so far is 26GB), and ended up replacing HXT code with TagSoup. I also needed to use concurrency[1]. XML parsing is still slow, typically consuming 90% of the CPU time, but at least it works without blowing the heap.
>
> While I haven't tried HaXml, there is IMO a market opportunity for a fast and small XML library, and I'd happily trade away features like namespace support or arrows interfaces for that.

So this is a request for an xml-light based on lazy bytestrings, designed for speed at all costs?

-- Don
Re: [Haskell-cafe] hxt memory useage
Matthew Pocock wrote:
> I've been using hxt to process xml files. Now that my files are getting a bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory. I have 8g on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any down-stream processing by xpickle.
>
> Is this a known issue?

Yes, hxt calls parsec, which is not incremental.

haxml offers the choice of non-incremental parsers and incremental parsers. The incremental parsers offer finer control (and therefore also require finer control).
Re: [Haskell-cafe] hxt memory useage
On Thursday 24 January 2008, Albert Y. C. Lai wrote:
> Matthew Pocock wrote:
> > I've been using hxt to process xml files. Now that my files are getting a bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory. I have 8g on my box, and it's running out. As far as I can tell, this memory is getting used up while parsing the text, rather than in any down-stream processing by xpickle.
> >
> > Is this a known issue?
>
> Yes, hxt calls parsec, which is not incremental.
>
> haxml offers the choice of non-incremental parsers and incremental parsers. The incremental parsers offer finer control (and therefore also require finer control).

I've got a load of code using xpickle, which taken together is quite an investment in hxt. Moving to haxml may not be very practical, as I'll have to find some equivalent of xpickle for haxml and port thousands of lines of code over. Is there likely to be a low-cost solution to convincing hxt to be incremental that would get me out of this mess?

Matthew