Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Neil Mitchell
Hi

> Perhaps a more modern approach would be StAX[1]-like rather than SAX-like?
> In either case, streaming, non-DOM.

Remember, Haskell has lazy evaluation. TagSoup is basically a SAX
approach, minus the massive pain of the Java version and its
back-to-front API. If you assume that your XML is well formed, then I
think TagSoup is already StAX as well. TagSoup has been carefully
engineered to run in constant memory.

> I am concerned by the number of people expressing willingness to abandon
> namespace support, but perhaps I'm being too much of a purist

TagSoup has both no namespace support and full namespace support - it
will happily read tags with namespaces in them, and it's trivial in
your application to deal with them as you wish.
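
For example, here is a rough sketch of that kind of application-level
handling (assuming the Tag type from a recent tagsoup release, where Tag
is parameterised by the string type; splitQName and dropPrefixes are just
illustrative helpers, not library functions):

-- Rough sketch: split qualified names yourself and apply whatever
-- namespace policy your application wants.
import Text.HTML.TagSoup (Tag(..), parseTags)

-- Split a qualified name like "soap:Envelope" into (Just "soap", "Envelope").
splitQName :: String -> (Maybe String, String)
splitQName qn = case break (== ':') qn of
                  (prefix, ':' : local) -> (Just prefix, local)
                  _                     -> (Nothing, qn)

-- Throw prefixes away entirely, if that is what the application wants.
dropPrefixes :: [Tag String] -> [Tag String]
dropPrefixes = map f
  where
    f (TagOpen  name attrs) = TagOpen  (snd (splitQName name)) attrs
    f (TagClose name)       = TagClose (snd (splitQName name))
    f t                     = t

main :: IO ()
main = readFile "msg.xml" >>= mapM_ print . dropPrefixes . parseTags

A real application might instead resolve prefixes against the in-scope
xmlns declarations, but the point is that the policy lives in your code,
not in the parser.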

Thanks

Neil


Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Steve Lihn
Suggestion: a binding to Expat, like perl and python did.


> So this is a request for an xml-light based on lazy bytestrings, designed
> for speed at all costs?
>
>  -- Don



Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Don Stewart
That's a great little job actually. I'll have a go!

stevelihn:
> Suggestion: a binding to Expat, like perl and python did.
>
> > So this is a request for an xml-light based on lazy bytestrings, designed
> > for speed at all costs?
> >
> >  -- Don
 


Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Don Stewart
matthew.pocock:
> On Saturday 26 January 2008, Keith Fahlgren wrote:
> > Perhaps a more modern approach would be StAX[1]-like rather than SAX-like?
> > In either case, streaming, non-DOM.
> >
> > I am concerned by the number of people expressing willingness to abandon
> > namespace support, but perhaps I'm being too much of a purist
> >
> > Keith
>
> StAX is fine for a very wide range of applications, including web services.
> In the web-service domain, namespaces and entity expansion and xsd are not
> optional extras, but these can be layered on top of StAX rather than a more
> DOM-like structure.
>
> Just as a reality check, we regularly stream xml messages between web
> services in Java where the message bodies are many gigabytes in size, using
> StAX, and neither the client nor the server needs anything other than a
> constant memory overhead, as the portions of the message are generated and
> consumed in a streaming manner. It would be very nice if we could do similar
> things in Haskell.

Lazy evaluation FTW. :)

-- Don


Re: [Haskell-cafe] hxt memory useage

2008-01-26 Thread Ketil Malde
Don Stewart [EMAIL PROTECTED] writes:

> So this is a request for an xml-light based on lazy bytestrings, designed
> for speed at all costs?

Yes, I suppose it is.  (For certain values of all costs.)

For industrial use, I think it is important to have better
performance, ideally approaching disk and network speeds, and to
support large documents without excessive memory consumption. This
probably means that strict, whole-document trees (DOM-like) must be
optional.

I think a good approach would be a TagSoup-like (SAX-like) lazy
ByteString parser, with more advanced features (checking for
well-formedness, building a tree structure, validation, namespace
support..) layered on top. 
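
As a rough illustration of that layering (assuming the Tag type from a
recent tagsoup; a real version would work over lazy ByteStrings and
report error positions), a well-formedness check can sit on top of the
lazy tag stream and still run in constant space:

-- Rough sketch of one such layer: a well-formedness check that consumes
-- the lazy tag stream incrementally, keeping only a stack of the
-- currently open element names.
import Text.HTML.TagSoup (Tag(..), parseTags)

wellFormed :: [Tag String] -> Bool
wellFormed = go []
  where
    go stack     (TagOpen name _ : ts) = go (name : stack) ts
    go (n:stack) (TagClose name  : ts) = n == name && go stack ts
    go []        (TagClose _     : _ ) = False         -- close with nothing open
    go stack     (_ : ts)              = go stack ts   -- text, comments, etc.
    go stack     []                    = null stack    -- everything closed?

main :: IO ()
main = readFile "big.xml" >>= print . wellFormed . parseTags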

These days, there is a lot of XML around, so solid and performant XML
processing could be another step in missing our stated mission goal of
avoiding success at all costs.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] hxt memory useage

2008-01-26 Thread Keith Fahlgren
On 1/26/08 3:43 AM, Ketil Malde wrote:
> I think a good approach would be a TagSoup-like (SAX-like) lazy
> ByteString parser, with more advanced features (checking for
> well-formedness, building a tree structure, validation, namespace
> support..) layered on top.

Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In
either case, streaming, non-DOM.

I am concerned by the number of people expressing willingness to abandon
namespace support, but perhaps I'm being too much of a purist

> These days, there is a lot of XML around, so solid and performant XML
> processing could be another step in missing our stated mission goal of
> avoiding success at all costs.

Agreed.


Keith

1. http://stax.codehaus.org/
http://www.xml.com/pub/a/2003/09/17/stax.html


Re: [Haskell-cafe] hxt memory useage

2008-01-26 Thread Matthew Pocock
On Saturday 26 January 2008, Keith Fahlgren wrote:
> Perhaps a more modern approach would be StAX[1]-like rather than SAX-like?
> In either case, streaming, non-DOM.
>
> I am concerned by the number of people expressing willingness to abandon
> namespace support, but perhaps I'm being too much of a purist
>
> Keith

StAX is fine for a very wide range of applications, including web services.
In the web-service domain, namespaces and entity expansion and xsd are not
optional extras, but these can be layered on top of StAX rather than a more
DOM-like structure.

Just as a reality check, we regularly stream xml messages between web
services in Java where the message bodies are many gigabytes in size, using
StAX, and neither the client nor the server needs anything other than a
constant memory overhead, as the portions of the message are generated and
consumed in a streaming manner. It would be very nice if we could do similar
things in Haskell.

Matthew


Re: [Haskell-cafe] hxt memory useage

2008-01-25 Thread Ketil Malde
Matthew Pocock [EMAIL PROTECTED] writes:

> I've been using hxt to process xml files. Now that my files are getting a
> bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
>  :
> Is this a known issue?

Yes.  I parse what I suppose are rather large XML files (the largest
so far is 26GB), and ended up replacing HXT code with TagSoup.  I also
needed to use concurrency[1].  XML parsing is still slow, typically
consuming 90% of the CPU time, but at least it works without blowing
the heap. 

While I haven't tried HaXML, there is IMO a market opportunity for a
fast and small XML library, and I'd happily trade away features like
namespace support or arrows interfaces for that.

-k

[1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg31862.html
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] hxt memory useage

2008-01-25 Thread Neil Mitchell
Hi

One of the problems with XML parsing is nesting. Consider this fragment:

<foo>lots of text</foo>

The parser will naturally want to track all the way down to the
closing </foo> in order to check the document is well formed, so it
can put it in a tree. The problem is that this means keeping lots of text
in memory - often the entire document. TagSoup works in a lazy
streaming manner, so would parse the above as:

[TagOpen "foo" [], TagText "lots of text", TagClose "foo"]

i.e. it hasn't matched up the <foo> tags, and can return the TagOpen before
even looking at the text.
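
For instance, here is a rough sketch of a consumer (assuming the Tag type
from a recent tagsoup; fooText is just an illustrative name) that pulls
the text out of <foo> elements and, thanks to laziness, only ever holds a
small window of the document in memory:

-- Rough sketch: extract the text inside <foo> elements from a large
-- document.  Because parseTags is lazy, tags are produced and consumed
-- as the file is read, rather than after the whole document has been
-- parsed into a tree.
import Text.HTML.TagSoup (Tag(..), parseTags)

fooText :: String -> [String]
fooText = go False . parseTags
  where
    go _      []                     = []
    go _      (TagOpen "foo" _ : ts) = go True ts
    go _      (TagClose "foo"  : ts) = go False ts
    go True   (TagText txt     : ts) = txt : go True ts
    go inside (_               : ts) = go inside ts

main :: IO ()
main = readFile "big.xml" >>= mapM_ putStrLn . fooText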

> XML parsing is still slow, typically
> consuming 90% of the CPU time, but at least it works without blowing
> the heap.

I'd love TagSoup to go faster, while retaining its laziness. Basic
profiling doesn't suggest anything obvious, but I may have missed
something. It's more likely that it would be necessary to prod at the
Core level, or move to supporting both (lazy) ByteString and [Char].

Thanks

Neil


Re: [Haskell-cafe] hxt memory useage

2008-01-25 Thread Don Stewart
ketil+haskell:
> Matthew Pocock [EMAIL PROTECTED] writes:
>
> > I've been using hxt to process xml files. Now that my files are getting a
> > bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
> >  :
> > Is this a known issue?
>
> Yes.  I parse what I suppose are rather large XML files (the largest
> so far is 26GB), and ended up replacing HXT code with TagSoup.  I also
> needed to use concurrency[1].  XML parsing is still slow, typically
> consuming 90% of the CPU time, but at least it works without blowing
> the heap.
>
> While I haven't tried HaXML, there is IMO a market opportunity for a
> fast and small XML library, and I'd happily trade away features like
> namespace support or arrows interfaces for that.

So this is a request for an xml-light based on lazy bytestrings, designed
for speed at all costs?

-- Don


Re: [Haskell-cafe] hxt memory useage

2008-01-24 Thread Albert Y. C. Lai

Matthew Pocock wrote:
> I've been using hxt to process xml files. Now that my files are getting a
> bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory. I
> have 8g on my box, and it's running out. As far as I can tell, this memory
> is getting used up while parsing the text, rather than in any down-stream
> processing by xpickle.
>
> Is this a known issue?


Yes, hxt calls parsec, which is not incremental.

haxml offers the choice of non-incremental parsers and incremental 
parsers. The incremental parsers offer finer control (and therefore also 
require finer control).



Re: [Haskell-cafe] hxt memory useage

2008-01-24 Thread Matthew Pocock
On Thursday 24 January 2008, Albert Y. C. Lai wrote:
> Matthew Pocock wrote:
> > I've been using hxt to process xml files. Now that my files are getting a
> > bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
> > I have 8g on my box, and it's running out. As far as I can tell, this
> > memory is getting used up while parsing the text, rather than in any
> > down-stream processing by xpickle.
> >
> > Is this a known issue?
>
> Yes, hxt calls parsec, which is not incremental.
>
> haxml offers the choice of non-incremental parsers and incremental
> parsers. The incremental parsers offer finer control (and therefore also
> require finer control).

I've got a load of code using xpickle, which taken together is quite an
investment in hxt. Moving to haxml may not be very practical, as I'll have to
find some equivalent of xpickle for haxml and port thousands of lines of code
over. Is there likely to be a low-cost solution to convincing hxt to be
incremental that would get me out of this mess?

Matthew


