Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Don Stewart
matthew.pocock:
> On Saturday 26 January 2008, Keith Fahlgren wrote:
> > Perhaps a more modern approach would be StAX[1]-like rather than SAX-like?
> > In either case, streaming, non-DOM.
> >
> > I am concerned by the number of people expressing willingness to abandon
> > namespace support, but perhaps I'm being too much of a purist
> 
> > Keith
> 
> StAX is fine for a very wide range of applications, including web services.
> In the web-service domain, namespaces and entity expansion and xsd are not
> optional extras, but these can be layered on top of StAX rather than a more
> DOM-like structure.
>
> Just as a reality check, we regularly stream xml messages between web
> services in Java where the message bodies are many gig in size, using StAX,
> and neither the client nor the server needs anything other than a constant
> memory overhead, as the portions of the message are generated and consumed
> in a streaming manner. It would be very nice if we could do similar things
> in haskell.

Lazy evaluation FTW. :)

-- Don
___
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe


Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Don Stewart
That's a great little job actually. I'll have a go!

stevelihn:
> Suggestion: a binding to Expat, like perl and python did.
> 
> >
> > So this is a request for an xml-light based on lazy bytestrings, designed
> > for speed at all costs?
> >
> > -- Don
> >


Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Steve Lihn
Suggestion: a binding to Expat, like perl and python did.

>
> So this is a request for an xml-light based on lazy bytestrings, designed
> for speed at all costs?
>
> -- Don
>


Re: [Haskell-cafe] hxt memory useage

2008-01-27 Thread Neil Mitchell
Hi

> Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In
> either case, streaming, non-DOM.

Remember, Haskell has lazy evaluation. TagSoup is basically a SAX
approach, without the massive pain of the Java version and the API
being back to front compared to what you want. If you assume that your
XML is well formed, then I think TagSoup is already StAX as well.
TagSoup has been carefully engineered to run in constant memory.

> I am concerned by the number of people expressing willingness to abandon
> namespace support, but perhaps I'm being too much of a purist

TagSoup has both no namespace support, and full namespace support - it
will happily read tags with namespaces in them, and it's a trivial
break in your application to "deal" with them as you wish.
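
[Editor's sketch] Reading "break" above as the Prelude function is an
assumption. Under that reading, splitting a prefixed tag name in the
application could look like the following. The name splitQName is
hypothetical and not part of TagSoup, and the sketch only separates the
prefix from the local name; it does not resolve prefixes against xmlns
declarations.

```haskell
-- Hypothetical helper, not part of TagSoup: split "prefix:local" at the
-- first colon. Names without a colon have no prefix.
splitQName :: String -> (Maybe String, String)
splitQName name =
  case break (== ':') name of
    (prefix, ':' : local) -> (Just prefix, local)
    _                     -> (Nothing, name)
```

For example, splitQName "soap:Body" gives (Just "soap", "Body"), while
splitQName "foo" gives (Nothing, "foo").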

Thanks

Neil


Re: [Haskell-cafe] hxt memory useage

2008-01-26 Thread Matthew Pocock
On Saturday 26 January 2008, Keith Fahlgren wrote:
> Perhaps a more modern approach would be StAX[1]-like rather than SAX-like?
> In either case, streaming, non-DOM.
>
> I am concerned by the number of people expressing willingness to abandon
> namespace support, but perhaps I'm being too much of a purist

> Keith

StAX is fine for a very wide range of applications, including web services. In 
the web-service domain, namespaces and entity expansion and xsd are not 
optional extras, but these can be layered on top of StAX rather than a more 
DOM-like structure.

Just as a reality check, we regularly stream xml messages between web services 
in Java where the message bodies are many gig in size, using StAX, and 
neither the client nor the server needs anything other than a constant memory 
overhead, as the portions of the message are generated and consumed in a 
streaming manner. It would be very nice if we could do similar things in 
haskell.

Matthew


Re: [Haskell-cafe] hxt memory useage

2008-01-26 Thread Keith Fahlgren
On 1/26/08 3:43 AM, Ketil Malde wrote:
> I think a good approach would be a TagSoup-like (SAX-like) lazy
> ByteString parser, with more advanced features (checking for
> well-formedness, building a tree structure, validation, namespace
> support..) layered on top. 

Perhaps a more modern approach would be StAX[1]-like rather than SAX-like? In
either case, streaming, non-DOM.

I am concerned by the number of people expressing willingness to abandon
namespace support, but perhaps I'm being too much of a purist

> These days, there is a lot of XML around, so solid and performant XML
> processing could be another step in missing our stated mission goal of
> avoiding success at all costs.

Agreed.


Keith

1. http://stax.codehaus.org/
http://www.xml.com/pub/a/2003/09/17/stax.html


Re: [Haskell-cafe] hxt memory useage

2008-01-26 Thread Ketil Malde
Don Stewart <[EMAIL PROTECTED]> writes:

> So this is a request for an xml-light based on lazy bytestrings, designed
> for speed at all costs?

Yes, I suppose it is.  (For certain values of "all costs".)

For "industrial" use, I think it is important to have better
performance, ideally approaching disk and network speeds, and support
large documents without excessive memory consumption.  This probably
means that strict, whole-document trees (DOM-like) must be optional.

I think a good approach would be a TagSoup-like (SAX-like) lazy
ByteString parser, with more advanced features (checking for
well-formedness, building a tree structure, validation, namespace
support..) layered on top. 
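
[Editor's sketch] As one illustration of layering, a well-formedness
check can be written as a separate pass over a token stream. The Tag
type below is a simplified stand-in defined for this example, not the
TagSoup type, and wellFormed is a hypothetical name. The check only
verifies that open and close tags nest properly; its memory use grows
with nesting depth rather than document size.

```haskell
-- Simplified stand-in token type for this sketch (no attributes).
data Tag = TagOpen String | TagClose String | TagText String

-- Check that every close tag matches the most recent unclosed open tag
-- and that nothing is left open at the end of the stream.
wellFormed :: [Tag] -> Bool
wellFormed = go []
  where
    go stack       (TagOpen n  : ts) = go (n : stack) ts
    go (s : stack) (TagClose n : ts) = s == n && go stack ts
    go []          (TagClose _ : _ ) = False
    go stack       (TagText _  : ts) = go stack ts
    go stack       []                = null stack
```

A consumer that does not need the check can simply skip this layer and
work on the token list directly.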

These days, there is a lot of XML around, so solid and performant XML
processing could be another step in missing our stated mission goal of
avoiding success at all costs.

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] hxt memory useage

2008-01-25 Thread Don Stewart
ketil+haskell:
> Matthew Pocock <[EMAIL PROTECTED]> writes:
> 
> > I've been using hxt to process xml files. Now that my files are getting a
> > bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
>   :
> > Is this a known issue?
> 
> Yes.  I parse what I suppose are rather large XML files (the largest
> so far is 26GB), and ended up replacing HXT code with TagSoup.  I also
> needed to use concurrency[1].  XML parsing is still slow, typically
> consuming 90% of the CPU time, but at least it works without blowing
> the heap. 
> 
> While I haven't tried HaXML, there is IMO a market opportunity for a
> fast and small XML library, and I'd happily trade away features like
> namespace support or arrows interfaces for that.

So this is a request for an xml-light based on lazy bytestrings, designed
for speed at all costs?

-- Don


Re: [Haskell-cafe] hxt memory useage

2008-01-25 Thread Neil Mitchell
Hi

One of the problems with XML parsing is nesting. Consider this fragment:

<foo>lots of text</foo>

The parser will naturally want to track all the way down to the
closing </foo> in order to check the document is well formed, so it
can put it in a tree. The problem is that means keeping "lots of text"
in memory - often the entire document. TagSoup works in a lazy
streaming manner, so would parse the above as:

[TagOpen "foo" [], TagText "lots of text", TagClose "foo"]

i.e. it hasn't matched the foo's, and can return the TagOpen before
even looking at the text.
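
[Editor's sketch] The behaviour described above can be imitated with a
toy tokenizer. This is not TagSoup's implementation: the Tag type and
the tokens function are defined here for illustration only, and the
sketch ignores attributes (the list is always empty), entities and
comments.

```haskell
-- Toy token type shaped like the output shown above.
data Tag
  = TagOpen String [(String, String)]
  | TagClose String
  | TagText String
  deriving (Eq, Show)

-- Lazily split the input into tags. Each tag is produced as soon as its
-- own characters have been read; nothing is matched against a closing tag.
tokens :: String -> [Tag]
tokens [] = []
tokens ('<' : '/' : rest) =
  let (name, rest') = break (== '>') rest
  in TagClose name : tokens (drop 1 rest')
tokens ('<' : rest) =
  let (name, rest') = break (== '>') rest
  in TagOpen name [] : tokens (drop 1 rest')
tokens s =
  let (txt, rest) = break (== '<') s
  in TagText txt : tokens rest
```

Here tokens "<foo>lots of text</foo>" yields the three-element list
shown above, and take 1 (tokens ("<foo>" ++ undefined)) yields
[TagOpen "foo" []] without ever touching the rest of the input.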

> XML parsing is still slow, typically
> consuming 90% of the CPU time, but at least it works without blowing
> the heap.

I'd love TagSoup to go faster, while retaining its laziness. Basic
profiling doesn't suggest anything obvious, but I may have missed
something. It's more likely that it would be necessary to prod at the
Core level, or move to supporting both (Lazy)ByteString and [Char].

Thanks

Neil


Re: [Haskell-cafe] hxt memory useage

2008-01-25 Thread Ketil Malde
Matthew Pocock <[EMAIL PROTECTED]> writes:

> I've been using hxt to process xml files. Now that my files are getting a bit 
> bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
  :
> Is this a known issue?

Yes.  I parse what I suppose are rather large XML files (the largest
so far is 26GB), and ended up replacing HXT code with TagSoup.  I also
needed to use concurrency[1].  XML parsing is still slow, typically
consuming 90% of the CPU time, but at least it works without blowing
the heap. 

While I haven't tried HaXML, there is IMO a market opportunity for a
fast and small XML library, and I'd happily trade away features like
namespace support or arrows interfaces for that.

-k

[1] http://www.mail-archive.com/haskell-cafe@haskell.org/msg31862.html
-- 
If I haven't seen further, it is by standing in the footprints of giants


Re: [Haskell-cafe] hxt memory useage

2008-01-24 Thread Matthew Pocock
On Thursday 24 January 2008, Albert Y. C. Lai wrote:
> Matthew Pocock wrote:
> > I've been using hxt to process xml files. Now that my files are getting a
> > bit bigger (30m) I'm finding that hxt uses inordinate amounts of memory.
> > I have 8g on my box, and it's running out. As far as I can tell, this
> > memory is getting used up while parsing the text, rather than in any
> > down-stream processing by xpickle.
> >
> > Is this a known issue?
>
> Yes, hxt calls parsec, which is not incremental.
>
> haxml offers the choice of non-incremental parsers and incremental
> parsers. The incremental parsers offer finer control (and therefore also
> require finer control).

I've got a load of code using xpickle, which taken together is quite an 
investment in hxt. Moving to haxml may not be very practical, as I'll have to 
find some equivalent of xpickle for haxml and port thousands of lines of code 
over. Is there likely to be a low-cost solution to convincing hxt to be 
incremental that would get me out of this mess?

Matthew



Re: [Haskell-cafe] hxt memory useage

2008-01-24 Thread Albert Y. C. Lai

Matthew Pocock wrote:
> I've been using hxt to process xml files. Now that my files are getting a bit
> bigger (30m) I'm finding that hxt uses inordinate amounts of memory. I have
> 8g on my box, and it's running out. As far as I can tell, this memory is
> getting used up while parsing the text, rather than in any down-stream
> processing by xpickle.
>
> Is this a known issue?

Yes, hxt calls parsec, which is not incremental.

haxml offers the choice of non-incremental parsers and incremental 
parsers. The incremental parsers offer finer control (and therefore also 
require finer control).
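
[Editor's sketch] A generic illustration of the difference, using
neither hxt, parsec nor haxml code: a parser whose result is wrapped in
Either cannot hand anything back until it has checked the whole input,
whereas one that returns a lazy list can be consumed as it goes. The
names parseAll and parseLazy, and the NUL check standing in for
"a parse error", are assumptions made for this example.

```haskell
-- All-or-nothing: the caller only learns whether this is Left or Right
-- after the entire input has been inspected.
parseAll :: String -> Either String [String]
parseAll s
  | any (== '\0') s = Left "unexpected NUL character"
  | otherwise       = Right (words s)

-- Incremental: each word becomes available as soon as it has been read,
-- so a consumer can work on a prefix of the input.
parseLazy :: String -> [String]
parseLazy = words
```

With an unbounded input such as cycle "ab ", take 2 (parseLazy input)
returns after reading a few characters, while parseAll on the same
input never returns. The price is that the lazy version has no single
place to report an error.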



[Haskell-cafe] hxt memory useage

2008-01-24 Thread Matthew Pocock
Hi,

I've been using hxt to process xml files. Now that my files are getting a bit 
bigger (30m) I'm finding that hxt uses inordinate amounts of memory. I have 
8g on my box, and it's running out. As far as I can tell, this memory is 
getting used up while parsing the text, rather than in any down-stream 
processing by xpickle.

Is this a known issue?

Matthew