Re: [Haskell-cafe] Re: hxt memory usage

2008-02-01 Thread Malcolm Wallace
"Rene de Visser" <[EMAIL PROTECTED]> wrote:

> Even if you replace parsec, HXT is itself not
> incremental.  (It stores the whole XML document in memory as a tree,
> and the tree is not memory efficient.)

If the usage pattern of the tree is search-and-discard, then only enough
of the tree to satisfy the search needs to be stored in memory at once.
Everything from the root to the first node of interest can easily be
pruned by the garbage collector.
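
To make this concrete, here is a tiny sketch of search-and-discard over
a lazily built tree (the Tree type and findElems are made up for
illustration; they are not HaXml's actual API):

-- A made-up element type: a tag name plus a lazily produced list of
-- children.
data Tree = Elem String [Tree]

-- Lazily yield every element with the given tag, depth-first.  A
-- consumer that demands only the first match forces (and retains)
-- only the part of the tree up to that match.
findElems :: String -> Tree -> [Tree]
findElems tag t@(Elem name children)
  | name == tag = t : rest
  | otherwise   = rest
  where
    rest = concatMap (findElems tag) children

-- Usage: take 1 (findElems "item" doc) walks the tree only as far as
-- the first "item"; everything already searched becomes garbage,
-- provided nothing else holds on to the root of 'doc'.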

A paper describing the lazy parsing technique, and using XML-parsing as
its motivating example, is available at
http://www.cs.york.ac.uk/~malcolm/partialparse.html

> >> haxml offers the choice of non-incremental parsers and incremental
> >> parsers.

Indeed.  This lazy incremental parser for XML is available in the
development version of HaXml:
http://www.cs.york.ac.uk/fp/HaXml-devel

The source code for partial parsing is available in a separate package:
http://www.cs.york.ac.uk/fp/polyparse

These lazy parser combinators are roughly 2x to 5x faster than Parsec
on large inputs (although the strict variant is about 2x slower than
Parsec).

Regards,
Malcolm


Re: [Haskell-cafe] Re: hxt memory usage

2008-01-28 Thread Uwe Schmidt
Rene de Visser wrote:

> "Matthew Pocock" <[EMAIL PROTECTED]> schrieb im Newsbeitrag 
> news:[EMAIL PROTECTED]
> > On Thursday 24 January 2008, Albert Y. C. Lai wrote:
> >> Matthew Pocock wrote:
> >> > I've been using hxt to process xml files. Now that my files are
> >> > getting a bit bigger (30m) I'm finding that hxt uses inordinate
> >> > amounts of memory. I have 8g on my box, and it's running out. As
> >> > far as I can tell, this memory is getting used up while parsing
> >> > the text, rather than in any down-stream processing by xpickle.
> >> >
> >> > Is this a known issue?
> >>
> >> Yes, hxt calls parsec, which is not incremental.
> >>
> >> haxml offers the choice of non-incremental parsers and incremental
> >> parsers. The incremental parsers offer finer control (and therefore also
> >> require finer control).
> >
> > I've got a load of code using xpickle, which taken together is quite
> > an investment in hxt. Moving to haxml may not be very practical, as
> > I'll have to find some equivalent of xpickle for haxml and port
> > thousands of lines of code over. Is there likely to be a low-cost
> > solution to convincing hxt to be incremental that would get me out of
> > this mess?
> >
> > Matthew
> 
> I don't think so. Even if you replace parsec, HXT is itself not
> incremental. (It stores the whole XML document in memory as a tree, and
> the tree is not memory efficient.)

This statement isn't true in general. HXT itself can be incremental if
there is no need to traverse the whole XML tree. When processing a
document containing a DTD, however, the whole tree does have to be
traversed, even when no validation is required, because of entity
substitution.
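
For example, with a (made-up) document like the following, an entity
defined in the internal DTD subset may be referenced anywhere in the
body, so every text node has to be visited to perform the substitution:

<?xml version="1.0"?>
<!DOCTYPE doc [ <!ENTITY greeting "hello, world"> ]>
<doc><p>&greeting;</p><p>&greeting;</p></doc>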

Technically it's not a big deal to write a very simple and lazy parser,
or to take the tagsoup or haxml lazy parsers and adapt them to the HXT
DOM structure. Combining the parser with the ByteString library raises
one small problem, the handling of Unicode characters: a lazy Word8 to
Unicode (Char) conversion is needed, but that's already in HXT (thanks
to Henning Thielemann).
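
To illustrate the idea, here is a rough sketch of such a lazy Word8 to
Char conversion (a plain UTF-8 decoder written against the standard
Data.ByteString.Lazy API; it is not HXT's actual conversion code and
does no error handling for malformed input):

import qualified Data.ByteString.Lazy as L
import Data.Bits ((.&.), shiftL)
import Data.Char (chr)
import Data.Word (Word8)

-- Lazy Word8 -> Char UTF-8 decoding: because both L.unpack and the
-- output String are lazy, only a small window of the input needs to
-- be live at any time, and consumed chunks can be reclaimed by the GC.
decodeUtf8 :: L.ByteString -> String
decodeUtf8 = go . L.unpack
  where
    go :: [Word8] -> String
    go []     = []
    go (b:bs)
      | b < 0x80  = chr (fromIntegral b) : go bs   -- ASCII byte
      | b < 0xE0  = multi 1 (b .&. 0x1F) bs        -- 2-byte sequence
      | b < 0xF0  = multi 2 (b .&. 0x0F) bs        -- 3-byte sequence
      | otherwise = multi 3 (b .&. 0x07) bs        -- 4-byte sequence

    -- consume n continuation bytes and emit one Char
    multi :: Int -> Word8 -> [Word8] -> String
    multi n lead bs =
      case splitAt n bs of
        (cont, rest)
          | length cont == n ->
              chr (foldl step (fromIntegral lead) cont) : go rest
        _ -> []                                    -- truncated input
      where
        step acc w = (acc `shiftL` 6) + fromIntegral (w .&. 0x3F)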

So the problem is not a technical one, it's just a matter of time and
resources. If someone has such a lightweight lazy XML parser, I will
help to integrate it into HXT.


  Uwe


Re: [Haskell-cafe] Re: hxt memory usage

2008-01-26 Thread Bulat Ziganshin
Hello Rene,

Friday, January 25, 2008, 10:49:53 PM, you wrote:

> Still I am a bit surprised that you can't parse 30m with 8 gig memory.

> This was discussed here before, and I think someone benchmarked HXT as using
> roughly 50 bytes of memory per 1 byte of input.
> i.e. HXT would then be using about 1.5 gig of memory for your 30m file.

To be exact, it's GHC itself that uses so much memory (on 64-bit
systems). The calculation is simple; each character of a String costs:
- one word (8 bytes) for the Char itself
- one word for the tail pointer
- one word for the laziness thunk

That's already 24 bytes, and depending on the GC used, this amount
increases either 2x (with the compacting GC) or 3x (with the default
copying GC).

On top of that, HXT may add even more overhead by keeping several copies
of the data or building lazy thunks for each character.
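
Plugging in the numbers from this thread (a back-of-the-envelope sketch;
the constants come from the discussion above, not from measurement):

bytesPerChar, gcFactor, inputBytes, estimate :: Integer
bytesPerChar = 3 * 8              -- three words per character, 8 bytes each
gcFactor     = 2                  -- 2x to 3x depending on the collector
inputBytes   = 30 * 1024 * 1024   -- the 30m input file
estimate     = inputBytes * bytesPerChar * gcFactor
-- estimate = 1509949440, i.e. about 1.5 gig just for the raw String,
-- matching the figure quoted above, before HXT builds any tree on top.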


-- 
Best regards,
 Bulat  mailto:[EMAIL PROTECTED]
