Hi,
I'm having some difficulty parsing large XML files (about 100 MB).
A plain SAX parser, as provided by hexpat, is fine. However,
constructing a tree consumes too much memory on a 32-bit machine.
See http://trac.informatik.uni-bremen.de:8080/hets/ticket/1248
I suspect that sharing strings when
Have you looked at tagsoup?
On Feb 20, 2014 3:30 AM, Christian Maeder christian.mae...@dfki.de
wrote:
Hi,
I'm having some difficulty parsing large XML files (about 100 MB).
A plain SAX parser, as provided by hexpat, is fine. However, constructing a
tree consumes too much memory on a 32-bit machine.
I've just tried:
import Text.HTML.TagSoup
import Text.HTML.TagSoup.Tree
main :: IO ()
main = getContents >>= putStr . renderTags . flattenTree . tagTree . parseTags
which also ends with the getMBlock error.
Only renderTags . parseTags works fine (like the hexpat SAX parser).
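For comparison, here is a minimal sketch of a streaming pass that never builds a tree. It assumes the tagsoup package and String input; countOpenTags is a hypothetical name chosen for illustration. Because the tag list is consumed as it is produced, already-processed tags should be collectable, which is presumably why the flat pipeline above stays within memory.

```haskell
import Text.HTML.TagSoup (isTagOpen, parseTags)

-- Hypothetical helper: count opening tags in a single pass over the
-- lazily produced tag stream, without building a TagTree.
countOpenTags :: String -> Int
countOpenTags = length . filter isTagOpen . parseTags

main :: IO ()
main = getContents >>= print . countOpenTags
```

Run it with the XML file on standard input. Any fold over the tag stream that does not retain the whole list should behave similarly.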
Why
Hi Christian,
as regards your question about sharing strings, there are a number of
libraries on Hackage to achieve this, e.g. in the context of compiler
symbols. To cite only a few: intern, stringtable-atom, simple-atom.
I'm sure there are others.
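To illustrate the idea only, here is a hand-rolled sketch of string sharing built on Data.Map from containers. It is not the API of intern, stringtable-atom or simple-atom; Pool, internString and internAll are hypothetical names. Each distinct string is kept once, and later occurrences are replaced by a reference to the first copy.

```haskell
import Data.List (mapAccumL)
import qualified Data.Map.Strict as Map

-- Hypothetical pool type: maps each distinct string to its first-seen copy.
type Pool = Map.Map String String

-- Return the pooled copy of a string, adding it to the pool on first sight.
internString :: Pool -> String -> (Pool, String)
internString pool s = case Map.lookup s pool of
  Just shared -> (pool, shared)            -- reuse the existing copy
  Nothing     -> (Map.insert s s pool, s)  -- first occurrence becomes the shared copy

-- Intern every string in a list, threading the pool through.
internAll :: [String] -> (Pool, [String])
internAll = mapAccumL internString Map.empty

main :: IO ()
main = do
  let (pool, names) = internAll ["item", "name", "item", "item", "name"]
  print (Map.size pool)  -- number of distinct strings kept
  print names            -- same contents, but repeated names share one copy
```

In an XML tree this would typically be applied to element and attribute names, which repeat often. Whether it helps with text content depends on how repetitive that content is.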
Best,
--
Mathieu Boespflug
Founder at
Is your usage pattern over the constructed tree likely to be a lazy prefix traversal? If so, then HaXml supports lazy construction of the parse tree. Some plots appear at the end of this paper, showing how memory usage can be reduced to a constant, even for very large inputs (1 million tree
I'm afraid our use case is not a lazy prefix traversal.
I'm more shocked that about 100 MB of XML content does not fit (as a tree) into
3 GB of memory.
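A rough estimate suggests where the memory could go. This is only a sketch under assumptions that are not measurements: the tree keeps its text as String, each character costs one three-word list cell on a 32-bit GHC (with the boxed Char itself shared), and a copying garbage collector may briefly need about twice the live data.

```haskell
-- Back-of-the-envelope estimate; every figure here is an assumption.
wordBytes :: Integer
wordBytes = 4  -- machine word size on a 32-bit system

consWords :: Integer
consWords = 3  -- assumed size of one (:) cell: header, head pointer, tail pointer

inputChars :: Integer
inputChars = 100 * 1000 * 1000  -- about 100 MB of single-byte characters

-- Bytes needed to hold the whole input as String list cells.
stringBytes :: Integer
stringBytes = inputChars * consWords * wordBytes

-- Assumed peak while a copying collector holds both old and new copies.
peakBytes :: Integer
peakBytes = 2 * stringBytes

main :: IO ()
main = do
  print stringBytes
  print peakBytes
```

Under these assumptions the strings alone approach the 3 GB limit before any tree nodes are counted, which is why a packed representation (ByteString or Text) or string sharing may be worth trying.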
Christian
On 20.02.2014 at 16:49, malcolm.wallace wrote:
Is your usage pattern over the constructed tree likely to be a lazy
prefix traversal? If so,
On 20/02/14 11:30, Christian Maeder wrote:
Hi,
I'm having some difficulty parsing large XML files (about 100 MB).
A plain SAX parser, as provided by hexpat, is fine. However,
constructing a tree consumes too much memory on a 32-bit machine.
see