Hello David,

> I tried to use HXT's  readDocument with its  tagsoup option for my
> application.   I couldn't find a way to construct the operation that
> didn't run out of memory.   I'll attach some code using HaXml's
> saxParse so you can see what I want.   Is that easy to do in HXT?
> I simply want the text of <PMID> and <AbstractText> elements.

Here's an example that reads the input
lazily. I ran it in ghci on a file
containing 2^20 XML elements; the file
was about 18 MB in size.
A normal parse with the standard Parsec parser
ran out of memory on my 1 GB box. This
version used about 200 MB at most within ghci.

------------------------------------

module Main where

import Text.XML.HXT.Arrow
import System.Environment (getArgs)

main
    = do
      (name:_) <- getArgs
      runX ( readDoc name
             >>>
             fromLA (deep (hasName "PMID"          -- select the nodes
                           <+>
                           hasName "AbstractText"
                          )
                     >>>
                     getChildren                   -- get the text
                     >>>
                     getText
                    )
             >>>
             arrIO putStrLn
           )
      putStrLn "main finished"

readDoc
    = readDocument [ (a_tagsoup, v_1)
                   , (a_parse_xml, v_1)
                   , (a_remove_whitespace, v_1)
                   , (a_encoding, isoLatin1)
                   , (a_issue_warnings, v_0)
                   , (a_trace, "1")
                   ]

---------------------
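
If you want to try the selection step on its own before running it
over a large file, a small sketch like the following may help. It
selects the text of the PMID and AbstractText elements from an
in-memory string, using xread and runLA. The sample document and the
name selectTexts are made up for illustration, and the import assumes
the same HXT version as the program above (newer HXT releases export
these functions from Text.XML.HXT.Core instead).

```haskell
module Main where

import Text.XML.HXT.Arrow   -- assumption: in newer HXT releases use Text.XML.HXT.Core

-- Hypothetical helper: select the text of PMID and AbstractText
-- elements from an XML fragment given as a string.
selectTexts :: String -> [String]
selectTexts
    = runLA ( xread                         -- parse the XML fragment
              >>>
              deep (hasName "PMID"          -- select the nodes
                    <+>
                    hasName "AbstractText"
                   )
              >>>
              getChildren                   -- get the text
              >>>
              getText
            )

-- Made-up sample input, not real data.
sample :: String
sample = "<Article><PMID>123</PMID><AbstractText>Some text</AbstractText></Article>"

main :: IO ()
main = mapM_ putStrLn (selectTexts sample)
```

For the sample string this should print 123 and Some text on
separate lines. Since everything is in memory here, it says nothing
about laziness or memory use; it is only meant for checking the
selection.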
    

Cheers,

  Uwe Schmidt

-- 

Uwe Schmidt
Web: http://www.fh-wedel.de/~si/
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
