Re: [Haskell-cafe] XML parser recommendation?
"Yitzchak Gale" <[EMAIL PROTECTED]> wrote: > Another question about HaXML and HXT - > what is the level of XML spec. compliance? In HaXml, I certainly tried pretty hard to match the (draft) XML 1.0 spec, since the library was originally developed for a commercial entity. But that was back in 1999, and the spec has changed a little since then. You're right that it is a difficult target to hit though. The spec is unnecessarily complex. The major area of non-compliance in HaXml is that it doesn't do anything with input encodings. That is partly a limitation of Haskell implementations themselves, which have only recently gained libraries for the purpose. Otherwise, Dimitry Astapov has recently been sending me patches to fix some minor compliance problems, along with the QuickCheck properties that discovered them. His additions will appear probably in the next release of HaXml. Let me emphasise that these non-compliances are very minor. Larger issues that remain open do not fall into the compliance spectrum, but are more about useability in practice. For instance: effective support for external catalogs of references; techniques to handle XML namespaces in a sensible fashion, etc. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Rene de Visser wrote: > I think a step towards support medium size documents in HXT would be to > store the tags and content more efficiently. > If I undertand the coding correctly every tag is stored as a seperate > Haskell string. As each byte of a string under GHC takes 12 bytes this alone > leads to high memory usage. Tags tend to repeat. You could store them > uniquely using a hash table. Content could be stored in compressed byte > strings. Yes, storing element and attribute names in a packed format, something similar to ByteString but for unicode values, would reduce the amount of storage. A perhaps small shortcomming of that aproach are the conversions between String and the internal representation when processing names. The hashtable approach would of course reduce memory usage, but this would require a global change of the processing model: A document then does not longer consist of a single tree, it alway consists of a pair of a tree and a map. By the way, the amount of memory used for strings ([Char] values) in Haskell is a problem for ALL text processing tasks. Its not limited HXT, nor is it special to XML. For me the efficieny problems with strings as list of chars and a possible solution by e.g. implementing String data transparenty more efficent than other lists is an issue for Haskell' (or Haskell'') and/or it's a challage for the language implementors. Uwe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Ketil Malde <[EMAIL PROTECTED]> writes: > HaXml on my list after TagSoup, which I'm about to get to work, I > think (got distracted a bit ATM). As it is, I managed to parse my document using TagSoup. One major obstacle was the need to process a sizeable partition of the file. Using 'partitions' from TagSoup (which is implemented using the 'groupBy (const (not . p))' trick) didn't work, as it requires space proportional to the partition size. My solution (and please forgive me, it is getting late at night here) was to replace it with (slightly different semantics alert): breaks :: (a -> Bool) -> [a] -> [[a]] breaks p (x:xs) = let first = x : takeWhile (not.p) xs rest = dropWhile (not.p) xs in rest `par` first : if null rest then [] else breaks p rest I have no idea how reliable this is, and I suspect it isn't very, but on the plus side it does seems to work, at long as I compile with -smp. Parsing 300Mbytes of XML and outputting the information in 305K records takes approximately 5 minutes, and works with less than 1G of heap. This is fast and small enough for my purposes. Thanks for listening, and good night! -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Yitzchak Gale wrote: > Another question about HaXML and HXT - > what is the level of XML spec. compliance? Implementing the XML 1.0 Standard was one of the goals of HXT when starting the project. This includes full support of DTD processing, which turned out to be the hardest part of the whole parsing and validating stuff. Another goal was, to stay as near as posible to the XML spec, that meant, separate the tasks in a clean way and do it step by step: reading, decoding, parsing, substituting entiies, implementing the include mechanism wiht external references, validating and normalizing. In a second step we added a HTML parser, again with parsec, to be able to process none standard XML. Again we hadn't in mind to process "very large" XML documents. There is no technical reason of adding 3. parser (or a 4. one) accepting something like XML, perhaps without the DTD suff, which works lazily. The only reason not yet having done this was lack of time and manpower. So, dear Haskeller, feel free to participate to HXT by developping a lazy parser and we will integarte it into HXT. This still does not solve the processing of "very very large" XML document. I doubt, whether we can do this with a DOM like approach, as in HXT or HaXml. Lazy input does not solve all problems. A SAX like parser could be a more useful choice for very large documents. Uwe ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Malcolm Wallace wrote: > I have been considering moving the lexer to use > ByteString instead of String, which would neatly solve that problem too. Doesn't the lexer come only after decoding? Then you have Unicode. Does ByteString still help? -Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Another question about HaXML and HXT - what is the level of XML spec. compliance? The many who have tried to implement compliant XML parsers in various languages - and the few who have succeeded - all agree that this is much harder than it seems at first. Most of the time, the final result is an FFI call to expat. Thanks, Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
"Yitzchak Gale" <[EMAIL PROTECTED]> wrote: > > In some usage patterns, it can reduce > > the cost of processing from linear in the size of the document, to a > > constant (the distance into the document to find a particular > > element). > > Oh oh - does that mean that Ketil's original case > (an element containing a large quantity of CDATA) could > still be a problem? Not necessarily. If the CDATA is not actually needed, it is possible that it would simply be discarded automatically by the lazy demand pattern. That does depend very much on how the consumer is written however. HaXml will still require a rather large amount of space to _lex_ the 25k line text element into a single token of course (I estimate no bigger than about 3Mb though). I have been considering moving the lexer to use ByteString instead of String, which would neatly solve that problem too. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
"Yitzchak Gale" <[EMAIL PROTECTED]> writes: >> In some usage patterns, it can reduce >> the cost of processing from linear in the size of the document, to a >> constant (the distance into the document to find a particular element). > Oh oh - does that mean that Ketil's original case > (an element containing a large quantity of CDATA) could > still be a problem? Don't know how interested you all are, but it's not just (P)CDATA, but a subsection (enclosed by a specific tag) that I need to process as a whole. HaXml on my list after TagSoup, which I'm about to get to work, I think (got distracted a bit ATM). -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Malcolm Wallace wrote: > HaXml now uses the polyparse library, and you can choose whether you > want well-formedness checking with the original strict parser, or lazy > space-efficient on-demand parsing. Initial performance results show > that parsing XML lazily is always better than 2x as fast, and 1/2x peak > memory usage of strict parsing. Hey, that's great news! > In some usage patterns, it can reduce > the cost of processing from linear in the size of the document, to a > constant (the distance into the document to find a particular element). Oh oh - does that mean that Ketil's original case (an element containing a large quantity of CDATA) could still be a problem? Parsing the CDATA ought to be possible to delay if needed. Thanks, Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
"Yitzchak Gale" <[EMAIL PROTECTED]> wrote: > Henning Thielemann wrote: > > HXT uses Parsec, which is strict. > > Is is strict to the extent that it cannot produce any > output at all until it has read the entire XML document? > That would make HXT (and Parsec, for that matter) > useless for a large percentage of tasks. Yes, and yes. By contrast, the Utrecht parser combinator library gives "online" results, meaning that it delivers as much as it can without ambiguity. It is a bit like laziness, but it analyses the grammar to determine when it is safe to commit to a value, essentially once no error has been seen in a prefix of the input. And the polyparse library has several variations of properly lazy parsers, which only return results on demand (but there might be parse errors hidden inside the returned values, as exceptions). The user (grammar-writer) decides where the results should be lazy or strict. HaXml now uses the polyparse library, and you can choose whether you want well-formedness checking with the original strict parser, or lazy space-efficient on-demand parsing. Initial performance results show that parsing XML lazily is always better than 2x as fast, and 1/2x peak memory usage of strict parsing. In some usage patterns, it can reduce the cost of processing from linear in the size of the document, to a constant (the distance into the document to find a particular element). I have just made fresh releases of development versions of these libraries, for those who would like to experiment. http://www.cs.york.ac.uk/fp/polyparse http://www.cs.york.ac.uk/fp/HaXml-devel They are also available on hackage.haskell.org. Regards, Malcolm ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
On Mon, 22 Oct 2007, Yitzchak Gale wrote: > Henning Thielemann wrote: > > HXT uses Parsec, which is strict. > > Is is strict to the extent that it cannot produce any > output at all until it has read the entire XML document? Yep. You might like to check my XML wrapper library which interfaces to HaXML, HXT and TagSoup (providing the most lazy XML parser we have). It uses its own structure and cannot map all features of HaXML and HXT. But feel free to extend it as you need. I wrote it for HTML processing, thus it is maybe not optimal for other XML applications. http://darcs.haskell.org/wraxml/ ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Henning Thielemann wrote: > HXT uses Parsec, which is strict. Is is strict to the extent that it cannot produce any output at all until it has read the entire XML document? That would make HXT (and Parsec, for that matter) useless for a large percentage of tasks. Or is it just too strict in certain cases - like the case described by Ketil, where a certain element with a huge amount of CDATA content blows up the stack? That could probably be worked around. Thanks, Yitz ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
On Mon, 22 Oct 2007, Ketil Malde wrote: > I'm wondering what approach others use for non-toy XML data. Is the > problem due to some error I have made, or should I just ignore the > XML, and just parse it "manually" by dissecting bytestrings, or will > another XML library serve better? HXT uses Parsec, which is strict. ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
Re: [Haskell-cafe] XML parser recommendation?
Hi Ketil, > I'm struggling to get my HXT-based parser to parse a largish file > (<300MB), even after breaking into reasonably-sized chunks. The > culprit appears to be parsing one element comprising 25K lines of > text, which apparently requires more memory than the 2Gb my computer > is equipped with. You can try TagSoup (http://www-users.cs.york.ac.uk/~ndm/tagsoup/) which isn't really a complete XML parser, but may do what you want. The other option is HaXml which Malcolm has been adding lazy parsing to - I'm not sure if that is in a released variant or not. Thanks Neil ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe
[Haskell-cafe] XML parser recommendation?
Hi, I'm struggling to get my HXT-based parser to parse a largish file (<300MB), even after breaking into reasonably-sized chunks. The culprit appears to be parsing one element comprising 25K lines of text, which apparently requires more memory than the 2Gb my computer is equipped with. I'm wondering what approach others use for non-toy XML data. Is the problem due to some error I have made, or should I just ignore the XML, and just parse it "manually" by dissecting bytestrings, or will another XML library serve better? -k -- If I haven't seen further, it is by standing in the footprints of giants ___ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe