> How does it compare to the python parser (speed and memory usage)?
> E.g. I have tried to convert a quite large document (44MB) into
> Plucker format and ran out of memory when the parser required more
> than 300MB ;-)
The raw size of the document to be converted is less interesting than the granularity of its structure. The parser creates an object for each paragraph and for each URL, each probably several hundred bytes in size, containing the parse tree of that URL's contents. In addition, the text of each URL is of course kept.

So if we had a 10MB HTML document, of which 5% was markup and the rest text, and it in turn was composed of 100,000 individual URLs (100 bytes per URL), we'd need at least 300x100KB + 9,500KB + 100KB (for Plucker-style markup) = 39.6 MB. And my estimate of 300 bytes/object could well be off by a factor of two (in either direction, I should say).

In addition, when the document is actually written, a second object with the externalized form (the Plucker DB record bytes) is created, which is probably another 300 bytes plus content. All of these are kept in memory until all records have been converted; only then is the memory released. So for a fine-grained 10MB document, you could see 80-100 MB of memory being used. For a fine-grained 44MB document, 300-400MB wouldn't surprise me at all.

Things could be re-written in the fashion of a compiler, of course. You'd have one pass which would build parse trees, a second which would build relocatable DB records, and a "linker" which would go through and put the right record-IDs in the right places in the DB records, and produce the final output. Not sure it's worth the effort just now, though.

Bill
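For the curious, the back-of-envelope arithmetic above can be written out as a small Python sketch. The function and parameter names are mine, not anything in the actual parser, and the 300-byte per-object figure is the same rough guess:

```python
OBJECT_OVERHEAD = 300  # rough bytes per parse-tree object; could be off 2x

def estimate_bytes(doc_size, num_urls, markup_fraction=0.05,
                   plucker_markup=100_000):
    """Very rough peak-memory estimate for converting a document of
    doc_size bytes split into num_urls separately parsed URLs."""
    text = doc_size * (1 - markup_fraction)     # retained text (95% here)
    parse_objects = num_urls * OBJECT_OVERHEAD  # parse-tree objects
    one_copy = parse_objects + text + plucker_markup
    # A second, externalized copy (the Plucker DB record bytes) is built
    # and held until every record is converted, roughly doubling this.
    return 2 * one_copy

MB = 1_000_000
print(estimate_bytes(10 * MB, 100_000) / MB)  # prints 79.2
```

One copy of the 10MB example comes to the 39.6 MB quoted above; doubling for the externalized records lands in the 80-100 MB range.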
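The compiler-style rewrite could look something like the sketch below: one pass emits relocatable records that hold symbolic references (URLs), and a separate "link" pass assigns record IDs and patches them in. This is purely illustrative; the names are hypothetical and the real parser does not work this way:

```python
def compile_records(urls):
    """Pass 1: emit one relocatable record per URL. Records store the
    target URLs, not record IDs, so each can be built (and in principle
    flushed to disk) before its targets have IDs assigned."""
    return [{"url": url, "refs": list(links)} for url, links in urls]

def link_records(records):
    """Linker pass: assign record IDs, then patch symbolic refs into
    real IDs in every record."""
    ids = {rec["url"]: i + 1 for i, rec in enumerate(records)}
    for rec in records:
        rec["id"] = ids[rec["url"]]
        rec["refs"] = [ids[u] for u in rec["refs"] if u in ids]
    return records

docs = [("index.html", ["a.html"]), ("a.html", ["index.html"])]
linked = link_records(compile_records(docs))
# linked[0] -> {'url': 'index.html', 'refs': [2], 'id': 1}
```

The memory win would come from not keeping both the parse trees and the externalized records alive at once: each pass could stream its output and discard its input.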