> How does it compare to the python parser (speed and memory usage)?
> E.g. I have tried to convert a quite large document (44MB) into
> Plucker format and ran out of memory when the parser required more
> than 300MB ;-)

The raw size of the document to be converted is less interesting than
the granularity of its structure.  The parser creates an object for
each paragraph and for each URL, each probably several hundred bytes
in size, holding the parse tree of that URL's contents.  In addition,
the text of each URL is of course kept.  So if we had a 10MB HTML
document, of which 5% was markup and the rest text, composed of
100,000 individual URLs (about 100 bytes per URL), we'd need at least
100,000 x 300 bytes of objects (30MB) + 9,500KB of text + 100KB of
Plucker-style markup = 39.6 MB.  And my estimate of 300 bytes/object
could well be off by a factor of two (in either direction, I should
say).
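
Spelled out as a back-of-the-envelope script (all the figures are the
same guesses as above, nothing measured):

    # Rough estimate for a fine-grained 10MB document; all numbers are guesses.
    doc_size      = 10 * 1000 * 1000     # 10MB HTML document
    markup_frac   = 0.05                 # 5% of it is markup, the rest text
    num_urls      = 100000               # individual URLs / paragraphs
    bytes_per_obj = 300                  # rough size of each parse-tree object

    object_bytes   = num_urls * bytes_per_obj      # 30MB of parser objects
    text_bytes     = doc_size * (1 - markup_frac)  # 9,500KB of text kept around
    plucker_markup = 100 * 1000                    # ~100KB of Plucker-style markup

    total = object_bytes + text_bytes + plucker_markup
    print(total / 1e6)    # -> 39.6 (MB)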

In addition, when the document is actually written, a second object
with the externalized form (the Plucker DB record bytes) is created,
which is probably another 300 bytes + content.  All of these are kept
in memory till all records have been converted.  Only then, finally,
is the memory released.  So for a fine-grained 10MB document, you
could see 80-100 MB of memory being used.  For a fine-grained 44MB
document, 300-400MB wouldn't surprise me at all.
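
Schematically, the write-out phase does something like the following
(the names are made up for illustration, not the actual parser code,
but the shape is the point: the parse objects and their serialized
forms both stay alive until the very end):

    def write_document(parsed_records, output):
        # Hypothetical sketch, not the real Plucker writer.
        db_records = []
        for rec in parsed_records:                # parse trees still referenced
            db_records.append(rec.to_db_bytes())  # externalized DB record bytes
        # Only after every record exists twice in memory do we write and free.
        for raw in db_records:
            output.write(raw)
        # parsed_records and db_records become collectable only at this point.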

Things could be re-written in the fashion of a compiler, of course.
You'd have one pass which would build parse trees, and a second which
would build relocatable DB records, and a "linker" which would go
through and put the right record-IDs in the right places in the DB
records, and produce the final output.  Not sure it's worth the effort
just now, though.
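
For what it's worth, the rough shape of that would be something like
this (purely hypothetical names, just to show the passes):

    def convert(urls, output):
        # Pass 1: parse one URL at a time and immediately emit a relocatable
        # record, i.e. serialized bytes with placeholders where record-IDs go,
        # so the full parse trees never have to be held all at once.
        relocatable = []
        for url in urls:
            tree = parse(url)
            relocatable.append(emit_relocatable(tree))

        # Pass 2: assign final record-IDs now that every record is known.
        ids = dict((rec.name, n) for n, rec in enumerate(relocatable, 1))

        # "Link": patch the placeholders with real IDs and write each record.
        for rec in relocatable:
            output.write(rec.resolve(ids))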

Bill
