> There are about 3 copies of everything (source, an abstract parse tree of
> every page which consumes an obscene amount of memory, and a Palm binary
> form) in memory by the time it's just about to write the binary file, as
> well as a number of additional dictionaries of various stuff attached to
> each node.

        Can't we cache these entries off to disk, even in a single flat file
(Berkeley dbm?), and keep in memory only the indices pointing to the records
in that file? When the parser finishes, unroll those entries into the final
pdb and unlink the file(s) from disk. Holding a recursive structure of this
magnitude in memory is really going to hurt as we scale up to parallel
gathering (assuming Python/urllib2 can handle that).
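
        Something along these lines, say (just a rough sketch, not anything
in Spider.py today; the dbm module is "anydbm" on older Pythons, and the
ParseCache name and its methods are made up for illustration):

    import dbm      # "anydbm" on older Pythons
    import pickle

    class ParseCache:
        """Pickle each parsed record into a flat dbm file; keep only ids in RAM."""

        def __init__(self, path):
            self.db = dbm.open(path, "n")   # fresh scratch file on disk
            self.keys = []                  # in-memory index of record ids

        def add(self, record):
            key = str(len(self.keys))
            self.db[key] = pickle.dumps(record)   # the parse tree lives on disk now
            self.keys.append(key)
            return key

        def get(self, key):
            # pull one record back when it's time to write the pdb
            return pickle.loads(self.db[key])

        def close(self):
            self.db.close()
            # the scratch file(s) can be unlinked once the final pdb is written;
            # the exact filename(s) depend on which dbm backend gets picked

        The gatherer would add() each page as soon as it's parsed and drop
the tree, and the pdb writer would walk cache.keys at the end, get() one
record at a time, and emit it. That keeps roughly one copy in memory at any
given moment instead of three.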

> My decision after wrestling with this last fall is that a re-write of
> Spider.py would help a lot, but that any further optimization would have
> to turn it from a single-pass into a multi-pass compiler.

        How about we re-think the design instead of trying to optimize the
existing one? We've all got great ideas about how this "should" work, and
we've all written parsers of our own. It would really benefit us to
standardize on some common design elements and implementations across the
languages in a "2.0" rewrite of the parser family.



d.

