> When the parser was using more or less all available memory I had to
> kill it. At that time it said "---- 802 collected, 567 to do ----", 
> not bad for a document with about 80 html pages and 20 images ;-)

Hmmmm.  I don't see this under Solaris, with the same parser.  But it
does ramp up to several hundred images, as each of the "Next", "Up",
etc. buttons is getting done over and over for each page.  I see a
max of about 575 records :-).  Remember that email links are also
counted in that number.  There are about 170 total refs in the manual,
counting nodes, email links, and images.

The issue is the bogosity of attribute inheritance in
Spider.SpiderLink, which I was planning to eliminate but haven't
gotten around to yet.  Each of the links around the buttons is given a
different "name", which is then passed as an attribute to the parsing
of the IMG tag, which makes two refs to the same image different
(after all, we don't know that the image parser isn't going to want to
do something with that tag).

The right thing to do is to alter the __init__ proc in
Spider.py/SpiderLink to remove any entry from the dictionary "old"
which has the key "name", as I've already done with "id", "class", and
"href".

I've fixed this and checked it in.

Bill
