Kent Johnson wrote:
>>Bernard Lebel wrote:
>>>Btw in case you wonder, I don't use BeautifulSoup because somehow it
>>>takes 20-30 seconds to parse a 2000-line xml file, and I don't know
>>>why. ElementTree is performing very well.
>
> I took a bit of a look at this using the Python profiler. The most notable
> thing is the staggering number of times some functions are called. The
> first column (ncalls) is the total number of calls of a function. The
> second column (tottime) is the total time spent in the function, not
> counting the time spent in lower-level functions.
>
> If you look at the list, for a while the functions are being called
> 777 times. This is probably the number of start tags in the document.
> But when you get to recursiveChildGenerator(), all of a sudden it is
> called 898655 times, over 1000 times for each call to _fetch()! That
> is a staggering number of calls: it is called 8 times for every
> character in the file!
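
For anyone who wants to produce that kind of call-count listing themselves, here is a minimal sketch using the standard cProfile and pstats modules. The functions leaf() and driver() are made up for illustration and have nothing to do with BeautifulSoup; the sketch assumes a current Python 3.

```python
import cProfile
import io
import pstats


def leaf(n):
    # Cheap function that ends up being called very many times.
    return n + 1


def driver(count):
    # Calls leaf() once per iteration, so leaf's ncalls equals count.
    total = 0
    for i in range(count):
        total += leaf(i)
    return total


profiler = cProfile.Profile()
profiler.enable()
driver(1000)
profiler.disable()

# Collect the report in a string so it can be inspected or printed.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("calls").print_stats(5)  # most-called functions first
report = buffer.getvalue()
print(report)
```

Sorting by call count rather than by time is what makes an unexpectedly hot function stand out in the ncalls column.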

I looked at this again and there is a bug in BS that causes this behaviour.
It's kind of an interesting bug that is a side effect of the way BS uses
introspection to access child tags.

The problem begins at line 790:

    isResetNesting = self.RESET_NESTING_TAGS.has_key(name)

This looks innocent. The problem is that self.RESET_NESTING_TAGS is not
defined. This forces a call to BeautifulStoneSoup.__getattr__(), which calls
Tag.__getattr__() and triggers a search for a child tag called
RESET_NESTING_TAGS. I think Bernard's file has such a hard time with this
because he has quite a few child tags under some of the tags. Each time a
tag is created, the list of tags is iterated again.

Anyway I don't have time for a longer explanation right now. The fix is really
simple - just add the line

    RESET_NESTING_TAGS = {}

after line 586.
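
The reason a one-line class attribute is enough can be shown with another small sketch. Parser and FixedParser are hypothetical names, not the real BS classes: once the attribute exists on the class, normal lookup finds it and __getattr__ is never consulted.

```python
class Parser:
    # Hypothetical stand-in for the parser class; not the real BS source.
    fallback_calls = 0

    def __getattr__(self, attr):
        # Reached only when the attribute is not found by normal lookup.
        Parser.fallback_calls += 1
        return None


class FixedParser(Parser):
    RESET_NESTING_TAGS = {}  # the one-line fix: found by normal lookup


broken = Parser()
fixed = FixedParser()

for _ in range(100):
    broken.RESET_NESTING_TAGS  # missing, so __getattr__ runs each time
calls_without_fix = Parser.fallback_calls

for _ in range(100):
    fixed.RESET_NESTING_TAGS  # defined on the class, no fallback
calls_with_fix = Parser.fallback_calls - calls_without_fix

print(calls_without_fix, calls_with_fix)  # 100 0
```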

I'll send a bug report to the author of BS.

Kent