Kent Johnson wrote:
>>Bernard Lebel wrote:
>>>Btw in case you wonder, I don't use BeautifulSoup because somehow it
>>>takes 20-30 seconds to parse a 2000-line xml file, and I don't know
>>>why. ElementTree is proving very performing.
>
> I took a bit of a look at this using the Python profiler. The most notable
> thing is the staggering number of times some functions are called. The
> first column (ncalls) is the total number of calls of a function. The
> second column (tottime) is the total time spent in the function, not
> counting the time spent in lower-level functions.
>
> If you look at the list, for a while the functions are being called
> 777 times. This is probably the number of start tags in the document.
> But when you get to recursiveChildGenerator(), all of a sudden it is
> called 898655 times, over 1000 times for each call to _fetch()! This
> is a staggering number of calls, it is called 8 times for every
> character in the file!
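For anyone who wants to repeat that kind of measurement, here is a minimal sketch of collecting a profile and sorting it by call count, so the ncalls and tottime columns come out at the top of the report. It is written for current Python; parse_document() and profile_report() are hypothetical names standing in for the real parse call, not anything from BS.

```python
import cProfile
import io
import pstats


def parse_document(text):
    # Hypothetical stand-in for the real parse call being measured.
    return [line.split() for line in text.splitlines()]


def profile_report(func, *args, limit=10):
    """Run func under the profiler and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by call count so the most frequently called functions come first.
    stats.sort_stats("ncalls").print_stats(limit)
    return buffer.getvalue()


report = profile_report(parse_document, "a b c\n" * 2000)
print(report)
```

Sorting by "ncalls" rather than by time is what makes a function that is cheap per call but called far too often stand out.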
I looked at this again and there is a bug in BS that causes this behaviour. It's kind of an interesting bug that is a side-effect of the way BS uses introspection to access child tags.

The problem begins at line 790:
    isResetNesting = self.RESET_NESTING_TAGS.has_key(name)

This looks innocent. The problem is that self.RESET_NESTING_TAGS is not defined. This forces a call to BeautifulStoneSoup.__getattr__(), which calls Tag.__getattr__() and triggers a search for a child tag called RESET_NESTING_TAGS.

I think Bernard's file has such a hard time with this because it has quite a few child tags under some of the tags. Each time a tag is created, the list of tags is iterated again.

Anyway I don't have time for a longer explanation right now. The fix is really simple - just add the line
    RESET_NESTING_TAGS = {}
after line 586. I'll send a bug report to the author of BS.

Kent
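
P.S. A toy sketch of the mechanism, for illustration only. This is not the BS source: Tag, BuggySoup, FixedSoup, add() and the visit counter are all made-up names, and it is written for current Python rather than the has_key() style above. It only assumes what is described above, namely that a missing attribute falls through to __getattr__, which then searches the child tags.

```python
class Tag:
    """Toy stand-in for a parser tag; not the real BeautifulSoup class."""

    visits = 0  # total child nodes inspected by __getattr__ fallbacks

    def __init__(self, name):
        self.name = name
        self.children = []

    def __getattr__(self, attr):
        # Reached only when normal attribute lookup fails: treat the name
        # as a child tag and walk the whole subtree looking for it.
        if attr.startswith("__"):
            raise AttributeError(attr)
        for node in self.walk():
            Tag.visits += 1
            if node.name == attr:
                return node
        return None

    def walk(self):
        for child in self.children:
            yield child
            yield from child.walk()


class BuggySoup(Tag):
    def add(self, name):
        # RESET_NESTING_TAGS is not defined anywhere, so this lookup falls
        # through to __getattr__ and scans every tag created so far.
        reset = self.RESET_NESTING_TAGS or {}
        is_reset_nesting = name in reset  # always False in this toy
        self.children.append(Tag(name))


class FixedSoup(BuggySoup):
    # Defining the attribute on the class means lookup succeeds at once.
    RESET_NESTING_TAGS = {}


def count_visits(soup_class, n_tags):
    Tag.visits = 0
    soup = soup_class("root")
    for i in range(n_tags):
        soup.add("tag%d" % i)
    return Tag.visits


print(count_visits(BuggySoup, 100))  # 4950
print(count_visits(FixedSoup, 100))  # 0
```

With 100 flat tags the buggy version inspects 0 + 1 + ... + 99 = 4950 nodes, so the work grows with the square of the number of tags, while the version with the class attribute defined never reaches __getattr__ at all.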