In article <4a501a5e$0$1640$742ec...@news.sonic.net>,
John Nagle <na...@animats.com> wrote:
>
> Here's some actual code, from "tokenizer.py". This is called once
> for each character in an HTML document, when in "data" state (outside
> a tag). It's straightforward code, but look at all those
> dictionary lookups.
>
>     def dataState(self):
>         data = self.stream.char()
>
>         # Keep a charbuffer to handle the escapeFlag
>         if self.contentModelFlag in\
>           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
>             if len(self.lastFourChars) == 4:
>                 self.lastFourChars.pop(0)
>             self.lastFourChars.append(data)
>
>         # The rest of the logic
>         if data == "&" and self.contentModelFlag in\
>           (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
>           self.escapeFlag:
>             self.state = self.states["entityData"]
>         elif data == "-" and self.contentModelFlag in\
>           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
>           self.escapeFlag and "".join(self.lastFourChars) == "<!--":
>             self.escapeFlag = True
>             self.tokenQueue.append({"type": "Characters", "data":data})
>         elif (data == "<" and (self.contentModelFlag ==
>                                contentModelFlags["PCDATA"]
>                                or (self.contentModelFlag in
>                                    (contentModelFlags["CDATA"],
>                                     contentModelFlags["RCDATA"]) and
>                                    self.escapeFlag == False))):
>             self.state = self.states["tagOpen"]
>         elif data == ">" and self.contentModelFlag in\
>           (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
>           self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
>             self.escapeFlag = False
>             self.tokenQueue.append({"type": "Characters", "data":data})
>         elif data == EOF:
>             # Tokenization ends.
>             return False
>         elif data in spaceCharacters:
>             # Directly after emitting a token you switch back to the "data
>             # state". At that point spaceCharacters are important so they are
>             # emitted separately.
>             self.tokenQueue.append({"type": "SpaceCharacters", "data":
>               data + self.stream.charsUntil(spaceCharacters, True)})
>             # No need to update lastFourChars here, since the first space will
>             # have already broken any <!-- or --> sequences
>         else:
>             chars = self.stream.charsUntil(("&", "<", ">", "-"))
>             self.tokenQueue.append({"type": "Characters", "data":
>               data + chars})
>             self.lastFourChars += chars[-4:]
>             self.lastFourChars = self.lastFourChars[-4:]
>         return True

Every single "self." is a dictionary lookup. Were you referring to
those? If not, I don't see your point. If yes, well, that's kind of the
whole point of using Python. You do pay a performance penalty. You can
optimize out some lookups, but you need to switch to C for some kinds of
computationally intensive algorithms. In this case, you can probably
get a large boost out of Psyco or Cython or Pyrex.
-- 
Aahz (a...@pythoncraft.com)           <*>         http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha
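
To make the "optimize out some lookups" remark concrete, here is a minimal sketch of the usual trick: do the attribute and dictionary lookups once, bind the results to local names, and use the locals inside the loop. Everything in it (Scanner, FLAGS, count_slow, count_fast) is a hypothetical stand-in invented for illustration, not the tokenizer's real code, and the transformation is only safe when nothing inside the loop changes the values that were cached.

```python
# Hypothetical stand-in for the tokenizer's contentModelFlags mapping.
FLAGS = {"PCDATA": 0, "RCDATA": 1, "CDATA": 2}


class Scanner:
    """Toy class that counts '&' characters while entities are allowed."""

    def __init__(self, text):
        self.text = text
        self.flag = FLAGS["PCDATA"]
        self.count = 0

    def count_slow(self):
        # Each pass repeats the attribute lookups on self and the
        # subscript lookups on the module-level FLAGS dict.
        self.count = 0
        for ch in self.text:
            if ch == "&" and self.flag in (FLAGS["PCDATA"], FLAGS["RCDATA"]):
                self.count += 1
        return self.count

    def count_fast(self):
        # Same result, but the lookups happen once, before the loop.
        # Assumes self.flag does not change while the loop runs.
        entity_ok = self.flag in (FLAGS["PCDATA"], FLAGS["RCDATA"])
        count = 0
        for ch in self.text:
            if ch == "&" and entity_ok:
                count += 1
        self.count = count
        return count
```

Whether this buys anything noticeable depends on the interpreter and the workload, so time both versions (the standard timeit module is enough) before deciding the rewrite is worth the loss in readability.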