Re: Code that ought to run fast, but can't due to Python limitations.
Steven D'Aprano wrote: On Sun, 05 Jul 2009 01:58:13 -0700, Paul Rubin wrote: Steven D'Aprano st...@remove-this-cybersource.com.au writes: Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't see how a case statement would help you here: you're not dispatching on a value, but running through a series of tests until one passes. A case statement switch(x):... into a bunch of constant case labels would be able to use x as an index into a jump vector, and/or do an unrolled logarithmic (bisection-like) search through the tests, instead of a linear search. Yes, I'm aware of that, but that's not what John's code is doing -- he's doing a series of if expr ... elif expr tests. I don't think a case statement can do much to optimize that. (I didn't write that code; it's from http://code.google.com/p/html5lib/, which is a general purpose HTML 5 parser written in Python. It's compatible with ElementTree and/or BeautifulSoup. I currently use a modified BeautifulSoup for parsing real-world HTML in a small-scale crawler, and I'm looking at this as an HTML 5 compatible replacement.) John Nagle -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Steven D'Aprano wrote: On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever the problem is it's not going to be as simple as that. I didn't write this code; I'm just using it. As I said in the original posting, it's from http://code.google.com/p/html5lib/. It's from an effort to write a clean HTML 5 parser in Python for general-purpose use. HTML 5 parsing is well-defined for the awful cases that make older browsers incompatible, but quite complicated. The Python implementation here is intended partly as a reference implementation, so browser writers have something to compare with. I have a small web crawler robust enough to parse real-world HTML, which can be appallingly bad. I currently use an extra-robust version of BeautifulSoup, and even that sometimes blows up. So I'm very interested in a new Python parser which supposedly handles bad HTML in the same way browsers do. But if it's slower than BeautifulSoup, there's a problem. John Nagle -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle wrote: I have a small web crawler robust enough to parse real-world HTML, which can be appallingly bad. I currently use an extra-robust version of BeautifulSoup, and even that sometimes blows up. So I'm very interested in a new Python parser which supposedly handles bad HTML in the same way browsers do. But if it's slower than BeautifulSoup, there's a problem. Well, if performance matters in any way, you can always use lxml's blazingly fast parser first, possibly trying a couple of different configurations, and only if all fail, fall back to running html5lib over the same input. That should give you a tremendous speed-up over your current code in most cases, while keeping things robust in the hard cases. Note the numbers that Ian Bicking has for HTML parser performance: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/ You should be able to run lxml's parser ten times in different configurations (e.g. different charset overrides) before it even reaches the time that BeautifulSoup would need to parse a document once. Given that undeclared character set detection is something where BS is a lot better than lxml, you can also mix the best of both worlds and use BS's character set detection to configure lxml's parser if you notice that the first parsing attempts fail. And yes, html5lib performs pretty badly in comparison (or did, at the time). But the numbers seem to indicate that if you can drop the ratio of documents that require a run of html5lib below 30% and use lxml's parser for the rest, you will still be faster than with BeautifulSoup alone. Stefan -- http://mail.python.org/mailman/listinfo/python-list
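Stefan's fast-parser-first strategy can be sketched as below. The `strict` and `robust` functions are hypothetical stand-ins for lxml parser configurations and html5lib respectively, since the real library calls aren't shown in this thread:

```python
def parse_html(data, fast_parsers, fallback):
    """Try each fast parser configuration in turn; only if all of
    them fail, run the slow-but-robust fallback parser."""
    for parse in fast_parsers:
        try:
            return parse(data)
        except ValueError:       # stand-in for a parser failure
            continue
    return fallback(data)

# Hypothetical stand-ins: a strict fast parser and a robust slow one.
def strict(data):
    if "bad" in data:
        raise ValueError("parse failed")
    return ("fast", data)

def robust(data):
    return ("slow", data)

clean = parse_html("<p>ok</p>", [strict], robust)    # fast path taken
messy = parse_html("<p>bad</p>", [strict], robust)   # falls back
```

In a real crawler the `fast_parsers` list would hold several wrappers around the fast parser with different charset overrides, as Stefan suggests.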
Re: Code that ought to run fast, but can't due to Python limitations.
Stefan Behnel wrote: John Nagle wrote: I have a small web crawler robust enough to parse real-world HTML, which can be appallingly bad. I currently use an extra-robust version of BeautifulSoup, and even that sometimes blows up. So I'm very interested in a new Python parser which supposedly handles bad HTML in the same way browsers do. But if it's slower than BeautifulSoup, there's a problem. Well, if performance matters in any way, you can always use lxml's blazingly fast parser first, possibly trying a couple of different configurations, and only if all fail, fall back to running html5lib over the same input. Detecting "fail" is difficult. A common problem is badly terminated comments which eat most of the document if you follow the spec. The document seems to parse correctly, but most of it is missing. The HTML 5 spec actually covers things like "<!This is a bogus SGML directive>" and treats it as a bogus comment. (That's because HTML 5 doesn't include general SGML; the only directive recognized is DOCTYPE. Anything else after "<!" is treated as a token-level error.) So using an agreed-upon parsing method, in the form of html5lib, is desirable, in that it should mimic browser behavior. John Nagle -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Martin v. Löwis martin at v.loewis.de writes: This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. I looked at it with cProfile, and the top function that comes up for a larger document (52k) is ...validator.HTMLConformanceChecker.__iter__. [...] With this simple optimization, I get a 20% speedup on my test case. In my document, there are no attributes - the same changes should be made to attribute validation routines. I don't think this has anything to do with the case statement.

I agree. I ran cProfile over just the tokenizer step; essentially

    tokenizer = html5lib.tokenizer.HTMLStream(htmldata)
    for tok in tokenizer:
        pass

It mostly *isn't* tokenizer.py that's taking the most time, it's inputstream.py. (There is one exception: tokenizer.py:HTMLStream.__init__ constructs a dictionary of states each time -- this is unnecessary; replace all expressions like self.states["attributeName"] with self.attributeNameState.)

I've done several optimisations -- I'll upload the patch to the html5lib issue tracker. In particular:

* The .position property of EncodingBytes is used a lot. Every "self.position += 1" calls getPosition() and setPosition(). Another getPosition() call is done in the self.currentByte property. Most of these can be optimised away by using methods that move the position and return the current byte.

* In HTMLInputStream, the current line number and column are updated every time a new character is read with .char(). The current position is *only* used in error reporting, so I reworked it to only calculate the position when .position() is called, by keeping track of the number of lines in previous read chunks, and computing the number of lines to the current offset in the current chunk.

These give me about a 20% speedup.
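For readers who want to reproduce this kind of measurement, a minimal cProfile run looks roughly like this; the `tokenize` function here is only a toy stand-in for iterating html5lib's stream:

```python
import cProfile
import io
import pstats

def tokenize(text):
    # toy stand-in for "for tok in tokenizer: pass"
    return [c for c in text if not c.isspace()]

profiler = cProfile.Profile()
profiler.enable()
tokens = tokenize("<p>hello world</p>" * 1000)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
report = out.getvalue()   # top functions by cumulative time
```

Sorting by cumulative time is what surfaces hot spots like `__iter__` above; sorting by `tottime` instead shows where the time is spent excluding callees.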
This just illustrates that the first step in optimisation is profiling :D

As other posters have said, slurping the whole document into memory and using a regexp-based parser (such as pyparsing) would likely give you the largest speedups. If you want to keep the chunk-based approach, you can still use regexps, but you'd have to think about matching on chunk boundaries. One way would be to guarantee a minimum number of characters available, say 10 or 50 (unless end-of-file, of course) -- long enough such that any *constant* string you'd want to match, like <![CDATA[, would fit inside that guaranteed length. Any arbitrary-length tokens (such as attribute names and values) would be matched, not with regexps like [a-z]+, but with [a-z]{1,10} (match [a-z] from 1 to 10 times), joining the individual matches together to make one token.

Since html5lib has several implementations for several languages, it may actually be worth it to generate lexers for each language from one specification file.

Take care, David M. Cooke david.m.co...@gmail.com -- http://mail.python.org/mailman/listinfo/python-list
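Cooke's bounded-quantifier idea can be sketched like this: `[a-z]{1,10}` never matches more than 10 characters at a time, so a token that straddles a chunk boundary can be stitched together from pieces. The function name and test data are invented for the demo:

```python
import re

# Match at most 10 letters at a time so a long token that straddles a
# chunk boundary can be stitched together from bounded pieces.
LETTERS = re.compile(r"[a-z]{1,10}")

def words_across_chunks(chunks):
    words, partial = [], ""
    for chunk in chunks:
        pos = 0
        while pos < len(chunk):
            m = LETTERS.match(chunk, pos)
            if m:
                partial += m.group()
                pos = m.end()
                # only close the token if the next char is a non-letter
                if pos < len(chunk) and not ("a" <= chunk[pos] <= "z"):
                    words.append(partial)
                    partial = ""
            else:
                if partial:
                    words.append(partial)
                    partial = ""
                pos += 1            # skip the non-letter character
    if partial:
        words.append(partial)      # flush the token at end of input
    return words

split_words = words_across_chunks(["hel", "lo wor", "ld"])
```

Here "hel" + "lo" across the first boundary and "wor" + "ld" across the second are joined back into whole tokens.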
Re: Code that ought to run fast, but can't due to Python limitations.
In message 4a4f91f9$0$1587$742ec...@news.sonic.net, John Nagle wrote: ("It should be written in C" is not an acceptable answer.) I don't see why not. State machines that have to process input byte by byte are well known to be impossible to implement efficiently in high-level languages. That's why lex/flex isn't worth using. Even GCC has been moving to hand-coded parsers, first the lexical analyzer, and more recently even for the syntax parser (getting rid of yacc/bison), certainly for C. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
    protocol = {"start": initialiser, "hunt": hunter,
                "classify": classifier, ...other states...}

    def state_machine():
        next_step = protocol["start"]()
        while True:
            next_step = protocol[next_step]()

Woot! I'll keep this one in mind. While I may not be as concerned by speed as the OP, I still find this way of doing it very simple and intuitive (one could successfully argue that I was not figuring this out by myself, if it was so intuitive). Anyway, I wanted to participate in this thread; as soon as I saw 'due to python limitations' in the title, I foretold a hell of a thread! This is just provocation! :-) JM -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Jean-Michel Pichavant jeanmic...@sns.com wrote: Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike the OP, I still find this way of doing very simple and so intuitive (one will successfully argue how I was not figuring this out by myself if it was so intuitive). Anyway I wanted to participated to this thread, as soon as I saw 'due to python limitations' in the title, I foretold a hell of a thread ! This is just provocation ! :-) The OP was not being provocative - he has a real problem, and the code he is complaining about already does more or less what my snippet showed, as I rushed in where angels fear to tread... The bit that was not clearly shown in what I proposed, is that you should stay in the individual states, testing for the reasons for the state transitions, until it is time to change - so there is a while loop in each of the individual states too. It becomes a terribly big structure if you have a lot of states, it duplicates a lot of tests across the different states, and it is very awkward if the states nest. Have a look at the one without the dict too - it is even faster as it avoids the dict lookup. That, however, is a bit like assembler code, as it kind of jumps from state to state, and there is no central thing to show what does, and what does not, belong together, as there is no dict. Not an easy beast to fix if it's big and it's wrong. - Hendrik -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
a...@pythoncraft.com (Aahz) writes: In article mailman.2639.1246802753.8015.python-l...@python.org, Hendrik van Rooyen m...@microcorp.co.za wrote: But wait - maybe if he passes an iterator around - the equivalent of for char in input_stream... Still no good though, unless the next call to the iterator is faster than an ordinary python call. Calls to iterators created by generators are indeed faster than an ordinary Python call, because the stack frame is already mostly set up. I think Beazley demonstrated this in his talk on using the Python 2.5 coroutines to set up an XML parser. I believe he benchmarked it roughly and the initial results were rather impressive. http://www.dabeaz.com/coroutines/ -- http://mail.python.org/mailman/listinfo/python-list
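The generator-versus-call overhead can be sanity-checked with a rough micro-benchmark like the following. Absolute timings vary by machine and interpreter, so only the equivalence of the results is asserted here; the function names are invented for the demo:

```python
import timeit

TEXT = "x" * 100000

def handle(c):
    return c

def by_calls():
    # one ordinary function call per character
    return [handle(c) for c in TEXT]

def char_gen(text):
    # the generator's frame is set up once and merely resumed per char
    for c in text:
        yield c

def by_generator():
    return list(char_gen(TEXT))

t_calls = timeit.timeit(by_calls, number=10)
t_gen = timeit.timeit(by_generator, number=10)
```

Comparing `t_calls` and `t_gen` gives a feel for how much of the per-character cost is frame setup, which is the effect Beazley's coroutine talk exploits.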
Re: Code that ought to run fast, but can't due to Python limitations.
Hendrik van Rooyen wrote: Jean-Michel Pichavant jeanmic...@sns.com wrote: Woot ! I'll keep this one in my mind, while I may not be that concerned by speed unlike the OP, I still find this way of doing very simple and so intuitive (one will successfully argue how I was not figuring this out by myself if it was so intuitive). Anyway I wanted to participated to this thread, as soon as I saw 'due to python limitations' in the title, I foretold a hell of a thread ! This is just provocation ! :-) The OP was not being provocative - he has a real problem. I was just kidding; asserting Python limitations in this list guarantees that the thread will last for several days, whether or not the assertion is right. JM -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle na...@animats.com wrote: As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see tokenizer.py in http://code.google.com/p/html5lib/ This is a parser for HTML 5, a piece of code that will be needed in many places and will process large amounts of data. It's written entirely in Python. Take a look at how much work has to be performed per character. This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. (It should be written in C is not an acceptable answer.) You could compile it with Cython though. lxml took this route... -- Nick Craig-Wood n...@craig-wood.com -- http://www.craig-wood.com/nick -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle na...@...ats.com wrote: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version of the parser to decide in which order the IF statements should be executed.

You shouldn't have to do that. You do not have to implement a state machine in a case statement, or in a series of if... elifs. Python is not C. Use a dispatch dict, and have each state return the next state. Then you can use strings representing state names, and everybody will be able to understand the code. Toy example, not tested, nor completed:

    protocol = {"start": initialiser, "hunt": hunter,
                "classify": classifier, ...other states...}

    def state_machine():
        next_step = protocol["start"]()
        while True:
            next_step = protocol[next_step]()

Simple, and almost as fast as if you did the same thing in assembler using pointers. And each state will have a finite set of reasons to either stay where it's at, or move on. Not a lot you can do about that, but test for them one at a time. But at least you will have split the problem up, and you won't be doing irrelevant tests. You can even do away with the dict, by having each state return the actual next state routine:

    next_state = protocol_initialiser()
    while True:
        next_state = next_state()
        time.sleep(0.001)  # this prevents thrashing when monitoring real events

If you are using a GUI and you have access to an after callback, then you can make a ticking stutter thread to run some monitoring machine in the background, using the same "tell me what to do next" technique. To take the timing thing further, you can do:

    wait_time, next_state = protocol_initialiser()
    while True:
        if wait_time:
            time.sleep(wait_time)
        wait_time, next_state = next_state()

This gives you control over busy-wait loops, and lets you speed up when it is needed. Python really is not C.
- Hendrik -- http://mail.python.org/mailman/listinfo/python-list
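Hendrik's toy can be completed into something runnable. The state names, the event list, and the "done" sentinel are all invented for the demo:

```python
# A runnable completion of the toy: each state returns the name of the
# next state, looked up in a dispatch dict; "done" stops the loop.
def run_machine(events):
    trace = []
    it = iter(events)

    def initialiser():
        trace.append("start")
        return "hunt"

    def hunter():
        ev = next(it, None)          # consume one event, None at end
        trace.append("hunt:%s" % ev)
        return "classify" if ev is not None else "done"

    def classifier():
        trace.append("classify")
        return "hunt"

    protocol = {"start": initialiser, "hunt": hunter,
                "classify": classifier}

    next_step = protocol["start"]()
    while next_step != "done":
        next_step = protocol[next_step]()
    return trace

trace = run_machine(["a", "b"])
```

The trace shows the machine bouncing between "hunt" and "classify" once per event before terminating.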
Re: Code that ought to run fast, but can't due to Python limitations.
On Sat, 04 Jul 2009 20:15:11 -0700, John Nagle wrote: Paul Rubin wrote: John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. Maybe you could use a regexp (and then have -two- problems...) to find the token boundaries, then a dict to identify the actual token. Tables of character classes seem a bit less attractive in the Unicode era than in the old days. I want to see a regular expression that expresses the HTML 5 token parsing rules, including all the explicitly specified error handling.

Obviously the regex can't do the error handling. Nor should you expect a single regex to parse an entire HTML document. But you could (perhaps) use regexes to parse pieces of the document, as needed. Have you investigated the pyparsing module? Unless you have some reason for avoiding it, for any complicated parsing job I'd turn to that before trying to roll your own. Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups. Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't see how a case statement would help you here: you're not dispatching on a value, but running through a series of tests until one passes. There are languages where you can write something like:

    case:
        x > 0:  process_positive(x)
        x < 0:  process_negative(x)
        x == 0: process_zero(x)

but that's generally just syntactic sugar for the obvious if...elif... block. (Although clever compilers might recognise that it's the same x in each expression, and do something clever to optimize the code.)

Nor do I see why you were complaining about Python not having true constants. I don't see how that would help you -- most of your explicit dict lookups are against string literals, e.g. contentModelFlags["RCDATA"]. So while I feel your pain, I'm not sure I understand why you're blaming this on *Python*. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle wrote: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. Cython has a built-in optimisation that maps if-elif-else chains to C's switch statement if they only test a single int/char variable, even when you write things like elif x in [1,5,9,12]. This works in Cython, because we know that the comparison to a C int/char is side-effect free. It may not always be side-effect free in Python, so this won't work in general. It would be perfect for your case, though. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Steven D'Aprano st...@remove-this-cybersource.com.au writes: Okay, we get it. Parsing HTML 5 is a bitch. What's your point? I don't see how a case statement would help you here: you're not dispatching on a value, but running through a series of tests until one passes. A case statement switch(x):... into a bunch of constant case labels would be able to use x as an index into a jump vector, and/or do an unrolled logarithmic (bisection-like) search through the tests, instead of a linear search. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle wrote: Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups.

    def dataState(self):
        data = self.stream.char()
        # Keep a charbuffer to handle the escapeFlag
        if self.contentModelFlag in\
          (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):

Is the tuple (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) constant? If that is the case, I'd cut it out into a class member (or module-local variable) first thing in the morning. And I'd definitely keep the result of the "in" test in a local variable for reuse, seeing how many times it's used in the rest of the code. Writing inefficient code is not something to blame the language for. Stefan -- http://mail.python.org/mailman/listinfo/python-list
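Stefan's suggestion can be sketched as follows, with made-up flag values (the real contentModelFlags mapping lives in html5lib's constants module):

```python
contentModelFlags = {"PCDATA": 0, "RCDATA": 1, "CDATA": 2}  # made-up values

class Tokenizer:
    # built once at class-creation time instead of once per character
    CDATA_OR_RCDATA = (contentModelFlags["CDATA"],
                       contentModelFlags["RCDATA"])

    def __init__(self, flag):
        self.contentModelFlag = flag

    def dataState(self):
        # keep the result of the membership test for reuse
        in_cdata = self.contentModelFlag in self.CDATA_OR_RCDATA
        return in_cdata
```

The class attribute is looked up once per call instead of rebuilding the tuple (and doing two dict lookups) every time.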
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle wrote: Paul Rubin wrote: John Nagle na...@animats.com writes: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version of the parser to decide in which order the IF statements should be executed. You shouldn't have to do that. In that particular program it would probably be better to change those if/elif/elif/else constructs to dictionary lookups. I see the program already does that for some large tables. A dictionary lookup (actually, several of them) for every input character is rather expensive. Did you implement this and prove your claim in benchmarks? Taking a look at the current implementation, I'm pretty sure a dict-based implementation would outrun it in your first try. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever the problem is it's not going to be as simple as that. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Steven D'Aprano st...@remove-this-cybersource.com.au writes: Yes, I'm aware of that, but that's not what John's code is doing -- he's doing a series of if expr ... elif expr tests. I don't think a case statement can do much to optimize that. The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. It could be that using ord(c) as an index into a list of functions might be faster than a dict lookup on c to get a function. I think John is hoping to avoid a function call and instead get an indexed jump within the Python bytecode for the big function. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Stefan Behnel wrote: John Nagle wrote: Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups.

    def dataState(self):
        data = self.stream.char()
        # Keep a charbuffer to handle the escapeFlag
        if self.contentModelFlag in\
          (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):

Is the tuple (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) constant? If that is the case, I'd cut it out into a class member (or module-local variable) first thing in the morning.

Ah, and there's also this little trick to make it a (fast) local variable in that method:

    def some_method(self, some_const=(1,2,3,4)):
        ...

Stefan -- http://mail.python.org/mailman/listinfo/python-list
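The trick works because default arguments are evaluated once, at function definition time, and then read through the fast local-variable path rather than a global lookup. A sketch with a dummy constant:

```python
GLOBAL_CONST = (1, 2, 3, 4)

class C:
    def uses_global(self, x):
        return x in GLOBAL_CONST          # global name lookup per call

    def uses_default(self, x, _const=(1, 2, 3, 4)):
        return x in _const                # LOAD_FAST local lookup

c = C()
```

Both methods return the same answers; `timeit` can show whether the lookup difference matters for a given workload. The leading underscore signals that `_const` is not meant to be passed by callers.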
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin wrote: Steven D'Aprano writes: Yes, I'm aware of that, but that's not what John's code is doing -- he's doing a series of if expr ... elif expr tests. I don't think a case statement can do much to optimize that. The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. Although doing some of the tests first and then checking the input conditionally might be faster here. Another idea: You could exchange the methods whenever self.contentModelFlag changes, i.e. you'd have a dataState_CDATA, a dataState_PCDATA etc. It could be that using ord(c) as an index into a list of functions might be faster than a dict lookup on c to get a function. Rather unlikely, given that calling ord(c) involves a dict lookup for ord. You might get away with the pre-initialised keywords trick, though. I think John is hoping to avoid a function call and instead get an indexed jump within the Python bytecode for the big function. Hmm, yes, the actual code inside the conditionals is pretty short, so the call overhead might hurt here. Stefan -- http://mail.python.org/mailman/listinfo/python-list
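Stefan's method-exchanging idea in miniature: rebind the state handler once when contentModelFlag changes, so the per-character code stops re-testing the flag. The class, state names, and return values here are invented for the demo:

```python
class MiniTokenizer:
    """Sketch: swap in a specialised dataState method per content
    model flag instead of testing the flag on every character."""

    def __init__(self):
        self.setContentModelFlag("PCDATA")

    def setContentModelFlag(self, flag):
        # one dict lookup per *flag change*, not per character
        self.dataState = {"PCDATA": self.dataState_PCDATA,
                          "CDATA": self.dataState_CDATA}[flag]

    def dataState_PCDATA(self, ch):
        return "entityData" if ch == "&" else "data"

    def dataState_CDATA(self, ch):
        return "data"   # "&" is not special inside CDATA

tok = MiniTokenizer()
pcdata_result = tok.dataState("&")
tok.setContentModelFlag("CDATA")
cdata_result = tok.dataState("&")
```

The hot loop just calls `self.dataState(ch)`; all flag-dependent branching has been hoisted out of the per-character path.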
Re: Code that ought to run fast, but can't due to Python limitations.
Stefan Behnel stefan...@behnel.de writes:

    # Keep a charbuffer to handle the escapeFlag
    if self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):

Is the tuple (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) constant? If that is the case, I'd cut it out into a class member ...

I think the main issue for that function comes after that if statement. There is a multi-way switch on a bunch of different possible character values. I do agree with you that the first if can also be optimized. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Stefan Behnel stefan...@behnel.de writes: The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. Although doing some of the tests first and then checking the input conditionally might be faster here. That is essentially what happens. There are a bunch of tests of the form

    if data == "&" and [some other stuff]: ...

Because of short-circuit evaluation of "and", the additional tests only happen once the character has been matched. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin wrote: Stefan Behnel writes: The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. Although doing some of the tests first and then checking the input conditionally might be faster here. That is essentially what happens. There are a bunch of tests of the form if data == "&" and [some other stuff]: ... That's what I meant. Some of the "other stuff" is redundant enough to do it once at the beginning of the function (or even before entering the function, by writing specialised methods), i.e. I'd (partially) reverse the order of the "and" operands. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin wrote: Stefan Behnel writes: # Keep a charbuffer to handle the escapeFlag if self.contentModelFlag in\ (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]): Is the tuple (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) constant? If that is the case, I'd cut it out into a class member ... I think the main issue for that function comes after that if statement. There is a multi-way switch on a bunch of different possible character values. I do agree with you that the first if can also be optimized. You may notice that the creation of this exact tuple appears in almost all of the conditionals of this method. So it is part of the bottleneck. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Stefan Behnel stefan...@behnel.de writes: You may notice that the creation of this exact tuple appears in almost all of the conditionals of this method. So it is part of the bottleneck. I don't think so. The tuple is only created when the character has already matched, and for the vast majority of the chars in the input stream (ordinary text chars rather than html delimiters) none of them match. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin wrote: Stefan Behnel writes: You may notice that the creation of this exact tuple appears in almost all of the conditionals of this method. So it is part of the bottleneck. I don't think so. The tuple is only created when the character has already matched, and for the vast majority of the chars in the input stream (ordinary text chars rather than html delimiters) none of them match. Well, it's the second thing that happens when entering the method, it happens several times later on when specific characters are matched, and it also happens at the end when none of the special characters did match. So it /is/ part of the bottleneck, because the dict lookups, the tuple creation, the "in" test and the tuple deallocation happen *twice* for almost all characters in the stream. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
On Jul 5, 3:12 am, Hendrik van Rooyen m...@microcorp.co.za wrote: Use a dispatch dict, and have each state return the next state. Then you can use strings representing state names, and everybody will be able to understand the code. toy example, not tested, nor completed:

    protocol = {"start": initialiser, "hunt": hunter,
                "classify": classifier, ...other states...}

    def state_machine():
        next_step = protocol["start"]()
        while True:
            next_step = protocol[next_step]()

I've just spent about an hour looking over this code, with a few comments to inject to the thread here:

- To all those suggesting the OP convert to a dispatch table, be assured that this code is well aware of this idiom. It is used HEAVILY at a macro level, picking through the various HTML states (starting a tag, reading attributes, reading body, etc.). There still are a number of cascading if-elifs within some of these states, and some of them *may* be candidates for further optimization.

- There is an underlying HTMLInputStream that seems to be doing some unnecessary position bookkeeping (positionLine and positionCol). Commenting this out increases my test speed by about 13%. In my ignorance, I may be removing some important behavior, but this does not seem to be critical as I tested against a few megs of HTML source. Before blaming the tokenizer for everything, there may be more performance to be wrung from the input stream processor. For that matter, I would guess that about 90% of all HTML files that this code would process would easily fit in memory - in that case, the stream processing (and all of the attendant "if I'm not at the end of the current chunk" code) could be skipped/removed entirely.

- The HTMLInputStream's charsUntil code is an already-identified bottleneck, and some re enhancements have been applied here to help out.

- Run-time construction of tuple literals where the tuple members are constants can be lifted out.
emitCurrentToken rebuilds this tuple every time it is called (which is a lot!):

    if (token["type"] in (tokenTypes["StartTag"], tokenTypes["EndTag"],
                          tokenTypes["EmptyTag"])):

Move this tuple literal into a class constant (or if you can tolerate it, a default method argument to gain LOAD_FAST benefits - sometimes optimization isn't pretty).

- These kinds of optimizations are pretty small, and only make sense if they are called frequently. Tallying which states are called in my test gives the following list in decreasing frequency. Such a list would help guide your further tuning efforts:

    tagNameState                               194848
    dataState                                  182179
    attributeNameState                         116507
    attributeValueDoubleQuotedState            114931
    tagOpenState                               105556
    beforeAttributeNameState                    58612
    beforeAttributeValueState                   58216
    afterAttributeValueState                    58083
    closeTagOpenState                           50547
    entityDataState                              1673
    attributeValueSingleQuotedState              1098
    commentEndDashState                           372
    markupDeclarationOpenState                    370
    commentEndState                               364
    commentStartState                             362
    commentState                                  362
    selfClosingStartTagState                      359
    doctypePublicIdentifierDoubleQuotedState      291
    doctypeSystemIdentifierDoubleQuotedState      247
    attributeValueUnQuotedState                   191
    doctypeNameState                               32
    beforeDoctypePublicIdentifierState             16
    afterDoctypePublicIdentifierState              14
    afterDoctypeNameState                           9
    doctypeState                                    8
    beforeDoctypeNameState                          8
    afterDoctypeSystemIdentifierState               6
    afterAttributeNameState                         5
    commentStartDashState                           2
    bogusCommentState                               2

For instance, I wouldn't bother doing much tuning of the bogusCommentState. Anything called fewer than 50,000 times in this test doesn't look like it would be worth the trouble.

-- Paul

(Thanks to those who suggested pyparsing as an alternative, but I think this code is already beyond pyparsing in a few respects. For one thing, this code works with an input stream, in order to process large HTML files; pyparsing *only* works with an in-memory string.
This code can also take advantage of some performance short cuts, knowing that it is parsing HTML; pyparsing's generic classes can't do that.) -- http://mail.python.org/mailman/listinfo/python-list
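To make the tuple-lifting suggestion concrete, here is a minimal sketch. The class and token values are invented for illustration; html5lib's real emitCurrentToken differs:

```python
# Sketch of lifting a constant tuple literal out of a hot method.
# The class, method names, and token values here are hypothetical.
class Tokenizer:
    # Built once at class-definition time instead of on every call.
    TAG_TOKEN_TYPES = ("StartTag", "EndTag", "EmptyTag")

    def emit_current_token(self, token):
        if token["type"] in self.TAG_TOKEN_TYPES:
            return "tag"
        return "other"

    # The uglier variant: a default argument is bound once at function
    # definition time and then accessed via LOAD_FAST, avoiding even
    # the attribute lookup on self.
    def emit_current_token_fast(self, token,
                                _tag_types=("StartTag", "EndTag", "EmptyTag")):
        if token["type"] in _tag_types:
            return "tag"
        return "other"
```

Both variants avoid rebuilding the tuple on every call; the default-argument form additionally turns the lookup into a local-variable access, which is what the LOAD_FAST remark above refers to.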
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle wrote: Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups.

def dataState(self):
    data = self.stream.char()

    # Keep a charbuffer to handle the escapeFlag
    if self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
        if len(self.lastFourChars) == 4:
            self.lastFourChars.pop(0)
        self.lastFourChars.append(data)

    # The rest of the logic
    if data == "&" and self.contentModelFlag in\
      (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
      self.escapeFlag:
        self.state = self.states["entityData"]
    elif data == "-" and self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
      self.escapeFlag and "".join(self.lastFourChars) == "<!--":
        self.escapeFlag = True
        self.tokenQueue.append({"type": "Characters", "data": data})
    elif (data == "<" and (self.contentModelFlag ==
                           contentModelFlags["PCDATA"]
                           or (self.contentModelFlag in
                               (contentModelFlags["CDATA"],
                                contentModelFlags["RCDATA"]) and
                               self.escapeFlag == False))):
        self.state = self.states["tagOpen"]
    elif data == ">" and self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
      self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
        self.escapeFlag = False
        self.tokenQueue.append({"type": "Characters", "data": data})
    elif data == EOF:
        # Tokenization ends.
        return False
    elif data in spaceCharacters:
        # Directly after emitting a token you switch back to the data
        # state. At that point spaceCharacters are important so they are
        # emitted separately.
        self.tokenQueue.append({"type": "SpaceCharacters", "data":
          data + self.stream.charsUntil(spaceCharacters, True)})
        # No need to update lastFourChars here, since the first space will
        # have already broken any <!-- or --> sequences
    else:
        chars = self.stream.charsUntil(("&", "<", ">", "-"))
        self.tokenQueue.append({"type": "Characters", "data":
          data + chars})
        self.lastFourChars += chars[-4:]
        self.lastFourChars = self.lastFourChars[-4:]
    return True

Giving this some more thought, I'd also try to split the huge if-elif-else block like this:

    if data in string_with_all_special_characters:
        if data == '&' ...:
            ...
    else:
        ...

So there are three things to improve:

- eliminate common subexpressions which you know are constant
- split the large conditional sequence as shown above
- use separate dataState() methods when inside and outside of CDATA/RCDATA blocks and (maybe) escaped blocks

Stefan -- http://mail.python.org/mailman/listinfo/python-list
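A minimal sketch of the proposed split, with invented names and none of dataState's real logic. The point is only that ordinary characters take a single membership test instead of walking the whole elif chain:

```python
SPECIAL = "&<>-"  # the only characters that can trigger a state change

def classify(data):
    # Fast path: most characters are ordinary text and fall through
    # immediately; only members of SPECIAL enter the detailed tests.
    if data in SPECIAL:
        if data == "&":
            return "entity"
        elif data == "<":
            return "tag-open"
        else:
            return "maybe-comment"
    return "text"
```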
Re: Code that ought to run fast, but can't due to Python limitations.
Steven D'Aprano st...@remove-this-cy..e.com.au wrote: On Sun, 05 Jul 2009 10:12:54 +0200, Hendrik van Rooyen wrote: Python is not C. John Nagle is an old hand at Python. He's perfectly aware of this, and I'm sure he's not trying to program C in Python. I'm not entirely sure *what* he is doing, and hopefully he'll speak up and say, but whatever the problem is it's not going to be as simple as that.

I am well aware that John is not a newbie. He was complaining about Python's lack of a case statement in the context of a state machine. The point I was trying to make is that, in any state machine, the reasons for leaving a given state (its transitions) are always a subset of all the choices that _can_ be made. So drawing a circle round each state in the state diagram, and writing a routine that examines only the arrows leaving that circle and returns the destination of the chosen arrow, is a way of splitting the job up that makes only the relevant decisions at the time of their relevance. This is in contrast to the classic C way of making one big case statement to implement a finite state machine, which gets its efficiency (if any) out of compiler optimisations such as replacing a skip chain with a jump table. I understand that it leads to a lot of what looks like boilerplate code, but he was looking for speed... - Hendrik -- http://mail.python.org/mailman/listinfo/python-list
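A runnable toy version of this circle-drawing scheme: one function per state, each examining only its own outgoing arrows and returning the name of the next state. The two states and the "hunt for '<'" input convention here are invented purely for illustration:

```python
def hunt(stream):
    # Outgoing arrows: end of input -> done, '<' -> tag, else stay here.
    ch = next(stream, None)
    if ch is None:
        return "done"
    return "tag" if ch == "<" else "hunt"

def tag(stream):
    # Consume one tag character, then go back to hunting.
    next(stream, None)
    return "hunt"

PROTOCOL = {"hunt": hunt, "tag": tag}

def run(text):
    # Drive the machine; returns the number of state transitions taken.
    stream = iter(text)
    state = "hunt"
    steps = 0
    while state != "done":
        state = PROTOCOL[state](stream)
        steps += 1
    return steps
```

Each state routine makes only the decisions relevant to that state, exactly as described above; the cost is one Python function call per transition.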
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin http://phr...@nospam.invalid wrote: The series of tests is written that way because there is no case statement available. It is essentially switching on a bunch of character constants and then doing some additional tests in each branch. It could be that using ord(c) as an index into a list of functions might be faster than a dict lookup on c to get a function. I think John is hoping to avoid a function call and instead get an indexed jump within the Python bytecode for the big function.

I agree about the ord(c). However, avoiding the function call is, I think, still in the realm of wishful thinking. I cannot see how you could avoid a Python function call - even if he bites the bullet and implements my laborious scheme, he would still have to fetch the next character to test against, inside the current state. So if it is the function calls that are slowing him down, I cannot imagine a solution using less than one per character, in which case he is screwed no matter what he does. But wait - maybe if he passes an iterator around - the equivalent of for char in input_stream... Still no good, though, unless the next call to the iterator is faster than an ordinary Python call. - Hendrik -- http://mail.python.org/mailman/listinfo/python-list
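The ord(c) idea can be sketched like this: a 256-slot list built once, indexed directly by character code. The handler names are invented, and a real tokenizer would also need a policy for codes above 255:

```python
def handle_amp(c):
    return "entity"

def handle_lt(c):
    return "tag-open"

def handle_text(c):
    return "text"

# Build the jump table once: every slot defaults to the text handler.
JUMP = [handle_text] * 256
JUMP[ord("&")] = handle_amp
JUMP[ord("<")] = handle_lt

def dispatch(c):
    # ord(c) indexes straight into the list -- no hashing, unlike a dict.
    code = ord(c)
    handler = JUMP[code] if code < 256 else handle_text
    return handler(c)
```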
Re: Code that ought to run fast, but can't due to Python limitations.
2009/7/5 Hendrik van Rooyen m...@microcorp.co.za: I cannot see how you could avoid a python function call - even if he bites the bullet and implements my laborious scheme, he would still have to fetch the next character to test against, inside the current state. So if it is the function calls that is slowing him down, I cannot imagine a solution using less than one per character, in which case he is screwed no matter what he does.

A simple solution may be to read the whole input HTML file into a string. This potentially requires lots of memory, but I suspect that by far the most common use case for this parser is to build a DOM (or DOM-like) tree of the whole document. That tree usually requires much more memory than the HTML source itself. So, if the code duplication is acceptable, I suggest keeping this implementation for cases where the input is extremely big *AND* the whole program will work on it in streaming fashion, not just the parser itself. Then write a simpler and faster parser for the more common case, where the data is not huge *OR* the user will keep the whole document in memory anyway (e.g. in a tree). Also: profile, profile a lot. HTML pages are very strange beasts and the bottlenecks may be in innocent-looking places! -- Lino Mastrodomenico -- http://mail.python.org/mailman/listinfo/python-list
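On the "profile a lot" advice, a minimal cProfile harness; parse() here is just a stand-in for whatever parser entry point you are measuring:

```python
import cProfile
import io
import pstats

def parse(document):
    # Stand-in for the real parser entry point.
    return sum(1 for ch in document if ch == "<")

profiler = cProfile.Profile()
profiler.enable()
parse("<html><body>hello</body></html>" * 1000)
profiler.disable()

# Print the ten most expensive calls by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(10)
report = out.getvalue()
```

Sorting by "tottime" instead of "cumulative" is often more revealing for per-character hot loops like a tokenizer's.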
Re: Code that ought to run fast, but can't due to Python limitations.
In article mailman.2639.1246802753.8015.python-l...@python.org, Hendrik van Rooyen m...@microcorp.co.za wrote: But wait - maybe if he passes an iterator around - the equivalent of for char in input_stream... Still no good though, unless the next call to the iterator is faster than an ordinary python call. Calls to iterators created by generators are indeed faster than an ordinary Python call, because the stack frame is already mostly set up. -- Aahz (a...@pythoncraft.com) * http://www.pythoncraft.com/ as long as we like the same operating system, things are cool. --piranha -- http://mail.python.org/mailman/listinfo/python-list
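This is easy to check with timeit. The sketch below compares calling a plain function once per character with resuming a generator; absolute numbers vary by interpreter and version, so no particular outcome is asserted here:

```python
import timeit

def chars_func(text, i):
    # Plain function: a fresh frame is set up on every call.
    return text[i]

def chars_gen(text):
    # Generator: the frame is created once and merely resumed by next().
    for ch in text:
        yield ch

text = "x" * 1000

t_func = timeit.timeit(
    "for i in range(len(text)): chars_func(text, i)",
    globals={"chars_func": chars_func, "text": text}, number=200)

t_gen = timeit.timeit(
    "for ch in chars_gen(text): pass",
    globals={"chars_gen": chars_gen, "text": text}, number=200)
```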
Re: Code that ought to run fast, but can't due to Python limitations.
This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. I looked at it with cProfile, and the top function that comes up for a larger document (52k) is ...validator.HTMLConformanceChecker.__iter__. This method dispatches various validation routines, and it computes the method names from the input over and over again, doing lots of redundant string concatenations. It also capitalizes the element names, even though the spelling in the original document is probably not capitalized (but either upper-case or lower case). In my patch below, I create a dictionary of bound methods, indexed by (syntax) type and name, following the logic of falling back to just type-based validation if no type/name routine exists. However, in order to reduce the number of dictionary lookups, it will also cache type/name pairs (both in the original spelling, and the capitalized spelling), so that subsequent occurrences of the same element will hit the method cache. With this simple optimization, I get a 20% speedup on my test case. In my document, there are no attributes - the same changes should be made to attribute validation routines. I don't think this has anything to do with the case statement. 
Regards, Martin

diff -r 30ba63d28b1b python/src/html5lib/filters/validator.py
--- a/python/src/html5lib/filters/validator.py  Fri Jul 03 17:47:34 2009 +0300
+++ b/python/src/html5lib/filters/validator.py  Sun Jul 05 21:10:06 2009 +0200
@@ -18,6 +18,7 @@
 # Import from the sets module for python 2.3
 from sets import Set as set
 from sets import ImmutableSet as frozenset
+import re
 import _base
 import iso639codes
 import rfc3987
@@ -265,19 +266,45 @@
         self.thingsThatDefineAnID = []
         self.thingsThatPointToAnID = []
         self.IDsWeHaveKnownAndLoved = []
+        self.validate_type = {}
+        self.validate_type_name = {}
+        r = re.compile("^validate([A-Z][^A-Z]+)([A-Z][^A-Z]+)?$")
+        for name in dir(self):
+            m = r.match(name)
+            if not m: continue
+            method = getattr(self, name)
+            if m.group(2):
+                d = self.validate_type_name.setdefault(m.group(1), {})
+                d[m.group(2)] = method
+            else:
+                self.validate_type[m.group(1)] = method
 
     def __iter__(self):
-        types = dict((v,k) for k,v in tokenTypes.iteritems())
         for token in _base.Filter.__iter__(self):
-            fakeToken = {"type": types.get(token.get("type", "-"), "-"),
-                         "name": token.get("name", "-").capitalize()}
-            method = getattr(self, "validate%(type)s%(name)s" % fakeToken, None)
+            t = token.get("type", "-")
+            n = token.get("name", "-")
+            try:
+                # try original name spelling
+                method = self.validate_type_name[t][n]
+            except KeyError:
+                # try capitalization
+                cn = n.capitalize()
+                try:
+                    method = self.validate_type_name[t][cn]
+                    # also cache original spelling
+                    self.validate_type_name[t][n] = method
+                except KeyError:
+                    # No name-specific validation, try type-specific one
+                    try:
+                        method = self.validate_type[t]
+                        # cache as name-specific as well
+                        self.validate_type_name[t][cn] = method
+                        self.validate_type_name[t][n] = method
+                    except KeyError:
+                        # no validation available
+                        method = None
             if method:
                 for t in method(token) or []: yield t
-            else:
-                method = getattr(self, "validate%(type)s" % fakeToken, None)
-                if method:
-                    for t in method(token) or []: yield t
             yield token
         for t in self.eof() or []: yield t
--
http://mail.python.org/mailman/listinfo/python-list
Code that ought to run fast, but can't due to Python limitations.
As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see tokenizer.py in http://code.google.com/p/html5lib/

This is a parser for HTML 5, a piece of code that will be needed in many places and will process large amounts of data. It's written entirely in Python. Take a look at how much work has to be performed per character. This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. ("It should be written in C" is not an acceptable answer.)

Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version of the parser to decide in which order the IF statements should be executed. You shouldn't have to do that.

Yes, I've read PEP 3103. The big problem is the difficulty of figuring out what's a constant and what might change. If all the cases are constants, case statements are easy. But in Python, the compiler can't tell. Parsers have many named compile-time constants. Python doesn't support named compile-time constants, and this is one of the places where we have to pay the bill for that limitation.

Something to think about when you need three more racks of servers because the HTML parser is slow.

John Nagle -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
On Sat, Jul 4, 2009 at 1:33 PM, John Nagle na...@animats.com wrote: As an example of code that really needs to run fast, but is speed-limited by Python's limitations, see tokenizer.py in http://code.google.com/p/html5lib/ This is a parser for HTML 5, a piece of code that will be needed in many places and will process large amounts of data. It's written entirely in Python. Take a look at how much work has to be performed per character. This is a good test for Python implementation bottlenecks. Run that tokenizer on HTML, and see where the time goes. ("It should be written in C" is not an acceptable answer.) Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version of the parser to decide in which order the IF statements should be executed. You shouldn't have to do that.

If your cases are hashable, just use a dict instead of the if chain. Then you get constant-time access:

    def func_a():
        ...

    def func_b():
        ...

    def func_c():
        ...

    case = {"a": func_a, "b": func_b, "c": func_c}
    case[value]()

Yes, I've read PEP 3103. The big problem is the difficulty of figuring out what's a constant and what might change. If all the cases are constants, case statements are easy. But in Python, the compiler can't tell. Parsers have many named compile-time constants. Python doesn't support named compile-time constants, and this is one of the places where we have to pay the bill for that limitation. Something to think about when you need three more racks of servers because the HTML parser is slow. John Nagle -- http://mail.python.org/mailman/listinfo/python-list -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle wrote: [ ... ] Parsers have many named compile-time constants. Python doesn't support named compile-time constants, and this is one of the places where we have to pay the bill for that limitation. Something to think about when you need three more racks of servers because the HTML parser is slow. One technique used in such a case is to dispatch different case-handling functions via a dictionary lookup. Mel. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle na...@animats.com writes: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version of the parser to decide in which order the IF statements should be executed. You shouldn't have to do that. In that particular program it would probably be better to change those if/elif/elif/else constructs to dictionary lookups. I see the program already does that for some large tables. -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin wrote: John Nagle na...@animats.com writes: Python doesn't have a switch or case statement, and when you need a state machine with many states, that makes for painful, slow code. ... There's a comment in the code that it would be useful to run a few billion lines of HTML through an instrumented version of the parser to decide in which order the IF statements should be executed. You shouldn't have to do that. In that particular program it would probably be better to change those if/elif/elif/else constructs to dictionary lookups. I see the program already does that for some large tables. A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. This is an issue that comes up whenever you have to parse some formal structure, from XML/HTML to Pickle to JPEG images to program source. If Python could figure out what's a constant and what isn't during compilation, this sort of thing could be much more efficient. In fact, you don't even need a switch statement at the source level, if the language is such that the compiler can figure out when elif clauses are mutually exclusive. The temptation is to write tokenizers in C, but that's an admission of language design failure. (A general problem with Python is hidden dynamism. That is, changes to variables that can't be found by examining the source. This is a killer for optimizations. One could take the position that any module variable with exactly one visible assignment to it is, in fact, only assigned in one place, and if the right hand side is a constant, the variable is a constant. This would break some programs doing funny stuff with eval, or using some of the backdoor ways to modify variables, but that's very rare in practice. In return, you get the ability to hard-compile more of Python into fast code. 
I'm thinking Shed Skin here, not yet another attempt at a JIT system.) On the other hand, trying to do this in Perl, where you can't even index strings, is far worse. John Nagle -- http://mail.python.org/mailman/listinfo/python-list
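The character-class table described above might look like this in Python. This is a toy three-class table with invented classes, nothing like HTML 5's real rules:

```python
# Character classes for a toy tokenizer: one small-int class per
# character code, computed once up front.
TEXT, MARKUP, SPACE = 0, 1, 2

CLASS_TABLE = [TEXT] * 256
for c in "<>&":
    CLASS_TABLE[ord(c)] = MARKUP
for c in " \t\n\r":
    CLASS_TABLE[ord(c)] = SPACE

def char_class(ch):
    # One index operation replaces a chain of comparisons; codes above
    # 255 are lumped in with ordinary text here for simplicity.
    code = ord(ch)
    return CLASS_TABLE[code] if code < 256 else TEXT
```

In a language with a real switch statement, the small-int class would then feed an indexed jump; in Python it would typically feed a list of handler functions.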
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. Maybe you could use a regexp (and then have -two- problems...) to find the token boundaries, then a dict to identify the actual token. Tables of character classes seem a bit less attractive in the Unicode era than in the old days. -- http://mail.python.org/mailman/listinfo/python-list
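A sketch of the regexp-then-dict idea with a toy two-token grammar. The pattern and token names are invented; real HTML 5 tokenization, with its specified error handling, cannot be captured this way:

```python
import re

# One alternation finds the token boundaries; the named group that
# matched says which kind of token it was.
TOKEN_RE = re.compile(r"(?P<tag></?[A-Za-z][^>]*>)|(?P<text>[^<]+)")

def tokenize(html):
    # m.lastgroup is the name of the group that actually matched.
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(html)]
```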
Re: Code that ought to run fast, but can't due to Python limitations.
On Sat, 04 Jul 2009 16:35:08 -0700, John Nagle wrote: The temptation is to write tokenizers in C, but that's an admission of language design failure. The only part that really needs to be written in C is the DFA loop. The code to construct the state table from regexps could be written entirely in Python, but I don't see any advantage to doing so. -- http://mail.python.org/mailman/listinfo/python-list
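The DFA loop in question is tiny. Here is a table-driven sketch in Python with an invented two-state machine (inside/outside a tag); in C the same loop compiles down to an array index and an indexed jump:

```python
OUTSIDE, INSIDE = 0, 1

# TRANSITIONS[state] maps a character to the next state; characters
# not listed leave the state unchanged.
TRANSITIONS = [
    {"<": INSIDE},   # OUTSIDE
    {">": OUTSIDE},  # INSIDE
]

def final_state(text):
    state = OUTSIDE
    for ch in text:
        state = TRANSITIONS[state].get(ch, state)
    return state
```

Constructing the state table from regexps can stay in Python, as noted above; only this loop is performance-critical.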
Re: Code that ought to run fast, but can't due to Python limitations.
John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. This is an issue that comes up whenever you have to parse some formal structure, from XML/HTML to Pickle to JPEG images to program source. […] The temptation is to write tokenizers in C, but that's an admission of language design failure.

This sounds like a job for Pyparsing <URL:http://pyparsing.wikispaces.com/>.

-- \ “Better not take a dog on the space shuttle, because if he | `\ sticks his head out when you're coming home his face might burn | _o__) up.” —Jack Handey | Ben Finney -- http://mail.python.org/mailman/listinfo/python-list
Re: Code that ought to run fast, but can't due to Python limitations.
Paul Rubin wrote: John Nagle na...@animats.com writes: A dictionary lookup (actually, several of them) for every input character is rather expensive. Tokenizers usually index into a table of character classes, then use the character class index in a switch statement. Maybe you could use a regexp (and then have -two- problems...) to find the token boundaries, then a dict to identify the actual token. Tables of character classes seem a bit less attractive in the Unicode era than in the old days.

I want to see a regular expression that expresses the HTML 5 token parsing rules, including all the explicitly specified error handling.

Here's some actual code, from tokenizer.py. This is called once for each character in an HTML document, when in data state (outside a tag). It's straightforward code, but look at all those dictionary lookups.

def dataState(self):
    data = self.stream.char()

    # Keep a charbuffer to handle the escapeFlag
    if self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]):
        if len(self.lastFourChars) == 4:
            self.lastFourChars.pop(0)
        self.lastFourChars.append(data)

    # The rest of the logic
    if data == "&" and self.contentModelFlag in\
      (contentModelFlags["PCDATA"], contentModelFlags["RCDATA"]) and not\
      self.escapeFlag:
        self.state = self.states["entityData"]
    elif data == "-" and self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and not\
      self.escapeFlag and "".join(self.lastFourChars) == "<!--":
        self.escapeFlag = True
        self.tokenQueue.append({"type": "Characters", "data": data})
    elif (data == "<" and (self.contentModelFlag ==
                           contentModelFlags["PCDATA"]
                           or (self.contentModelFlag in
                               (contentModelFlags["CDATA"],
                                contentModelFlags["RCDATA"]) and
                               self.escapeFlag == False))):
        self.state = self.states["tagOpen"]
    elif data == ">" and self.contentModelFlag in\
      (contentModelFlags["CDATA"], contentModelFlags["RCDATA"]) and\
      self.escapeFlag and "".join(self.lastFourChars)[1:] == "-->":
        self.escapeFlag = False
        self.tokenQueue.append({"type": "Characters", "data": data})
    elif data == EOF:
        # Tokenization ends.
        return False
    elif data in spaceCharacters:
        # Directly after emitting a token you switch back to the data
        # state. At that point spaceCharacters are important so they are
        # emitted separately.
        self.tokenQueue.append({"type": "SpaceCharacters", "data":
          data + self.stream.charsUntil(spaceCharacters, True)})
        # No need to update lastFourChars here, since the first space will
        # have already broken any <!-- or --> sequences
    else:
        chars = self.stream.charsUntil(("&", "<", ">", "-"))
        self.tokenQueue.append({"type": "Characters", "data":
          data + chars})
        self.lastFourChars += chars[-4:]
        self.lastFourChars = self.lastFourChars[-4:]
    return True

John Nagle -- http://mail.python.org/mailman/listinfo/python-list