On Thu, Oct 23, 2008 at 8:13 AM, Antoine Pitrou <[EMAIL PROTECTED]> wrote:
> Is this kind of optimization that useful on modern CPUs? It helps remove a
> memory access to the switch/case lookup table, which should shave off the
> 3 CPU cycles of latency of a modern L1 data cache, but it won't remove the
> branch misprediction penalty of the indirect jump itself, which is more in
> the order of 10-20 CPU cycles depending on pipeline depth.

I searched around for information on how threaded code interacts with
branch prediction, and here's what I found.  The short answer is that
threaded code significantly improves branch prediction.

Any bytecode interpreter has some kind of dispatch mechanism that jumps to
the next opcode handler.  With a while (1) { switch (...) { ... } }
structure, there is a single dispatch point in the machine code, and the
processor's branch predictor has a horrible time predicting where that one
indirect jump will go next.  Here's some rough pseudo-assembly:

main_loop:
    compute next_handler
    jmp next_handler    ; abysmal branch prediction

handler1:
    ; do stuff
    jmp main_loop

handler2:
    ; do stuff
    jmp main_loop

With threaded code, every handler ends with its own dispatch, so the
processor can make fine-grained predictions.  Since each opcode has its
own indirect branch instruction, the processor can track them separately
and make better predictions (e.g., it can figure out that opcode X is
often followed by opcode Y).

    compute next_handler
    jmp next_handler    ; executed only once

handler1:
    ; do stuff
    compute next_handler
    jmp next_handler

handler2:
    ; do stuff
    compute next_handler
    jmp next_handler
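For the curious, here is a minimal compilable sketch of the same idea in C,
using GCC's computed-goto extension (labels as values).  The opcodes and
handlers are made up for illustration; this is not CPython's actual
bytecode, just the dispatch shape:

#include <stdio.h>

enum { OP_INC, OP_DEC, OP_HALT };

int run(const unsigned char *code)
{
    /* GCC extension: &&label takes the address of a label, and
       "goto *ptr" jumps to it.  One table entry per opcode. */
    static void *dispatch[] = { &&op_inc, &&op_dec, &&op_halt };
    const unsigned char *pc = code;
    int acc = 0;

    /* DISPATCH() is duplicated at the end of every handler, so each
       opcode gets its own indirect branch instruction for the branch
       predictor to track.  (An optimizing compiler may try to merge
       the duplicates, which is why real interpreters sometimes pass
       flags such as -fno-gcse.) */
#define DISPATCH() goto *dispatch[*pc++]

    DISPATCH();    /* executed only once */

op_inc:
    acc++;
    DISPATCH();
op_dec:
    acc--;
    DISPATCH();
op_halt:
    return acc;
}

int main(void)
{
    const unsigned char prog[] = { OP_INC, OP_INC, OP_DEC, OP_HALT };
    printf("%d\n", run(prog));    /* prints 1 */
    return 0;
}

Compare this with the switch-based loop: there, the single indirect jump
at the top of the loop has to predict every opcode transition; here, the
indirect jump at the end of op_inc only has to predict what tends to
follow OP_INC.

--
Daniel Stutzbach, Ph.D.
President, Stutzbach Enterprises, LLC
<http://stutzbachenterprises.com>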