On Monday 02 November 2009, Laurent Desnogues wrote:
> That indeed looks strange: fixing the TB chaining on ARM
> made nbench i386 three times faster. Note the gain was
> less for FP parts of the benchmark due to the use of
> helpers.
>
> Out of curiosity, could you post your tb_set_jmp_target1
> function?
I'm on an AMD64 host, so it's the same code as in mainline.

> The only thing I can think of at the moment that
> could make the code slower is that the program you ran
> was not reusing blocks and/or cache flushing in
> tb_set_jmp_target1 is overkill.

There is no cache flushing in the AMD64 tb_set_jmp_target1() function, and the polarssl test suite is by nature rather repetitive.

I did some experiments, and it seems that disabling TB chaining (by emptying tb_set_jmp_target()) has no measurable impact on performance on AMD64. I tested several CPU-intensive programs (md5sum and the like) under AMD64-on-AMD64 userspace emulation (qemu-x86_64), and the difference in performance with and without TB chaining is hardly measurable. The chaining is performed as advertised when enabled (I checked that), but it does not seem to help performance.

How is this possible? Could it be related to cache size? I suspect my Phenom 9500 is better equipped in that area than the average ARM controller.

And does TB chaining actually work on AMD64 at all? I checked by adding some debug output, and it seems to patch the jumps correctly, but maybe somebody can verify that.

CU

Uli

--
SUSE LINUX Products GmbH, GF: Markus Rex, HRB 16746 (AG Nürnberg)