Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
> Since fxsave leaves the FPU state intact, there ought to be a better way > to do this but it gets tricky. Maybe using the TSC to put a timestamp > in every thread save area? > > when saving FPU state: > put cpu# and timestamp in thread state info > also store timestamp in per-cpu data > > on task switch: > compare cpu# and timestamps for next task > if equal, clear TS and set TS_USEDFPU > > when state becomes invalid for some reason: > zero cpu's timestamp > > But the extra overhead might be too much in many cases. Simpler: - Thread has "CPU that I last used FPU on" pointer. Never NULL. - Each CPU has "thread whose FPU state I hold" pointer. May be NULL. When *loading* FPU state: - Set up both pointers. On task switch: - If the pointers point to each other, then clear TS and skip restore. ("Preloaded") When state becomes invalid (kernel MMX use, or whatever) - Set CPU's pointer to NULL. On thread creation: - If current CPU's thread pointer points to the newly allocated thread, clear it to NULL. - Set thread's CPU pointer to current CPU. The UP case just omits the per-thread CPU pointer. (Well, stores it in zero bits.) An alternative SMP thread-creation case would be to have a NULL value for the thread-to-CPU pointer and initialize the thread's CPU pointer to that, but that then complicates the UP case. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Sat, 23 Jul 2005 at 10:38:40 -0700 (PDT), Linus Torvalds wrote: > On Sat, 23 Jul 2005, Chuck Ebbert wrote: > > > > This patch (may not apply cleanly to unpatched -rc3) drops overhead > > a few more percent: > > That really is pretty ugly. > > I'd rather hope for something like function-sections to just make games > like this be unnecessary. The link stage might cause meltdown on typical machines, though. __ Chuck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Tue, 26 Jul 2005, Chuck Ebbert wrote: > > Since fxsave leaves the FPU state intact, there ought to be a better way to > do > this but it gets tricky. Maybe using the TSC to put a timestamp in every > thread > save area? We used to have totally lazy FP saving, and not toucht he FP state at _all_ in the scheduler except to just set the TS bit. It worked wonderfully well on UP, but getting it working on SMP is a major pain, since the lazy state you want to switch back into might be cached on some other CPU's registers, so we never did it on SMP. Eventually it got too painful to maintain two totally different logical code-paths between UP and SMP, and some bug or other ended up resulting in the current "lazy on a time slice level" thing which works well in SMP too. Also, a lot of the cost is really the save, and before SSE2 the fnsave would clear the FPU state, so you couldn't just do a save and try to elide just the restore in the lazy case. In SSE2 (with fxsave) we _could_ try to do that, but the thing is, I doubt it really helps. First off, 99% of all programs don't hit the nasty case at all, and for something broken like volanomark that _does_ hit it, I bet that there i smore than one thread using the FP, so you can't just cache the FP state in the CPU _anyway_. So we could enhance the current state by having a "nonlazy" mode like in the example patch, except we'd have to make it a dynamic flag. Which could either be done by explicitly marking binaries we want to be non-lazy, or by just dynamically noticing that the rate of FP restores is very high. Does anybody really care about volanomark? Quite frankly, I think you'd see a _lot_ more performance improvement if you could instead teach the Java stuff not to use FP all the time, so it feels a bit like papering over the _real_ bug if we'd try to optimize this abnormal and silly case in the kernel. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Fri, 22 Jul 2005 at 11:13:21 -0700 (PDT), Linus Torvalds wrote: > Something like the following (totally untested) should make it be > non-lazy. It's going to slow down normal task switches, but might speed up > the "restoring FP context all the time" case. > > Chuck? This should work fine with or without your inline thing. Does it > make any difference? Now that I am looking at the right output file -- yes, it does (after fixing one small bug.) Volanomark set a new record of 6125 messages/sec and the profile shows almost zero hits in math_state_restore. Since fxsave leaves the FPU state intact, there ought to be a better way to do this but it gets tricky. Maybe using the TSC to put a timestamp in every thread save area? when saving FPU state: put cpu# and timestamp in thread state info also store timestamp in per-cpu data on task switch: compare cpu# and timestamps for next task if equal, clear TS and set TS_USEDFPU when state becomes invalid for some reason: zero cpu's timestamp But the extra overhead might be too much in many cases. (Below is what I changed in the original patch, if anyone wants to try it.) + if (tsk_used_math(next_p)) should be: + if (!tsk_used_math(next_p)) __ Chuck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
Linus Torvalds wrote: On Fri, 22 Jul 2005, Adrian Bunk wrote: If this patch makes a difference, could you do me a favour and check whether replacing the current cpu_has_fxsr #define in include/asm-i386/cpufeature.h with #define cpu_has_fxsr 1 on top of your patch brings an additional improvement? It would be really sad if it made a difference. There might be a branch mispredict, but the real expense of the fnsave/fxsave will be that instruction itself, and any cache misses associated with it. The 9% performace difference would almost have to be due to a memory bank conflict or something (likely some unnecessary I$ prefetching that interacts badly with the writeback needed for the _big_ memory write forced by the fxsave). I can't see any way that a single branch mispredict could make that big of a difference, but I _can_ see how bad memory access patterns could do it. Btw, the switch from fnsave to fxsave (and thus the change from a 112-byte save area to a 512-byte one, or whatever the exact details are) caused _huge_ performance degradation for various context switching benchmarks. I really hated that, but obviously the need to support SSE2 made it non-optional. The point being that the real overhead is that big memory read/write in fxrestor/fxsave. What _could_ make a bigger difference is not doing the lazy FPU at all. That lazy FPU is a huge optimization on 99.9% of all loads, but it sounds like java/volanomark are broken and always use the FPU, and then we take a big hit on doing the FP restore exception (an exception is a lot more expensive than a mispredict). It seems expensive to do the save/restore when it isn't needed, that's why the code got lazy. Would it be useful to have a small flag or count field and start by assuming that FPU is not used, and if the exception takes place set the count to unconditionally save the FP state for some number of context switches and then try reverting to lazy save? That 99.9% may be a guess, but I suspect that there are a lot of applications which alternate between using FPU and not, even if they do use FPU for some parts of the application. That way the performance of lazy save would be realized for the common applications, and the overhead of exception would be greatly reduced for both the ill-behaved and legitimately FPU intensive application. Something like the following (totally untested) should make it be non-lazy. It's going to slow down normal task switches, but might speed up the "restoring FP context all the time" case. Chuck? This should work fine with or without your inline thing. Does it make any difference? -- -bill davidsen ([EMAIL PROTECTED]) "The secret to procrastination is to put things off until the last possible moment - but no longer" -me - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Sat, 23 Jul 2005, Arjan van de Ven wrote: > > For the kernel the runtime measurement is obviously tricky (kgprof > anyone?) Ehh, doesn't "opgprof" do that already? Anyway, judging by real life, people just don't _do_ profile runs. Any build automation that depends on the user profiling the result is a total wet dream by compiler developers - it just doesn't happen in real life. So: > but the static analysis method really shouldn't be too hard. I really think that the static analysis is the only one that actually would matter in real life. It's also the only one that is repeatable, which is a big big bonus, since at least that way different people are running basically the same code (heisenbugs, anyone?) and benchmarks are actually meaningful, since they don't depend on whatever load was picked to generate the layout. So dynamic analysis with dynamic profile data is just one big theoretical compiler-writer masturbation session, in my not-so-humble opinion. I bet static analysis (with possibly some hints from the programmer, the same way we use likely/unlikely) gets you 90% of the way, without all the downsides. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Sat, 2005-07-23 at 10:38 -0700, Linus Torvalds wrote: > So maybe a few hints to the binutils people might just make them go: "try > this patch/cmdline flag", and solve many more problems. They likely have a > lot of this kind of code _already_, or have at least been thinking about > it. > > I personally believe that there's likely a lot more to be had from code > (and data) layout than there is from things like alias analysis or > aggressive inlining. for userland it's not that complex and exists already; basically gprof has this analysis capability, and from that there's tooling (well I wrote it and I'm sure others did too) to create a linker script automatically to order things according to their gprof desired order. This gprof based approach is taking actual runtime patterns into account not just static callgraph analysis. For the kernel the runtime measurement is obviously tricky (kgprof anyone?) but the static analysis method really shouldn't be too hard. (well I guess "optimal" will be NP complete, but "pretty damn close" ought to be reasonable) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Sat, 23 Jul 2005, Chuck Ebbert wrote: > > This patch (may not apply cleanly to unpatched -rc3) drops overhead > a few more percent: That really is pretty ugly. I'd rather hope for something like function-sections to just make games like this be unnecessary. The linker really should be able to do things like this for us (ie if it sees that the only reference to a function is from another function, it should be able to just put them next to each other anyway). And yes, I realize that "should be able" isn't the same thing as "does", but the thing is, doing it by hand ends up being a maintenance problem in the long run. It also misses all the other cases where it might be beneficial, but where you don't happen to run the right benchmark or look at the right place. So maybe a few hints to the binutils people might just make them go: "try this patch/cmdline flag", and solve many more problems. They likely have a lot of this kind of code _already_, or have at least been thinking about it. I personally believe that there's likely a lot more to be had from code (and data) layout than there is from things like alias analysis or aggressive inlining. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Sat, 23 Jul 2005, Chuck Ebbert wrote: > >Probably a very minor point, but shouldn't it be > > "m" ((tsk)->thread.i387.fxsave)) > > because that's the largest possible operand that could end up being read? Yes, I ended up fixing that already (along with handling the fxsave case too). > Is that function called __save_init_fpu because it saves and then > initializes the fpu? Unlike fsave, fxsave does not init the fpu after > it saves the state; to make the behavior match it should be "fxsave ; fninit" The "init" part is really just "reset_exceptions" - we don't care about the rest of the state, and the naming is historical (ie it's really called "init" because "fnsave" does the reset by re-initializing everything, ie there's an implied "fninit"). So for the fxsave path, we just use fnclex, since that's faster than a full fninit. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
Andrew Morton <[EMAIL PROTECTED]> writes: > > We do have the `used_math' optimisation in there which attempts to avoid > doing the FP save/restore if the app isn't actually using math. But > there's code in glibc startup which always does a > bit of float, so that optimisation is always defeated. There was some > discussion about periodically setting tasks back into !used_math state to > try to restore the optimisation for tasks which only do a little bit of FP, > but nothing actually got done. Actually we reset the flag on every context switch, so that works just fine. But I was considering to do it less often so that we switch the FP state non lazily for FP intensive processes and avoid the overhead of all these exceptions. -Andi P.S.: Original profile data looks a bit fishy. Normally avoiding a single function call should not make tht much difference unless you call it in a inner loop, but that is not the case here. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Fri, 22 Jul 2005 at 13:27:56 +1000, Andrew Morton wrote: > Chuck Ebbert <[EMAIL PROTECTED]> wrote: > > After: > > > > 12534 math_state_restore 130.5625 > > 8354 device_not_available 203.7561 > > - > > 20888 > > Next step should be to physically place math_state_restore() after > > device_not_available(). Would such a patch be accepted? (Yes it > > would be ugly and require linker script changes.) > > Depends on the benefit/ugly ratio ;) This patch (may not apply cleanly to unpatched -rc3) drops overhead a few more percent: 11835 math_state_restore 144.3293 8096 device_not_available 197.4634 - 19931 Most of the exception/interrupt handlers should get the same treatment but the linker script would grow. Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]> Index: 2.6.13-rc3a/arch/i386/kernel/entry.S === --- 2.6.13-rc3a.orig/arch/i386/kernel/entry.S 2005-07-16 18:09:48.0 -0400 +++ 2.6.13-rc3a/arch/i386/kernel/entry.S2005-07-22 05:19:22.0 -0400 @@ -469,6 +469,7 @@ pushl $do_simd_coprocessor_error jmp error_code + .section ".text.i387_1", "a" ENTRY(device_not_available) pushl $-1 # mark this as an int SAVE_ALL @@ -484,6 +485,8 @@ addl $4, %esp jmp ret_from_exception + .previous + /* * Debug traps and NMI can happen at the one SYSENTER instruction * that sets up the real kernel stack. Check here, since we can't Index: 2.6.13-rc3a/arch/i386/kernel/traps.c === --- 2.6.13-rc3a.orig/arch/i386/kernel/traps.c 2005-07-16 18:05:40.0 -0400 +++ 2.6.13-rc3a/arch/i386/kernel/traps.c2005-07-22 14:02:50.0 -0400 @@ -972,7 +972,8 @@ * Must be called with kernel preemption disabled (in this case, * local interrupts are disabled at the call-site in entry.S). */ -asmlinkage void math_state_restore(struct pt_regs regs) +asmlinkage void __attribute__((__section__(".text.i387_2"))) +math_state_restore(struct pt_regs regs) { struct thread_info *thread = current_thread_info(); struct task_struct *tsk = thread->task; Index: 2.6.13-rc3a/arch/i386/kernel/vmlinux.lds.S === --- 2.6.13-rc3a.orig/arch/i386/kernel/vmlinux.lds.S 2005-07-16 18:05:39.0 -0400 +++ 2.6.13-rc3a/arch/i386/kernel/vmlinux.lds.S 2005-07-22 05:20:30.0 -0400 @@ -20,6 +20,8 @@ _text = .; /* Text and read-only data */ .text : AT(ADDR(.text) - LOAD_OFFSET) { *(.text) + *(.text.i387_1) + *(.text.i387_2) SCHED_TEXT LOCK_TEXT *(.fixup) __ Chuck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Date: Fri, 22 Jul 2005 at 16:19:25 -0700 (PDT), Linus Torvalds wrote: > /* >* The "nop" is needed to make the instructions the same >* length. >*/ > #define restore_fpu(tsk)\ > alternative_input( \ > "nop ; frstor %1", \ > "fxrstor %1", \ > X86_FEATURE_FXSR, \ > "m" ((tsk)->thread.i387.fsave)) Probably a very minor point, but shouldn't it be "m" ((tsk)->thread.i387.fxsave)) because that's the largest possible operand that could end up being read? > Now, we should do the same for "fnsave ; fwait" vs "fxsave ; fnclex" too, > but I was lazy. If somebody wants to try that, it would need an > "alternative_output()" define but should otherwise be trivial too. Is that function called __save_init_fpu because it saves and then initializes the fpu? Unlike fsave, fxsave does not init the fpu after it saves the state; to make the behavior match it should be "fxsave ; fninit" __ Chuck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Thu, 21 Jul 2005, Chuck Ebbert wrote: > > This patch makes restore_fpu() an inline. When L1/L2 cache are saturated > it makes a measurable difference. I've now pushed out an alternative fix for this, which imho is much cleaner. We've long had infrastructure for "alternative asm instructions" that are conditional on CPU features, and I just made restore_fpu() use that instead: /* * The "nop" is needed to make the instructions the same * length. */ #define restore_fpu(tsk)\ alternative_input( \ "nop ; frstor %1", \ "fxrstor %1", \ X86_FEATURE_FXSR, \ "m" ((tsk)->thread.i387.fsave)) which not only makes it inline, but makes it a single instruction instead of the mess it was before. Now, we should do the same for "fnsave ; fwait" vs "fxsave ; fnclex" too, but I was lazy. If somebody wants to try that, it would need an "alternative_output()" define but should otherwise be trivial too. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Fri, 22 Jul 2005, Adrian Bunk wrote: > > If this patch makes a difference, could you do me a favour and check > whether replacing the current cpu_has_fxsr #define in > include/asm-i386/cpufeature.h with > > #define cpu_has_fxsr 1 > > on top of your patch brings an additional improvement? It would be really sad if it made a difference. There might be a branch mispredict, but the real expense of the fnsave/fxsave will be that instruction itself, and any cache misses associated with it. The 9% performace difference would almost have to be due to a memory bank conflict or something (likely some unnecessary I$ prefetching that interacts badly with the writeback needed for the _big_ memory write forced by the fxsave). I can't see any way that a single branch mispredict could make that big of a difference, but I _can_ see how bad memory access patterns could do it. Btw, the switch from fnsave to fxsave (and thus the change from a 112-byte save area to a 512-byte one, or whatever the exact details are) caused _huge_ performance degradation for various context switching benchmarks. I really hated that, but obviously the need to support SSE2 made it non-optional. The point being that the real overhead is that big memory read/write in fxrestor/fxsave. What _could_ make a bigger difference is not doing the lazy FPU at all. That lazy FPU is a huge optimization on 99.9% of all loads, but it sounds like java/volanomark are broken and always use the FPU, and then we take a big hit on doing the FP restore exception (an exception is a lot more expensive than a mispredict). Something like the following (totally untested) should make it be non-lazy. It's going to slow down normal task switches, but might speed up the "restoring FP context all the time" case. Chuck? This should work fine with or without your inline thing. Does it make any difference? Linus - diff --git a/arch/i386/kernel/process.c b/arch/i386/kernel/process.c --- a/arch/i386/kernel/process.c +++ b/arch/i386/kernel/process.c @@ -678,8 +678,16 @@ struct task_struct fastcall * __switch_t struct tss_struct *tss = &per_cpu(init_tss, cpu); /* never put a printk in __switch_to... printk() calls wake_up*() indirectly */ - - __unlazy_fpu(prev_p); + if (prev_p->thread_info->status & TS_USEDFPU) { + __save_init_fpu(prev_p); + if (tsk_used_math(next_p)) + init_fpu(next_p); + restore_fpu(next_p); + next_p->thread_info->status |= TS_USEDFPU; + } else { + stts(); + next_p->thread_info->status &= ~TS_USEDFPU; + } /* * Reload esp0, LDT and the page table pointer: @@ -710,13 +718,13 @@ struct task_struct fastcall * __switch_t * Now maybe reload the debug registers */ if (unlikely(next->debugreg[7])) { - set_debugreg(current->thread.debugreg[0], 0); - set_debugreg(current->thread.debugreg[1], 1); - set_debugreg(current->thread.debugreg[2], 2); - set_debugreg(current->thread.debugreg[3], 3); + set_debugreg(next->debugreg[0], 0); + set_debugreg(next->debugreg[1], 1); + set_debugreg(next->debugreg[2], 2); + set_debugreg(next->debugreg[3], 3); /* no 4 and 5 */ - set_debugreg(current->thread.debugreg[6], 6); - set_debugreg(current->thread.debugreg[7], 7); + set_debugreg(next->debugreg[6], 6); + set_debugreg(next->debugreg[7], 7); } if (unlikely(prev->io_bitmap_ptr || next->io_bitmap_ptr)) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
> > We do have the `used_math' optimisation in there which attempts to avoid > > doing the FP save/restore if the app isn't actually using math. > > No, it's more than that. There's a per-processor "used_math" flag to > determine if we need to _initialize_ the FPU, but on context switches we > always assume the program we're switching to will _not_ use FP, and we > just set the "fault on FP" flag and do not normally restore FP state. This shows room for optimization; if an app is consistently faulting to use FP after a context switch, in principle the kernel could start to assume that it will in the next timeslice as well. > On the other hand, I also wouldn't be surprised if glibc (or similar I doubt glibc is normally, at least most distros don't ship an SSE enabled glibc, only an "i686" one. - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Fri, 22 Jul 2005 at13:27:56 +1000, Andrew Morton wrote: > hm. What context switch rate is that thing doing? 23000 - 25000 / second. I guess that explains why my attempt to duplicate this in C failed -- it switches at 15-19 times per second: /* i387 context switch benchmark *compile this program without optimization */ #include #include #include #include #include #include int lo = 0, fp = 0, af = 0; long i, iters; pid_t child; cpu_set_t cpuset; void handler(int sig) { lo = 0; /* stop looping */ } struct sigaction sa = { .sa_handler = handler, }; main(int argc, char *argv[]) { if (argc < 2) /* mandatory loop iteration count */ exit(1); iters = atol(argv[1]); if (argc > 2) /* loop in parent while waiting for child to exit */ lo = atoi(argv[2]); if (argc > 3) /* use FP while wait-looping */ fp = atoi(argv[3]); if (argc > 4) /* use cpu affinity -- all code runs on cpu 0 */ af = atoi(argv[4]); __CPU_SET(0, &cpuset); child = fork(); if (child) { /* parent */ if (af) sched_setaffinity(0, &cpuset); if (lo) { sigaction(SIGCHLD, &sa, NULL); if (fp) __asm__ __volatile__("fld1" "\n\t" "fldz"); while (lo) if (fp) __asm__ __volatile__( "fadd %st(1), %st(0)"); } else wait(NULL); } else { /* child */ if (af) sched_setaffinity(0, &cpuset); __asm__ __volatile__("fld1" "\n\t" "fldz"); for (i = 0; i < iters; i++) __asm__ __volatile__("fadd %st(1), %st(0)"); } } > Is the benchmark actually doing floating point stuff? They only release .class files. ISTR java does FP math even when developers think they are doing integer math but I can't find any references. __ Chuck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Thu, Jul 21, 2005 at 11:06:22PM -0400, Chuck Ebbert wrote: > > This patch makes restore_fpu() an inline. When L1/L2 cache are saturated > it makes a measurable difference. > > Results from profiling Volanomark follow. Sample rate was 2000 samples/sec > (HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon. > > > Before: > > 10680 restore_fpu 333.7500 > 8351 device_not_available 203.6829 > 3823 math_state_restore59.7344 > - > 22854 > > > After: > > 12534 math_state_restore 130.5625 > 8354 device_not_available 203.7561 > - > 20888 > > > Patch is "obviously correct" and cuts 9% of the overhead. Please apply. >... If this patch makes a difference, could you do me a favour and check whether replacing the current cpu_has_fxsr #define in include/asm-i386/cpufeature.h with #define cpu_has_fxsr 1 on top of your patch brings an additional improvement? > Chuck TIA Adrian -- "Is there not promise of rain?" Ling Tan asked suddenly out of the darkness. There had been need of rain for many days. "Only a promise," Lao Er said. Pearl S. Buck - Dragon Seed - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
On Fri, 22 Jul 2005, Andrew Morton wrote: > > Is the benchmark actually doing floating point stuff? It must be. We still do lazy FP saves. > We do have the `used_math' optimisation in there which attempts to avoid > doing the FP save/restore if the app isn't actually using math. No, it's more than that. There's a per-processor "used_math" flag to determine if we need to _initialize_ the FPU, but on context switches we always assume the program we're switching to will _not_ use FP, and we just set the "fault on FP" flag and do not normally restore FP state. It seems volanomark will always use FP, if this is the hot path. We'll only save the FP context if the thread has used the FP in _that_ particular time-slice (TS_USEDFPU). As to why volanomark also uses FP, I don't know. I wouldn't be surprised if the benchmark was designed by somebody to not benefit from the x87 state save optimization. On the other hand, I also wouldn't be surprised if glibc (or similar system libraries) is over-eagerly using things like SSE instructions for memcopies etc, not realizing that they can have serious downsides. I don't see why volanomark would use FP, but hey, it's billed as a java VM and thread torture test for "chatrooms". Whatever. Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [patch 2.6.13-rc3a] i386: inline restore_fpu
Chuck Ebbert <[EMAIL PROTECTED]> wrote: > > > This patch makes restore_fpu() an inline. When L1/L2 cache are saturated > it makes a measurable difference. > > Results from profiling Volanomark follow. Sample rate was 2000 samples/sec > (HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon. > > > Before: > > 10680 restore_fpu 333.7500 > 8351 device_not_available 203.6829 > 3823 math_state_restore59.7344 > - > 22854 > > > After: > > 12534 math_state_restore 130.5625 > 8354 device_not_available 203.7561 > - > 20888 > > > Patch is "obviously correct" and cuts 9% of the overhead. Please apply. hm. What context switch rate is that thing doing? Is the benchmark actually doing floating point stuff? We do have the `used_math' optimisation in there which attempts to avoid doing the FP save/restore if the app isn't actually using math. But there's code in glibc startup which always does a bit of float, so that optimisation is always defeated. There was some discussion about periodically setting tasks back into !used_math state to try to restore the optimisation for tasks which only do a little bit of FP, but nothing actually got done. > Next step should be to physically place math_state_restore() after > device_not_available(). Would such a patch be accepted? (Yes it > would be ugly and require linker script changes.) Depends on the benefit/ugly ratio ;) - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
[patch 2.6.13-rc3a] i386: inline restore_fpu
This patch makes restore_fpu() an inline. When L1/L2 cache are saturated it makes a measurable difference. Results from profiling Volanomark follow. Sample rate was 2000 samples/sec (HZ = 250, profile multiplier = 8) on a dual-processor Pentium II Xeon. Before: 10680 restore_fpu 333.7500 8351 device_not_available 203.6829 3823 math_state_restore59.7344 - 22854 After: 12534 math_state_restore 130.5625 8354 device_not_available 203.7561 - 20888 Patch is "obviously correct" and cuts 9% of the overhead. Please apply. Next step should be to physically place math_state_restore() after device_not_available(). Would such a patch be accepted? (Yes it would be ugly and require linker script changes.) Signed-off-by: Chuck Ebbert <[EMAIL PROTECTED]> Index: 2.6.13-rc3a/arch/i386/kernel/i387.c === --- 2.6.13-rc3a.orig/arch/i386/kernel/i387.c +++ 2.6.13-rc3a/arch/i386/kernel/i387.c @@ -82,17 +82,6 @@ } EXPORT_SYMBOL_GPL(kernel_fpu_begin); -void restore_fpu( struct task_struct *tsk ) -{ - if ( cpu_has_fxsr ) { - asm volatile( "fxrstor %0" - : : "m" (tsk->thread.i387.fxsave) ); - } else { - asm volatile( "frstor %0" - : : "m" (tsk->thread.i387.fsave) ); - } -} - /* * FPU tag word conversions. */ Index: 2.6.13-rc3a/include/asm-i386/i387.h === --- 2.6.13-rc3a.orig/include/asm-i386/i387.h +++ 2.6.13-rc3a/include/asm-i386/i387.h @@ -22,11 +22,20 @@ /* * FPU lazy state save handling... */ -extern void restore_fpu( struct task_struct *tsk ); - extern void kernel_fpu_begin(void); #define kernel_fpu_end() do { stts(); preempt_enable(); } while(0) +static inline void restore_fpu( struct task_struct *tsk ) +{ + if ( cpu_has_fxsr ) { + asm volatile( "fxrstor %0" + : : "m" (tsk->thread.i387.fxsave) ); + } else { + asm volatile( "frstor %0" + : : "m" (tsk->thread.i387.fsave) ); + } +} + /* * These must be called with preempt disabled */ __ Chuck - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/