On Sun, Mar 22, 2015 at 09:41:53AM +1100, Bruce Evans wrote:
> On Sat, 21 Mar 2015, John Baldwin wrote:
> 
> > On 3/21/15 12:35 PM, Konstantin Belousov wrote:
> X static int
> X popcnt_pc_map_elem(uint64_t elem)
> X {
> X     int count;
> X 
> X     /*
> X      * This simple method of counting the one bits performs well because
> X      * the given element typically contains more zero bits than one bits.
> X      */
> X     count = 0;
> X     for (; elem != 0; elem &= elem - 1)
> X             count++;
> X     return (count);
> X }
> 
...
> 
> Perhaps this appeared to perform well not for the reason stated in its
> comment, but because it was tested with excessive CFLAGS, and the compiler
> actually reduced it to popcntq, and the test machine didn't have the
> bug.
No, it probably did not, since popcntq does not exist on the generic
amd64 machine.

> 
> In a game program that I wrote, efficiency of popcount() was actually
> important since it was used in inner loops.  I used lookup tables for
> popcount() and a couple of application-specific functions and even for
> fls().  This seemed to be better than the existing bsf instruction,
> so it can't be bad for popcount().  The lookup was limited to bitsets
> of length 13 in the low bits of an int, so the table sizes were only
> 2**13 = 8192.  bitsets of length 64 would need multiple steps.  However,
> if the above performs well without compiler optimizations, then it must
> be because usually only the lowest few bits are set.  Then it would
> perform even better with even smaller (8-bit) tables combined with an
> early exit:
> 
>       count = 0;
>       for (; elem != 0; elem >>= 8)
>       count += lookup[elem & 0xff];
>       return (count);
> 
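For reference, a self-contained version of that byte-table variant could look
like the sketch below; the table name and its init routine are made up for
illustration, only the loop structure comes from the quoted code.

	/* Needs <stdint.h>, or <sys/types.h> in the kernel, for the fixed-width types. */
	static uint8_t popcnt8_tab[256];	/* popcnt8_tab[i] == number of one bits in i */

	static void
	popcnt8_tab_init(void)
	{
		int i;

		for (i = 1; i < 256; i++)
			popcnt8_tab[i] = (i & 1) + popcnt8_tab[i >> 1];
	}

	static int
	popcnt_lookup(uint64_t elem)
	{
		int count;

		count = 0;
		for (; elem != 0; elem >>= 8)
			count += popcnt8_tab[elem & 0xff];
		return (count);
	}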
> Many variations on this are possible.  E.g., use fls() to decide where to
> start; or avoid all branches, and add up the results in parallel.  For
> bitcount32, the latter is:
> 
>       bitcount64(x) := bitcount32(x & 0xffffffff) + bitcount32(x >> 32);
>       bitcount32(x) := bitcount16(x & 0xffff) + bitcount16(x >> 16);
>       bitcount16(x) := bitcount8(x & 0xff) + bitcount8(x >> 8);
>       bitcount8(x) := lookup[x];
> 
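The branch-free "add up the results in parallel" variant is essentially the
classic SWAR popcount; FreeBSD's __bitcount64() mentioned below is along these
lines.  A minimal sketch (the function name here is mine, not the tree's):

	static int
	bitcount64_swar(uint64_t x)
	{

		x = x - ((x >> 1) & 0x5555555555555555ULL);	/* 2-bit sums */
		x = (x & 0x3333333333333333ULL) +
		    ((x >> 2) & 0x3333333333333333ULL);		/* 4-bit sums */
		x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;	/* per-byte sums */
		return ((x * 0x0101010101010101ULL) >> 56);	/* add the 8 bytes */
	}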
> Compilers won't be able to optimize the lookup methods.  They might be
> able to convert the current bitcount*() to popcnt*.  Last time I looked,
> clang but not gcc converted very complicated expressions for bswap*()
> to bswap* instructions.
> 
> >>> There's no way we can preemptively locate every bit of C that clang might
> >>> decide to replace with popcount and fix them to employ a workaround.  I think
> >>> the only viable solution is to use "-mno-popcount" on all compilers that aren't
> >>> known to be fixed.
> 
> Why bother?  It is only a short-lived and small(?) efficiency bug.
> 
> >> Both you and Bruce and Jilles state this, but I am unable to understand the
> >> argument about the compiler.  Can somebody explain this to me, please ?
> 
> It is also to avoid complications for short-lived optimizations.  The
> 3 uses of popcntq() in amd64 pmap cost:
> - 9 lines of inline asm.
> - 19 lines for the C version
> - 9 lines instead of 3 for the 3 uses
> When popcntq is not available, the result is a pessimization since the
> optimizations are not complicated enough to be useful in that case.
popcnt is not available on machines older than Nehalem.
Nobody cares about the last bits of speed for such CPUs.

> They give the runtime overhead of a branch to call the C version that
> is presumably worse than using the existing bitcount32() (possibly
> twice) or the compiler builtin.
>    (jhb's cleanups didn't bother to optimize to use the builtin in
>    all cases, since it is hard to tell if the builtin is any good if
>    it is in software.  If you remove the 19 lines for
>    popcnt_pc_map_elem() and replace calls to it by __builtin_popcountl(),
>    then the results would be:
>    - compile-time failure for old unsupported compilers that don't have this
>      builtin
>    - usually, a link-time failure for gcc-4.2 through gcc-4.8.  gcc
>      generates a call to __popcountdi2 unless CFLAGS enables the hardware
>      instruction.  __popcountdi2 is in libgcc but not in libkern.
>    - usually, 3 inline copies of the same code that would be produced if
>      FreeBSD's bitcount64() were used for clang.  clang unrolls things
>      excessively, giving enormous code.  Unrolling allows it to load the
>      large constants in __bitcount64() only once, but large code is still
>      needed to use these constants.
>    - runtime failures in misconfigured cases where CFLAGS doesn't match
>      the hardware.  Both gcc-4.8 and clang produce popcntq.  The runtime
>      check doesn't help since the compilers produce popcntq for the C
>      case.  Using __builtin_popcountl() asks for this directly.  Using
>      __bitcount64() asks for it indirectly, and gets it for clang.
>      Using popcnt_pc_map_elem() may avoid getting it for clang, but
>      apparently still gets it.)
> When popcntq is available, the test to decide whether to use it is a
> small pessimization.  It might take longer than a branch-free software
> method.  This is unlikely since there are 3 popcounts per test.
Yes, modulo the excessive dependency CPU bug.

> 
> > If you compile with -mpopcount (or a march that includes it like -march=corei7)
> > then the compiler will assume that popcount is available and will use it without
> > doing any runtime checks.  In the case of my sandybridge desktop, I have
> > CPUTYPE set to "corei7-avx" in /etc/make.conf which adds "-march=corei7-avx"
> > to CFLAGS, so the compiler assumes it can use POPCOUNT (as well as newer
> > SSE and AVX).  In this case the compiler recognized the pattern in the C
> > function above as doing a population count and replaced the entire function with
> > a POPCOUNT instruction even though the C code doesn't explicitly try to use it.
> 
> The runtime error for the unsupported instruction is probably not important,
> since using misconfigured CFLAGS asks for problems.  In general, any new
> instruction may trap, and the mismatch must be small for only popcntq to
> trap.
Most new instructions are in AVX or similar extensions, which are disabled
in the kernel.  popcnt is one of the few instruction set additions useful for
general-purpose algorithms; probably the only other exception is rdrand.

> I think the runtime slowness from a buggy instruction is also unimportant.
> 
^^^^^ This and this:
> 
> Always using new API would lose the micro-optimizations given by the runtime
> decision for default CFLAGS (used by distributions for portability).  To
> keep them, it seems best to keep the inline asm but replace
> popcnt_pc_map_elem(elem) by __bitcount64(elem).  -mno-popcount can then
> be used to work around slowness in the software (that is actually
> hardware) case.

So anybody would have to compile his own kernel to get the popcnt optimization?
We do care about trivial things that improve performance.

BTW, I have the following WIP change, of which the popcnt xorl workaround is
a piece.  It emulates ifuncs with some preprocessor mess.  It is much better
than runtime patching, and is a prerequisite to properly supporting more
things, like SMAP.  I did not publish it earlier, since I wanted to convert
the TLB flush code to this.

This can be converted even further to emulate real ifuncs, by grouping the
selection code for the real function into the method that is initially stored
in the _selector.  I tried this, but IMO, without linker support it results
in more clutter without a gain.
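To make the patch below a bit easier to read: DEFINE_STATIC_IFUNC(void,
fpusave, (void *)) emits an asm stub named fpusave that is just an indirect
"jmp *" through the function pointer fpusave_selector, and boot-time code
then stores the chosen implementation into the selector.  In plain C it is
roughly equivalent to the illustration below, except that the stub avoids
the extra call/ret of a wrapper function:

	/* Illustration only; the real macros are in x86/ifunc.h in the patch. */
	static void (*fpusave_selector)(void *addr);	/* assigned once during boot */

	void
	fpusave(void *addr)
	{

		(*fpusave_selector)(addr);	/* the macro's stub does "jmp *" instead */
	}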

diff --git a/sys/amd64/amd64/fpu.c b/sys/amd64/amd64/fpu.c
index f30c073..135cb85 100644
--- a/sys/amd64/amd64/fpu.c
+++ b/sys/amd64/amd64/fpu.c
@@ -59,6 +59,7 @@ __FBSDID("$FreeBSD$");
 #include <machine/specialreg.h>
 #include <machine/segments.h>
 #include <machine/ucontext.h>
+#include <x86/ifunc.h>
 
 /*
  * Floating point support.
@@ -149,24 +150,35 @@ struct xsave_area_elm_descr {
        u_int   size;
 } *xsave_area_desc;
 
-void
-fpusave(void *addr)
+DEFINE_STATIC_IFUNC(void, fpusave, (void *));
+DEFINE_STATIC_IFUNC(void, fpurestore, (void *));
+
+static void
+fpusave_xsave(void *addr)
 {
 
-       if (use_xsave)
-               xsave((char *)addr, xsave_mask);
-       else
-               fxsave((char *)addr);
+       xsave((char *)addr, xsave_mask);
 }
 
-void
-fpurestore(void *addr)
+static void
+fpurestore_xrstor(void *addr)
 {
 
-       if (use_xsave)
-               xrstor((char *)addr, xsave_mask);
-       else
-               fxrstor((char *)addr);
+       xrstor((char *)addr, xsave_mask);
+}
+
+static void
+fpusave_fxsave(void *addr)
+{
+
+       fxsave((char *)addr);
+}
+
+static void
+fpurestore_fxrstor(void *addr)
+{
+
+       fxrstor((char *)addr);
 }
 
 void
@@ -208,8 +220,14 @@ fpuinit_bsp1(void)
                use_xsave = 1;
                TUNABLE_INT_FETCH("hw.use_xsave", &use_xsave);
        }
-       if (!use_xsave)
+       if (!use_xsave) {
+               fpusave_selector = fpusave_fxsave;
+               fpurestore_selector = fpurestore_fxrstor;
                return;
+       }
+
+       fpusave_selector = fpusave_xsave;
+       fpurestore_selector = fpurestore_xrstor;
 
        cpuid_count(0xd, 0x0, cp);
        xsave_mask = XFEATURE_ENABLED_X87 | XFEATURE_ENABLED_SSE;
diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 6a4077c..f6fbc33 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -138,6 +138,7 @@ __FBSDID("$FreeBSD$");
 
 #include <machine/intr_machdep.h>
 #include <x86/apicvar.h>
+#include <x86/ifunc.h>
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
@@ -404,6 +405,9 @@ SYSCTL_PROC(_vm_pmap, OID_AUTO, pcid_save_cnt, CTLTYPE_U64 | CTLFLAG_RW |
     CTLFLAG_MPSAFE, NULL, 0, pmap_pcid_save_cnt_proc, "QU",
     "Count of saved TLB context on switch");
 
+DEFINE_GLOBAL_IFUNC(void, pmap_invalidate_cache_range,
+    (vm_offset_t sva, vm_offset_t eva));
+
 /*
  * Crashdump maps.
  */
@@ -413,6 +417,7 @@ static void free_pv_chunk(struct pv_chunk *pc);
 static void    free_pv_entry(pmap_t pmap, pv_entry_t pv);
 static pv_entry_t get_pv_entry(pmap_t pmap, struct rwlock **lockp);
 static int     popcnt_pc_map_elem(uint64_t elem);
+static int     popcnt_pc_map_elem_pq(uint64_t elem);
 static vm_page_t reclaim_pv_chunk(pmap_t locked_pmap, struct rwlock **lockp);
 static void    reserve_pv_entries(pmap_t pmap, int needed,
                    struct rwlock **lockp);
@@ -438,6 +443,10 @@ static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte, struct rwlock **lockp);
 static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte);
 static int pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte);
+static void pmap_invalidate_cache_range_selfsnoop(vm_offset_t sva,
+    vm_offset_t eva);
+static void pmap_invalidate_cache_range_all(vm_offset_t sva,
+    vm_offset_t eva);
 static void pmap_kenter_attr(vm_offset_t va, vm_paddr_t pa, int mode);
 static vm_page_t pmap_lookup_pt_page(pmap_t pmap, vm_offset_t va);
 static void pmap_pde_attr(pd_entry_t *pde, int cache_bits, int mask);
@@ -854,6 +863,16 @@ pmap_bootstrap(vm_paddr_t *firstaddr)
        if (cpu_stdext_feature & CPUID_STDEXT_SMEP)
                load_cr4(rcr4() | CR4_SMEP);
 
+       if ((cpu_feature & CPUID_SS) != 0)
+               pmap_invalidate_cache_range_selector =
+                   pmap_invalidate_cache_range_selfsnoop;
+       else if ((cpu_feature & CPUID_CLFSH) != 0)
+               pmap_invalidate_cache_range_selector =
+                   pmap_force_invalidate_cache_range;
+       else
+               pmap_invalidate_cache_range_selector =
+                   pmap_invalidate_cache_range_all;
+
        /*
         * Initialize the kernel pmap (which is statically allocated).
         */
@@ -1729,24 +1748,22 @@ pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 
 #define PMAP_CLFLUSH_THRESHOLD   (2 * 1024 * 1024)
 
-void
-pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force)
+static void
+pmap_invalidate_cache_range_selfsnoop(vm_offset_t sva, vm_offset_t eva)
 {
 
-       if (force) {
-               sva &= ~(vm_offset_t)cpu_clflush_line_size;
-       } else {
-               KASSERT((sva & PAGE_MASK) == 0,
-                   ("pmap_invalidate_cache_range: sva not page-aligned"));
-               KASSERT((eva & PAGE_MASK) == 0,
-                   ("pmap_invalidate_cache_range: eva not page-aligned"));
-       }
+       KASSERT((sva & PAGE_MASK) == 0,
+           ("pmap_invalidate_cache_range: sva not page-aligned"));
+       KASSERT((eva & PAGE_MASK) == 0,
+           ("pmap_invalidate_cache_range: eva not page-aligned"));
+}
 
-       if ((cpu_feature & CPUID_SS) != 0 && !force)
-               ; /* If "Self Snoop" is supported and allowed, do nothing. */
-       else if ((cpu_feature & CPUID_CLFSH) != 0 &&
-           eva - sva < PMAP_CLFLUSH_THRESHOLD) {
+void
+pmap_force_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
+{
 
+       sva &= ~(vm_offset_t)cpu_clflush_line_size;
+       if (eva - sva < PMAP_CLFLUSH_THRESHOLD) {
                /*
                 * XXX: Some CPUs fault, hang, or trash the local APIC
                 * registers if we use CLFLUSH on the local APIC
@@ -1768,16 +1785,22 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force)
                        clflush(sva);
                mfence();
        } else {
-
                /*
-                * No targeted cache flush methods are supported by CPU,
-                * or the supplied range is bigger than 2MB.
+                * The supplied range is bigger than 2MB.
                 * Globally invalidate cache.
                 */
                pmap_invalidate_cache();
        }
 }
 
+static void
+pmap_invalidate_cache_range_all(vm_offset_t sva, vm_offset_t eva)
+{
+
+       pmap_invalidate_cache_range_selfsnoop(sva, eva);
+       pmap_invalidate_cache();
+}
+
 /*
  * Remove the specified set of pages from the data and instruction caches.
  *
@@ -2997,6 +3020,29 @@ popcnt_pc_map_elem(uint64_t elem)
 }
 
 /*
+ * The erratas for Intel processors state that "POPCNT Instruction May
+ * Take Longer to Execute Than Expected".  It is believed that the
+ * issue is the spurious dependency on the destination register.
+ * Provide a hint to the register rename logic that the destination
+ * value is overwritten, by clearing it, as suggested in the
+ * optimization manual.  It should be cheap for unaffected processors
+ * as well.
+ *
+ * Reference numbers for erratas are
+ * 4th Gen Core: HSD146
+ * 5th Gen Core: BDM85
+ */
+static int
+popcnt_pc_map_elem_pq(uint64_t elem)
+{
+       u_long result;
+
+       __asm __volatile("xorl %k0,%k0;popcntq %1,%0"
+           : "=&r" (result) : "rm" (elem));
+       return (result);
+}
+
+/*
  * Ensure that the number of spare PV entries in the specified pmap meets or
  * exceeds the given count, "needed".
  *
@@ -3029,9 +3075,9 @@ retry:
                        free += popcnt_pc_map_elem(pc->pc_map[1]);
                        free += popcnt_pc_map_elem(pc->pc_map[2]);
                } else {
-                       free = popcntq(pc->pc_map[0]);
-                       free += popcntq(pc->pc_map[1]);
-                       free += popcntq(pc->pc_map[2]);
+                       free = popcnt_pc_map_elem_pq(pc->pc_map[0]);
+                       free += popcnt_pc_map_elem_pq(pc->pc_map[1]);
+                       free += popcnt_pc_map_elem_pq(pc->pc_map[2]);
                }
                if (free == 0)
                        break;
@@ -6204,7 +6250,7 @@ pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode)
        for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE)
                pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode);
        pmap_invalidate_range(kernel_pmap, va, va + tmpsize);
-       pmap_invalidate_cache_range(va, va + tmpsize, FALSE);
+       pmap_invalidate_cache_range(va, va + tmpsize);
        return ((void *)(va + offset));
 }
 
@@ -6540,7 +6586,7 @@ pmap_change_attr_locked(vm_offset_t va, vm_size_t size, int mode)
         */
        if (changed) {
                pmap_invalidate_range(kernel_pmap, base, tmpva);
-               pmap_invalidate_cache_range(base, tmpva, FALSE);
+               pmap_invalidate_cache_range(base, tmpva);
        }
        return (error);
 }
diff --git a/sys/amd64/include/pmap.h b/sys/amd64/include/pmap.h
index 868db7d..79a0b32 100644
--- a/sys/amd64/include/pmap.h
+++ b/sys/amd64/include/pmap.h
@@ -394,8 +394,8 @@ void        pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t);
 void   pmap_invalidate_all(pmap_t);
 void   pmap_invalidate_cache(void);
 void   pmap_invalidate_cache_pages(vm_page_t *pages, int count);
-void   pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva,
-           boolean_t force);
+void   pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva);
+void   pmap_force_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva);
 void   pmap_get_mapping(pmap_t pmap, vm_offset_t va, uint64_t *ptr, int *num);
 boolean_t pmap_map_io_transient(vm_page_t *, vm_offset_t *, int, boolean_t);
 void   pmap_unmap_io_transient(vm_page_t *, vm_offset_t *, int, boolean_t);
diff --git a/sys/dev/drm2/i915/intel_ringbuffer.c b/sys/dev/drm2/i915/intel_ringbuffer.c
index 2251220..6eb1d85 100644
--- a/sys/dev/drm2/i915/intel_ringbuffer.c
+++ b/sys/dev/drm2/i915/intel_ringbuffer.c
@@ -384,8 +384,8 @@ init_pipe_control(struct intel_ring_buffer *ring)
        if (pc->cpu_page == NULL)
                goto err_unpin;
        pmap_qenter((uintptr_t)pc->cpu_page, &obj->pages[0], 1);
-       pmap_invalidate_cache_range((vm_offset_t)pc->cpu_page,
-           (vm_offset_t)pc->cpu_page + PAGE_SIZE, FALSE);
+       pmap_force_invalidate_cache_range((vm_offset_t)pc->cpu_page,
+           (vm_offset_t)pc->cpu_page + PAGE_SIZE);
 
        pc->obj = obj;
        ring->private = pc;
@@ -968,8 +968,9 @@ static int init_status_page(struct intel_ring_buffer *ring)
        }
        pmap_qenter((vm_offset_t)ring->status_page.page_addr, &obj->pages[0],
            1);
-       pmap_invalidate_cache_range((vm_offset_t)ring->status_page.page_addr,
-           (vm_offset_t)ring->status_page.page_addr + PAGE_SIZE, FALSE);
+       pmap_force_invalidate_cache_range(
+           (vm_offset_t)ring->status_page.page_addr,
+           (vm_offset_t)ring->status_page.page_addr + PAGE_SIZE);
        ring->status_page.obj = obj;
        memset(ring->status_page.page_addr, 0, PAGE_SIZE);
 
diff --git a/sys/i386/i386/pmap.c b/sys/i386/i386/pmap.c
index 68b44e9..58db621 100644
--- a/sys/i386/i386/pmap.c
+++ b/sys/i386/i386/pmap.c
@@ -143,6 +143,7 @@ __FBSDID("$FreeBSD$");
 #include <machine/intr_machdep.h>
 #include <x86/apicvar.h>
 #endif
+#include <x86/ifunc.h>
 #include <machine/cpu.h>
 #include <machine/cputypes.h>
 #include <machine/md_var.h>
@@ -263,6 +264,9 @@ caddr_t ptvmmap = 0;
 caddr_t CADDR3;
 struct msgbuf *msgbufp = 0;
 
+DEFINE_GLOBAL_IFUNC(void, pmap_invalidate_cache_range,
+    (vm_offset_t sva, vm_offset_t eva));
+
 /*
  * Crashdump maps.
  */
@@ -305,6 +309,10 @@ static vm_page_t pmap_enter_quick_locked(pmap_t pmap, vm_offset_t va,
     vm_page_t m, vm_prot_t prot, vm_page_t mpte);
 static void pmap_flush_page(vm_page_t m);
 static int pmap_insert_pt_page(pmap_t pmap, vm_page_t mpte);
+static void pmap_invalidate_cache_range_selfsnoop(vm_offset_t sva,
+    vm_offset_t eva);
+static void pmap_invalidate_cache_range_all(vm_offset_t sva,
+    vm_offset_t eva);
 static void pmap_fill_ptp(pt_entry_t *firstpte, pt_entry_t newpte);
 static boolean_t pmap_is_modified_pvh(struct md_page *pvh);
 static boolean_t pmap_is_referenced_pvh(struct md_page *pvh);
@@ -509,6 +517,16 @@ pmap_bootstrap(vm_paddr_t firstaddr)
 
        /* Turn on PG_G on kernel page(s) */
        pmap_set_pg();
+
+       if ((cpu_feature & CPUID_SS) != 0)
+               pmap_invalidate_cache_range_selector =
+                   pmap_invalidate_cache_range_selfsnoop;
+       else if ((cpu_feature & CPUID_CLFSH) != 0)
+               pmap_invalidate_cache_range_selector =
+                   pmap_force_invalidate_cache_range;
+       else
+               pmap_invalidate_cache_range_selector =
+                   pmap_invalidate_cache_range_all;
 }
 
 /*
@@ -1182,25 +1200,22 @@ pmap_update_pde(pmap_t pmap, vm_offset_t va, pd_entry_t *pde, pd_entry_t newpde)
 
 #define        PMAP_CLFLUSH_THRESHOLD  (2 * 1024 * 1024)
 
-void
-pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force)
+static void
+pmap_invalidate_cache_range_selfsnoop(vm_offset_t sva, vm_offset_t eva)
 {
 
-       if (force) {
-               sva &= ~(vm_offset_t)cpu_clflush_line_size;
-       } else {
-               KASSERT((sva & PAGE_MASK) == 0,
-                   ("pmap_invalidate_cache_range: sva not page-aligned"));
-               KASSERT((eva & PAGE_MASK) == 0,
-                   ("pmap_invalidate_cache_range: eva not page-aligned"));
-       }
+       KASSERT((sva & PAGE_MASK) == 0,
+           ("pmap_invalidate_cache_range: sva not page-aligned"));
+       KASSERT((eva & PAGE_MASK) == 0,
+           ("pmap_invalidate_cache_range: eva not page-aligned"));
+}
 
-       if ((cpu_feature & CPUID_SS) != 0 && !force)
-               ; /* If "Self Snoop" is supported and allowed, do nothing. */
-       else if ((cpu_feature & CPUID_CLFSH) != 0 &&
-           eva - sva < PMAP_CLFLUSH_THRESHOLD) {
+void
+pmap_force_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva)
+{
 
-#ifdef DEV_APIC
+       sva &= ~(vm_offset_t)cpu_clflush_line_size;
+       if (eva - sva < PMAP_CLFLUSH_THRESHOLD) {
                /*
                 * XXX: Some CPUs fault, hang, or trash the local APIC
                 * registers if we use CLFLUSH on the local APIC
@@ -1209,7 +1224,7 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force)
                 */
                if (pmap_kextract(sva) == lapic_paddr)
                        return;
-#endif
+
                /*
                 * Otherwise, do per-cache line flush.  Use the mfence
                 * instruction to insure that previous stores are
@@ -1222,16 +1237,22 @@ pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva, boolean_t force)
                        clflush(sva);
                mfence();
        } else {
-
                /*
-                * No targeted cache flush methods are supported by CPU,
-                * or the supplied range is bigger than 2MB.
+                * The supplied range is bigger than 2MB.
                 * Globally invalidate cache.
                 */
                pmap_invalidate_cache();
        }
 }
 
+static void
+pmap_invalidate_cache_range_all(vm_offset_t sva, vm_offset_t eva)
+{
+
+       pmap_invalidate_cache_range_selfsnoop(sva, eva);
+       pmap_invalidate_cache();
+}
+
 void
 pmap_invalidate_cache_pages(vm_page_t *pages, int count)
 {
@@ -5179,7 +5200,7 @@ pmap_mapdev_attr(vm_paddr_t pa, vm_size_t size, int mode)
        for (tmpsize = 0; tmpsize < size; tmpsize += PAGE_SIZE)
                pmap_kenter_attr(va + tmpsize, pa + tmpsize, mode);
        pmap_invalidate_range(kernel_pmap, va, va + tmpsize);
-       pmap_invalidate_cache_range(va, va + size, FALSE);
+       pmap_invalidate_cache_range(va, va + size);
        return ((void *)(va + offset));
 }
 
@@ -5385,7 +5406,7 @@ pmap_change_attr(vm_offset_t va, vm_size_t size, int mode)
         */
        if (changed) {
                pmap_invalidate_range(kernel_pmap, base, tmpva);
-               pmap_invalidate_cache_range(base, tmpva, FALSE);
+               pmap_invalidate_cache_range(base, tmpva);
        }
        return (0);
 }
diff --git a/sys/i386/i386/vm_machdep.c b/sys/i386/i386/vm_machdep.c
index ebd177a..a9e1bfe 100644
--- a/sys/i386/i386/vm_machdep.c
+++ b/sys/i386/i386/vm_machdep.c
@@ -876,7 +876,7 @@ sf_buf_invalidate(struct sf_buf *sf)
         * settings are recalculated.
         */
        pmap_qenter(sf->kva, &m, 1);
-       pmap_invalidate_cache_range(sf->kva, sf->kva + PAGE_SIZE, FALSE);
+       pmap_invalidate_cache_range(sf->kva, sf->kva + PAGE_SIZE);
 }
 
 /*
diff --git a/sys/i386/include/pmap.h b/sys/i386/include/pmap.h
index 05656cd..936c930 100644
--- a/sys/i386/include/pmap.h
+++ b/sys/i386/include/pmap.h
@@ -458,8 +458,8 @@ void        pmap_invalidate_range(pmap_t, vm_offset_t, vm_offset_t);
 void   pmap_invalidate_all(pmap_t);
 void   pmap_invalidate_cache(void);
 void   pmap_invalidate_cache_pages(vm_page_t *pages, int count);
-void   pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva,
-           boolean_t force);
+void   pmap_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva);
+void   pmap_force_invalidate_cache_range(vm_offset_t sva, vm_offset_t eva);
 
 #endif /* _KERNEL */
 
diff --git a/sys/i386/isa/npx.c b/sys/i386/isa/npx.c
index b0ea4e7..1dc9c3d 100644
--- a/sys/i386/isa/npx.c
+++ b/sys/i386/isa/npx.c
@@ -67,6 +67,7 @@ __FBSDID("$FreeBSD$");
 #include <machine/specialreg.h>
 #include <machine/segments.h>
 #include <machine/ucontext.h>
+#include <x86/ifunc.h>
 
 #include <machine/intr_machdep.h>
 #ifdef XEN
@@ -211,7 +212,6 @@ CTASSERT(X86_XSTATE_XCR0_OFFSET >= offsetof(struct savexmm, sv_pad) &&
 static void    fpu_clean_state(void);
 #endif
 
-static void    fpusave(union savefpu *);
 static void    fpurstor(union savefpu *);
 
 int    hw_float;
@@ -231,8 +231,6 @@ struct xsave_area_elm_descr {
        u_int   offset;
        u_int   size;
 } *xsave_area_desc;
-
-static int use_xsaveopt;
 #endif
 
 static volatile u_int          npx_traps_while_probing;
@@ -349,7 +347,37 @@ cleanup:
        return (hw_float);
 }
 
-#ifdef CPU_ENABLE_SSE
+DEFINE_STATIC_IFUNC(void, npxsave_core, (union savefpu *));
+DEFINE_STATIC_IFUNC(void, fpusave, (union savefpu *));
+
+static void
+npxsave_xsaveopt(union savefpu *addr)
+{
+
+       xsaveopt((char *)addr, xsave_mask);
+}
+
+static void
+fpusave_xsave(union savefpu *addr)
+{
+
+       xsave((char *)addr, xsave_mask);
+}
+
+static void
+fpusave_fxsave(union savefpu *addr)
+{
+
+       fxsave((char *)addr);
+}
+
+static void
+fpusave_fnsave(union savefpu *addr)
+{
+
+       fnsave((char *)addr);
+}
+
 /*
  * Enable XSAVE if supported and allowed by user.
  * Calculate the xsave_mask.
@@ -357,6 +385,7 @@ cleanup:
 static void
 npxinit_bsp1(void)
 {
+#ifdef CPU_ENABLE_SSE
        u_int cp[4];
        uint64_t xsave_mask_user;
 
@@ -364,8 +393,18 @@ npxinit_bsp1(void)
                use_xsave = 1;
                TUNABLE_INT_FETCH("hw.use_xsave", &use_xsave);
        }
-       if (!use_xsave)
+       if (!use_xsave) {
+               if (cpu_fxsr) {
+                       npxsave_core_selector = fpusave_fxsave;
+                       fpusave_selector = fpusave_fxsave;
+               } else {
+#endif
+                       npxsave_core_selector = fpusave_fnsave;
+                       fpusave_selector = fpusave_fnsave;
+#ifdef CPU_ENABLE_SSE
+               }
                return;
+       }
 
        cpuid_count(0xd, 0x0, cp);
        xsave_mask = XFEATURE_ENABLED_X87 | XFEATURE_ENABLED_SSE;
@@ -382,12 +421,13 @@ npxinit_bsp1(void)
                xsave_mask &= ~XFEATURE_MPX;
 
        cpuid_count(0xd, 0x1, cp);
-       if ((cp[0] & CPUID_EXTSTATE_XSAVEOPT) != 0)
-               use_xsaveopt = 1;
-}
+       npxsave_core_selector = (cp[0] & CPUID_EXTSTATE_XSAVEOPT) != 0 ?
+           npxsave_xsaveopt : fpusave_xsave;
+       fpusave_selector = fpusave_xsave;
 #endif
-/*
+}
 
+/*
  * Calculate the fpu save area size.
  */
 static void
@@ -426,9 +466,7 @@ npxinit(bool bsp)
        if (bsp) {
                if (!npx_probe())
                        return;
-#ifdef CPU_ENABLE_SSE
                npxinit_bsp1();
-#endif
        }
 
 #ifdef CPU_ENABLE_SSE
@@ -917,17 +955,11 @@ npxdna(void)
  * npxsave() atomically with checking fpcurthread.
  */
 void
-npxsave(addr)
-       union savefpu *addr;
+npxsave(union savefpu *addr)
 {
 
        stop_emulating();
-#ifdef CPU_ENABLE_SSE
-       if (use_xsaveopt)
-               xsaveopt((char *)addr, xsave_mask);
-       else
-#endif
-               fpusave(addr);
+       npxsave_core(addr);
        start_emulating();
        PCPU_SET(fpcurthread, NULL);
 }
@@ -1153,21 +1185,6 @@ npxsetregs(struct thread *td, union savefpu *addr, char *xfpustate,
        return (0);
 }
 
-static void
-fpusave(addr)
-       union savefpu *addr;
-{
-       
-#ifdef CPU_ENABLE_SSE
-       if (use_xsave)
-               xsave((char *)addr, xsave_mask);
-       else if (cpu_fxsr)
-               fxsave(addr);
-       else
-#endif
-               fnsave(addr);
-}
-
 #ifdef CPU_ENABLE_SSE
 /*
  * On AuthenticAMD processors, the fxrstor instruction does not restore
diff --git a/sys/x86/include/ifunc.h b/sys/x86/include/ifunc.h
new file mode 100644
index 0000000..a708104
--- /dev/null
+++ b/sys/x86/include/ifunc.h
@@ -0,0 +1,53 @@
+/*-
+ * Copyright (c) 2015 The FreeBSD Foundation
+ * All rights reserved.
+ *
+ * This software was developed by Konstantin Belousov <k...@freebsd.org>
+ * under sponsorship from the FreeBSD Foundation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ *
+ * $FreeBSD$
+ */
+
+#ifndef __X86_IFUNC_H
+#define        __X86_IFUNC_H
+
+#define        DECLARE_IFUNC(ret_type, name, args)                             \
+ret_type name args
+
+#define        DEFINE_IFUNC(scope, selector_qual, ret_type, name, args)        \
+__asm__ (scope "\t" #name "\n"                                         \
+        "\t.type\t" #name ",@function\n"                               \
+        #name ":\n"                                                    \
+        "\tjmp *" #name "_selector\n"                                  \
+        "\t.size\t" #name ",\t. - "#name);                             \
+selector_qual ret_type (*name##_selector)args  __used;                 \
+DECLARE_IFUNC(ret_type, name, args)
+
+#define        DEFINE_STATIC_IFUNC(ret_type, name, args)                       \
+       DEFINE_IFUNC(".local", static, ret_type, name, args)
+
+#define        DEFINE_GLOBAL_IFUNC(ret_type, name, args)                       \
+       DEFINE_IFUNC(".globl", , ret_type, name, args)
+
+#endif
diff --git a/sys/x86/iommu/intel_utils.c b/sys/x86/iommu/intel_utils.c
index f696f9d..1c96191 100644
--- a/sys/x86/iommu/intel_utils.c
+++ b/sys/x86/iommu/intel_utils.c
@@ -374,8 +374,7 @@ dmar_flush_transl_to_ram(struct dmar_unit *unit, void *dst, size_t sz)
         * If DMAR does not snoop paging structures accesses, flush
         * CPU cache to memory.
         */
-       pmap_invalidate_cache_range((uintptr_t)dst, (uintptr_t)dst + sz,
-           TRUE);
+       pmap_force_invalidate_cache_range((uintptr_t)dst, (uintptr_t)dst + sz);
 }
 
 void


_______________________________________________
svn-src-head@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/svn-src-head
To unsubscribe, send any mail to "svn-src-head-unsubscr...@freebsd.org"
