On Thu, 14 Jun 2007, Matthew Dillon wrote:

   I'm going to throw a wrench in the works, because it all gets turned
   around the moment you find yourself in a SMP environment where several
   threads are running on different cpus at the same time, using the
   same shared VM space.

   The moment you have a situation like that where you are futzing with
   the page tables, i.e. using mmap() for demand-zero and munmap() to
   free, the operation becomes extremely expensive versus anything
   else because any update to the page table (specifically any removal
   of page table entries from the page table) requires a SMP synchronization
   to occur between all the cpus actively sharing that VM space, and
   that's on top of the overhead of taking the page fault(s).

   This is true of any memory mapping the kernel has to do in kernel
   virtual memory (must be synchronized with ALL cpus) and any mapping
   the kernel does on behalf of userland for user memory (must be
   synchronized with any cpus actively using that VM space, i.e. threaded
   user programs).  The synchronization is required to properly invalidate
   stale mappings on other cpus and it must be done synchronously due
   to bugs in Intel/AMD related to changing page table entries on one
   cpu when instructions are executing using that memory on another cpu.
   There is no way to avoid it without tripping up on the Intel/AMD hardware
   bugs.

   From this point of view it is much, much better to bzero() memory that
   is already mapped than it is to map/unmap new memory.  I recently
   audited DragonFly and found an insane number of IPIs flying about due
   to PAGE_SIZE'd kernel mallocs using the VM trick via kernel_map &
   kmem_alloc().  They all went away when I made the kernel malloc use
   the slab cache for allocations up to and including PAGE_SIZE*2 bytes.

   Fun, eh?

                                        -Matt
                                        Matthew Dillon
                                        <[EMAIL PROTECTED]>

I have no intention of calling malloc/calloc, then free, and then repeating
the same cycle. It's better just to reuse the memory already allocated, size
permitting.
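
To make the contrast concrete, here is a rough sketch of the two patterns
discussed above. The function names and BUF_SIZE are made up for
illustration and don't come from any real code base; it only shows the shape
of each approach and doesn't measure anything.

```c
/* glibc: expose MAP_ANON under strict -std modes; harmless elsewhere. */
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MAP_ANON
#define MAP_ANON MAP_ANONYMOUS
#endif

#define BUF_SIZE 4096   /* assumed buffer size (one page on many systems) */

/*
 * Pattern A: map fresh demand-zero memory and unmap it on every pass.
 * Each pass changes the page tables twice and takes a page fault on the
 * first touch. Returns 0 on success, -1 on failure.
 */
static int map_unmap_each_time(int iterations)
{
    for (int i = 0; i < iterations; i++) {
        unsigned char *p = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANON, -1, 0);
        if (p == MAP_FAILED)
            return -1;
        p[0] = 1;                       /* first touch faults the page in */
        if (munmap(p, BUF_SIZE) != 0)   /* removes the mapping again */
            return -1;
    }
    return 0;
}

/*
 * Pattern B: keep one buffer that is already mapped and clear it before
 * each reuse. No mappings are created or removed inside the loop.
 */
static int reuse_and_zero(unsigned char *buf, size_t len, int iterations)
{
    for (int i = 0; i < iterations; i++) {
        memset(buf, 0, len);            /* same effect as bzero(buf, len) */
        buf[0] = 1;                     /* stand-in for real work */
    }
    return 0;
}
```

In a threaded program on SMP, pattern A is the one that would trigger the
cross-cpu synchronization described above on each munmap(); pattern B only
writes to memory that is already mapped.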

I hadn't thought about it that closely (ISA/hardware config versus OS
implementation), but I had my suspicions, since the AMD64 architecture is very
different from the PowerPC architecture in terms of word size, synchronization
schemes, instruction count, etc.

Interesting insight though. Thanks :).

-Garrett

_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "[EMAIL PROTECTED]"
