I am sponsoring this case for Rick Weisner.

Requested release binding: Patch

Modified man pages are in the case's materials directory and diffs 
are at the end of this proposal.


Template Version: @(#)sac_nextcase 1.70 03/30/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates. All 
rights reserved.
1. Introduction
    1.1. Project/Component Working Name:
         Performance Improvements for libmtmalloc
    1.2. Name of Document Author/Supplier:
         Author:  Rick Weisner
    1.3  Date of This Document:
        08 June, 2010
4. Technical Description

    SUMMARY

        libmtmalloc has shown poor scalability in the following
        two situations:

        1. When there are large numbers of allocating threads.
           (see CR6922229)

           and 

        2. When the allocation size is larger than 64 KB.
           (see CR6555149)

        We will remedy the above scalability issues by:

        1) Using atomic operations to eliminate the cache lock in
        libmtmalloc.

        2) Providing a mechanism whereby the parent lock can also
        be eliminated for threads whose ID is less than 2 * the number
        of CPUs.

        3) Making the maximum cacheable request size tunable via an
        environment variable.

    BACKGROUND
        libmtmalloc organizes available address space into buckets.
        Each thread which calls malloc is assigned a bucket based
        upon its thread ID. The per-bucket parent lock controls
        the use of each bucket. Each bucket is a list of caches
        based on size. Each list is protected by a cache lock.
        Applications with a large number of allocating threads may
        have their performance limited by contention for these locks.
        Applications of this sort are not unusual in the Telco space.
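
        The bucket assignment can be pictured with the sketch below. It
        is illustrative only and is not taken from the libmtmalloc
        source: the function name is hypothetical, the modulo mapping
        is an assumption, and so is the way threads with larger IDs
        share buckets once the exclusive mode proposed in this case is
        enabled. Only the bucket counts (2 * NCPUS by default,
        4 * NCPUS in exclusive mode) come from the man page changes.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not the libmtmalloc source: map a thread ID
 * to a bucket index.
 *
 * Shared mode:    2 * ncpus buckets; threads share by ID (modulo is
 *                 an assumption).
 * Exclusive mode: 4 * ncpus buckets; IDs below 2 * ncpus each own
 *                 one bucket, all other IDs share the upper half
 *                 (the sharing rule for the upper half is assumed).
 */
size_t
bucket_for_thread(size_t tid, size_t ncpus, int exclusive)
{
	size_t nshared;

	assert(ncpus > 0);
	nshared = 2 * ncpus;

	if (!exclusive)
		return (tid % nshared);
	if (tid < nshared)
		return (tid);			/* private bucket */
	return (nshared + (tid % nshared));	/* shared bucket */
}
```

        With 4 CPUs, for example, threads 5 and 13 would land in the
        same bucket in shared mode, while in exclusive mode thread 5
        would have bucket 5 to itself.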

        Larger allocation sizes are also becoming more common. With
        64-bit applications, terabytes of memory, and hundreds of
        threads, it is advantageous to be able to adjust the
        maximum cacheable request size to better suit the needs
        of the application.

    PROBLEM
        A customer's application did not perform as needed on a
        Netra T5440. DTrace indicated lock contention relating to
        memory allocation in libmtmalloc. The customer supplied
        code that yielded dramatic performance increases by
        eliminating the "cache" locks and "parent" locks from
        libmtmalloc and replacing them with atomic operations.
        The customer's code was not thread-safe in general but was
        promising.
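
        The general idea of replacing a lock with an atomic operation
        can be sketched as follows. This is neither the customer's
        code nor the proposed libmtmalloc change. It assumes that a
        cache tracks its free chunks in a 32-bit mask, it uses C11
        <stdatomic.h> rather than the Solaris atomic_ops(3C) routines,
        and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache: bit i set means chunk i is free. */
typedef struct cache {
	_Atomic uint32_t free_mask;
} cache_t;

/*
 * Claim one free chunk without taking a lock. Returns the chunk
 * index, or -1 if no chunk is free.
 */
int
cache_claim(cache_t *cp)
{
	uint32_t old;

	assert(cp != NULL);
	old = atomic_load(&cp->free_mask);

	while (old != 0) {
		uint32_t bit = old & (~old + 1);	/* lowest set bit */

		/* On failure, old is refreshed with the current mask. */
		if (atomic_compare_exchange_weak(&cp->free_mask, &old,
		    old & ~bit)) {
			int idx = 0;

			while ((bit >>= 1) != 0)
				idx++;
			return (idx);
		}
	}
	return (-1);
}

/* Return chunk idx to the cache; a single atomic OR suffices. */
void
cache_release(cache_t *cp, int idx)
{
	assert(cp != NULL && idx >= 0 && idx < 32);
	atomic_fetch_or(&cp->free_mask, (uint32_t)1 << idx);
}
```

        Two threads racing for the same chunk both attempt the
        compare-and-swap; one succeeds and the other retries against
        the updated mask, so neither has to block on a mutex.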

        In a different case the customer states:

        We observed that db is hitting oversize_lock mutex due to the
        memory needed to be allocated is more than MAX_CACHED.
        Sometimes acquiring the oversize_lock mutex is taking more 
        than 2sec, causing the db performance to degrade. (see 6555149)
        

    PROPOSAL

        1) Eliminate the cache lock by using atomic operations.

        2) Add a new option to mallocctl(3MALLOC) that activates
        the use of exclusive buckets for threads whose ID is less
        than 2 * the number of CPUs.

        The value argument associated with the mallocctl option is
        ignored.

        The use of exclusive buckets can also be activated by
        setting an environment variable named MTEXCLUSIVE.

        The environment variable is needed for situations where the
        source code is unavailable. It will also assist in
        performance analysis.

        Once the option has been set, there is no facility to
        unset it.

        3) Introduce the environment variable MTMAXCACHE, which
        sets the maximum request size that is cached. It accepts
        values from 16 to 21. The default is 16, which means that
        requests smaller than 2^^16 bytes are cached. The largest
        value, 21, allows request sizes up to 2 MB (2^^21) to be
        cached.

        If MTMAXCACHE is set to a value outside this range, the
        nearer bound is used instead: 16 for values below the range
        and 21 for values above it.

        An environment variable is necessary instead of a mallocctl
        interface because MTMAXCACHE must be determined before
        malloc_init calls setup_caches.
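
        The clamping rule for MTMAXCACHE can be sketched as below.
        This is a minimal illustration, not the library code: the
        function and macro names are hypothetical, and treating a
        non-numeric value like a value below the range is an
        assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define	MAXCACHE_MIN_SHIFT	16	/* default: 2^16 bytes */
#define	MAXCACHE_MAX_SHIFT	21	/* upper bound: 2^21 bytes (2 MB) */

/*
 * Hypothetical helper: convert the MTMAXCACHE string into a shift
 * clamped to [16, 21]. An unset variable (NULL) yields the default.
 * A non-numeric string parses as 0 and is clamped to 16 (assumed).
 */
int
maxcache_shift(const char *value)
{
	long shift;

	if (value == NULL)
		return (MAXCACHE_MIN_SHIFT);
	shift = strtol(value, NULL, 10);
	if (shift < MAXCACHE_MIN_SHIFT)
		return (MAXCACHE_MIN_SHIFT);
	if (shift > MAXCACHE_MAX_SHIFT)
		return (MAXCACHE_MAX_SHIFT);
	return ((int)shift);
}

/* Cache size bound in bytes for an already clamped shift. */
size_t
maxcache_bytes(int shift)
{
	assert(shift >= MAXCACHE_MIN_SHIFT && shift <= MAXCACHE_MAX_SHIFT);
	return ((size_t)1 << shift);
}
```

        A caller would pass getenv("MTMAXCACHE") to maxcache_shift()
        during initialization, before any caches are set up.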

    DETAILS

        The code has been developed and tested in 64-bit mode on
        Solaris 10 u6 on a Netra T5440. The test harness uses a
        configurable number of allocation threads, a configurable
        sample count, and a configurable "maximum" allocation size.
        Each allocation thread performs a configurable number of
        random or fixed-size allocations, sized between 8 and the
        requested "max" allocation size plus half the "max"
        allocation size.

        A freeing thread then releases the allocations while the
        allocating thread performs a fresh set of allocations.

        In initial testing with "stock" libmtmalloc it was possible to
        do 6300 64-bit operations per second on the T5440. With the
        "atomic" library this increases to 15000, roughly 2.4 times
        the stock rate.

    COMMENTS
        Exported Interfaces:

        MTEXCLUSIVE     Committed       option for mallocctl(3MALLOC).
        

        MTEXCLUSIVE     Committed       Shell environment variable. If set,
                                        then the effect is the same as if 
                                        mallocctl was called with the 
                                        option MTEXCLUSIVE.

        MTMAXCACHE      Committed       Shell environment variable. If set,
                                        the value sets the maximum cacheable
                                        request size to 2^^MTMAXCACHE.

        Reference:
        6922229 libmtmalloc would benefit from atomic operations
        6555149 poor performance with libmtmalloc compared to libc
        6956786 Provide a tunable to tweak the MAX_CACHED threshold
                in libmtmalloc

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

Man page diffs:

*** libmtmalloc.man     Thu Jun  3 15:46:52 2010
--- new_libmtmalloc.man Thu Jun  3 16:09:44 2010
***************
*** 28,34 ****
--- 28,58 ----
       mallocctl                     memalign
       realloc                       valloc
  
+ ENVIRONMENT VARIABLES
+      MTEXCLUSIVE       By default, libmtmalloc allocates 2*NCPUS
+                        buckets from which allocations occur.
+                        Threads share buckets based on their thread
+                        id. If MTEXCLUSIVE is invoked, then 4*NCPUS
+                        buckets are used. Threads with thread id less
+                        than 2*NCPUS receive an exclusive bucket and
+                        thus do not need to use locks. Allocation
+                        performance for these buckets may be dramatically
+                        increased. Once enabled, MTEXCLUSIVE cannot be
+                        disabled. This feature can be enabled by
+                        setting the environment variable MTEXCLUSIVE to
+                        any value. Alternatively, it can be enabled by
+                        a call to mallocctl() (see mallocctl(3MALLOC)).
  
+      MTMAXCACHE        By default, allocations less than 2^^16 bytes
+                        are allocated from buckets indexed by thread id.
+                        Using this environment variable, the size of the
+                        cached allocations can be increased to 2^^17,
+                        2^^18, 2^^19, 2^^20, or 2^^21 bytes by
+                        setting MTMAXCACHE to 17, 18, 19, 20, or 21.
+                        If MTMAXCACHE is set to less than 16, it is
+                        reset to 16. If MTMAXCACHE is set to more than
+                        21, then it is reset to 21. This all occurs
+                        silently.
  FILES
       /usr/lib/libmtmalloc.so.1


*** mallocctl.man       Thu Jun  3 15:37:18 2010
--- new_mallocctl.man   Thu Jun  3 15:45:41 2010
***************
*** 164,170 ****
--- 164,183 ----
                         256. The default value is  9.  This  value
                         is multiplied by 8192.
      
+      MTEXCLUSIVE       By default, libmtmalloc allocates 2*NCPUS
+                        buckets from which allocations occur.
+                        Threads share buckets based on their thread
+                        id. If MTEXCLUSIVE is invoked, then 4*NCPUS
+                        buckets are used. Threads with thread id less
+                        than 2*NCPUS receive an exclusive bucket and
+                        thus do not need to use locks. Allocation
+                        performance for these buckets may be dramatically
+                        increased. Once enabled, MTEXCLUSIVE cannot be
+                        disabled. This feature can also be enabled by
+                        setting the environment variable MTEXCLUSIVE to
+                        any value.
  
+ 
  RETURN VALUES
       If  there  is  no  available  memory,  malloc(),  realloc(),
       memalign(),  and  valloc() return a null pointer. When real-
***************
*** 224,230 ****
       brk(2),   getrlimit(2),   bsdmalloc(3MALLOC),    dlopen(3C),
       malloc(3C),       malloc(3MALLOC),       mapmalloc(3MALLOC),
       signal.h(3HEAD), umem_alloc(3MALLOC),  watchmalloc(3MALLOC),
!      attributes(5)
  
  WARNINGS
       Undefined results will occur if the  size  requested  for  a
--- 237,243 ----
       brk(2),   getrlimit(2),   bsdmalloc(3MALLOC),    dlopen(3C),
       malloc(3C),       malloc(3MALLOC),       mapmalloc(3MALLOC),
       signal.h(3HEAD), umem_alloc(3MALLOC),  watchmalloc(3MALLOC),
!      libmtmalloc(3LIB), attributes(5)
  
  WARNINGS
       Undefined results will occur if the  size  requested  for  a
