I am sponsoring this case for Rick Weisner. Requested release binding: Patch
Modified man pages are in the case's materials directory and diffs are
at the end of this proposal.

Template Version: @(#)sac_nextcase 1.70 03/30/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates.
All rights reserved.
1. Introduction
    1.1. Project/Component Working Name:
         Performance Improvements for libmtmalloc
    1.2. Name of Document Author/Supplier:
         Author:  Rick Weisner
    1.3  Date of This Document:
         08 June, 2010
4. Technical Description

Template Version: @(#)sac_nextcase 1.70 05/10/10 SMI
This information is Copyright (c) 2010, Oracle and/or its affiliates.
All rights reserved.
1. Introduction
    1.1. Project/Component Working Name:
         Performance Improvements for libmtmalloc
    1.2. Name of Document Author/Supplier:
         Author:  Rick Weisner
    1.3  Date of This Document:
         01 June, 2010
4. Technical Description

SUMMARY

libmtmalloc has shown poor scalability in the following two situations:

    1. When there are large numbers of allocating threads
       (see CR6922229), and
    2. When the allocation size is larger than 64 KB
       (see CR6555149).

We will remedy these scalability issues by:

    1) Using atomic operations to eliminate the cache lock in
       libmtmalloc.
    2) Providing a mechanism whereby the parent lock can also be
       eliminated for threads whose ID is less than 2 * the number
       of CPUs.
    3) Making the maximum cacheable request size tunable via an
       environment variable.

BACKGROUND

libmtmalloc organizes available address space into buckets. Each
thread that calls malloc is assigned a bucket based upon its thread
ID. The per-bucket parent lock controls the use of each bucket. Each
bucket is a list of caches based on size. Each list is protected by a
cache lock. Applications with a large number of allocating threads may
have their performance limited by contention for these locks.
Applications of this sort are not unusual in the Telco space.
Larger allocation sizes are also becoming more common.
With 64 bit applications, terabytes of memory, and hundreds of threads
it is advantageous to be able to adjust the maximum cacheable request
size to better suit the needs of the application.

PROBLEM

A customer's application did not perform as needed on a Netra 5440.
DTrace indicated lock contention relating to memory allocation in
libmtmalloc. The customer supplied code that gave dramatic performance
increases by eliminating the "cache" locks and "parent" locks from
libmtmalloc and replacing them with atomic operations. The customer's
code was not thread safe in general but was promising.

In a different case the customer states:

    We observed that db is hitting oversize_lock mutex due to the
    memory needed to be allocated is more than MAX_CACHED. Sometimes
    acquiring the oversize_lock mutex is taking more than 2sec,
    causing the db performance to degrade.
    (see 6555149)

PROPOSAL

    1) Eliminate the cache lock by using atomic operations.

    2) Add a new option to mallocctl(3MALLOC) that activates the use
       of exclusive buckets for threads whose ID is < 2 * the number
       of CPUs. The value argument associated with the mallocctl
       option is ignored. The use of exclusive buckets can also be
       activated by setting an environment variable named
       MTEXCLUSIVE. The environment variable is needed for situations
       where the source code is unavailable, and it will also assist
       in performance analysis. Once the option has been set there is
       no facility to 'unset' it.

    3) Introduce the environment variable MTMAXCACHE, which sets the
       maximum request size that is cached. It accepts values from 16
       to 21. The default is 16, which means that requests of less
       than 2^^16 bytes are cached. With a value of 21 we can support
       request sizes of up to 2 MB (2^^21) in cache. If MTMAXCACHE is
       set to a value outside this range, whichever bound was
       exceeded (16 or 21) is used instead.
       It is necessary to use an environment variable instead of a
       mallocctl interface because MTMAXCACHE must be determined
       before malloc_init calls setup_caches.

DETAILS

The code has been developed and tested in 64 bit mode on Solaris 10 u6
on a Netra T5440. The test harness uses a configurable number of
allocation threads, a configurable sample count, and a configurable
"maximum" allocation size. Each allocation thread performs a
configurable number of random or fixed size allocations, with sizes
between 8 and 1.5 times the requested "max" allocation size (the "max"
size plus half of it). A freeing thread then releases the allocations
while the allocating thread performs a fresh set of allocations.

In initial testing with "stock" libmtmalloc it was possible to do 6300
64 bit operations per sec on the N5440. With the "atomic" library this
increases to 15000.

COMMENTS

Exported Interfaces:

    MTEXCLUSIVE   Committed   Option for mallocctl(3MALLOC).
    MTEXCLUSIVE   Committed   Shell environment variable. If set,
                              the effect is the same as if mallocctl
                              was called with the option MTEXCLUSIVE.
    MTMAXCACHE    Committed   Shell environment variable. If set,
                              the value sets the maximum cacheable
                              request size to 2^^MTMAXCACHE.

Reference:

    6922229 libmtmalloc would benefit from atomic operations
    6555149 poor performance with libmtmalloc compared to libc
    6956786 Provide a tunable to tweak the MAX_CACHED threshold in
            libmtmalloc

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
               ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

Man page diffs:

*** libmtmalloc.man     Thu Jun  3 15:46:52 2010
--- new_libmtmalloc.man Thu Jun  3 16:09:44 2010
***************
*** 28,34 ****
--- 28,58 ----
  mallocctl memalign realloc valloc

+ ENVIRONMENT VARIABLES
+      MTEXCLUSIVE   By default, libmtmalloc allocates 2*NCPUS
+                    buckets from which allocations occur.
+                    Threads share buckets based on their thread
+                    id. If MTEXCLUSIVE is invoked, then 4*NCPUS
+                    buckets are used. Threads with thread id less
+                    than 2*NCPUS receive an exclusive bucket and
+                    thus do not need to use locks. Allocation
+                    performance for these buckets may be dramatically
+                    increased. Once enabled, MTEXCLUSIVE cannot be
+                    disabled. This feature can be enabled by
+                    setting the environment variable MTEXCLUSIVE to
+                    anything. Alternatively it can be enabled by
+                    a call to mallocctl (see mallocctl(3MALLOC)).
+      MTMAXCACHE    By default, allocations less than 2^^16 bytes
+                    are allocated from buckets indexed by thread id.
+                    Using this environment variable, the size of the
+                    cached allocations can be increased to 2^^17,
+                    2^^18, 2^^19, 2^^20, or 2^^21 by
+                    setting MTMAXCACHE to 17, 18, 19, 20, or 21.
+                    If MTMAXCACHE is set to less than 16 it is
+                    reset to 16. If MTMAXCACHE is set to more than
+                    21, then it is reset to 21. This all occurs
+                    silently.

  FILES
  /usr/lib/libmtmalloc.so.1

*** mallocctl.man       Thu Jun  3 15:37:18 2010
--- new_mallocctl.man   Thu Jun  3 15:45:41 2010
***************
*** 164,170 ****
--- 164,183 ----
  256. The default value is 9. This value is multiplied by 8192.

+      MTEXCLUSIVE   By default, libmtmalloc allocates 2*NCPUS
+                    buckets from which allocations occur.
+                    Threads share buckets based on their thread
+                    id. If MTEXCLUSIVE is invoked, then 4*NCPUS
+                    buckets are used. Threads with thread id less
+                    than 2*NCPUS receive an exclusive bucket and
+                    thus do not need to use locks. Allocation
+                    performance for these buckets may be dramatically
+                    increased. Once enabled, MTEXCLUSIVE cannot be
+                    disabled. This feature can also be enabled by
+                    setting the environment variable MTEXCLUSIVE to
+                    anything.
+
  RETURN VALUES
  If there is no available memory, malloc(), realloc(),
  memalign(), and valloc() return a null pointer. When real-
***************
*** 224,230 ****
  brk(2), getrlimit(2), bsdmalloc(3MALLOC), dlopen(3C),
  malloc(3C), malloc(3MALLOC), mapmalloc(3MALLOC),
  signal.h(3HEAD), umem_alloc(3MALLOC), watchmalloc(3MALLOC),
! attributes(5)

  WARNINGS
  Undefined results will occur if the size requested for a
--- 237,243 ----
  brk(2), getrlimit(2), bsdmalloc(3MALLOC), dlopen(3C),
  malloc(3C), malloc(3MALLOC), mapmalloc(3MALLOC),
  signal.h(3HEAD), umem_alloc(3MALLOC), watchmalloc(3MALLOC),
! libmtmalloc(3LIB), attributes(5)

  WARNINGS
  Undefined results will occur if the size requested for a
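
Illustration of the MTMAXCACHE rule:

The MTMAXCACHE behaviour described in the proposal and in the
libmtmalloc man page text (values 16 to 21, silently reset to the
nearer bound otherwise) could be sketched in C roughly as follows.
This is a minimal sketch, not the libmtmalloc source: the names
mt_parse_maxcache_shift, mt_max_cached_bytes, MT_MIN_SHIFT, and
MT_MAX_SHIFT are hypothetical, and treating text that is not a number
as 0 (and therefore as 16) is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define	MT_MIN_SHIFT	16	/* default: requests below 2^16 bytes are cached */
#define	MT_MAX_SHIFT	21	/* upper bound: 2^21 bytes (2 MB) */

/*
 * Hypothetical helper: convert the text of MTMAXCACHE into a shift
 * value. NULL (variable not set) yields the default of 16. Any other
 * text is parsed as a decimal integer and silently clamped to the
 * range 16..21. Non-numeric text parses as 0 and so becomes 16; that
 * choice is an assumption of this sketch.
 */
int
mt_parse_maxcache_shift(const char *text)
{
	long v;

	if (text == NULL)
		return (MT_MIN_SHIFT);

	v = strtol(text, NULL, 10);
	if (v < MT_MIN_SHIFT)
		return (MT_MIN_SHIFT);
	if (v > MT_MAX_SHIFT)
		return (MT_MAX_SHIFT);
	return ((int)v);
}

/*
 * Hypothetical helper: the cache threshold in bytes, 2^shift, for a
 * shift that has already been clamped.
 */
size_t
mt_max_cached_bytes(int shift)
{
	assert(shift >= MT_MIN_SHIFT && shift <= MT_MAX_SHIFT);
	return ((size_t)1 << shift);
}
```

A caller would pass getenv("MTMAXCACHE") to mt_parse_maxcache_shift()
once during initialization, before the caches are set up, which is the
ordering constraint the proposal gives for using an environment
variable rather than a mallocctl option.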