Googling around for copy_user_generic_unrolled() - a kernel-space function seen in my previous smbd profiling - I found what might be a clue to the performance drop. It is a comment at line 31 (see below) that says:
    31	/*
    32	 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
    33	 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
    34	 * Otherwise, use copy_user_generic_unrolled.
    35	 */

Which makes me guess that on my Atom D2700 (http://ark.intel.com/products/59683/Intel-Atom-Processor-D2700-1M-Cache-2_13-GHz) neither REP_GOOD nor ERMS is in effect. It is not clear to me whether the processor actually supports those features, but apparently it does (looking at /proc/cpuinfo from another user's NAS - http://www.foxnetwork.ru/index.php/en/component/content/article/121-thecus-n4800eco.html ).

__________________________________________________________________________

linux/arch/x86/include/asm/uaccess_64.h

     1	#ifndef _ASM_X86_UACCESS_64_H
     2	#define _ASM_X86_UACCESS_64_H
     3	
     4	/*
     5	 * User space memory access functions
     6	 */
     7	#include <linux/compiler.h>
     8	#include <linux/errno.h>
     9	#include <linux/lockdep.h>
    10	#include <asm/alternative.h>
    11	#include <asm/cpufeature.h>
    12	#include <asm/page.h>
    13	
    14	/*
    15	 * Copy To/From Userspace
    16	 */
    17	
    18	/* Handles exceptions in both to and from, but doesn't do access_ok */
    19	__must_check unsigned long
    20	copy_user_enhanced_fast_string(void *to, const void *from, unsigned len);
    21	__must_check unsigned long
    22	copy_user_generic_string(void *to, const void *from, unsigned len);
    23	__must_check unsigned long
    24	copy_user_generic_unrolled(void *to, const void *from, unsigned len);
    25	
    26	static __always_inline __must_check unsigned long
    27	copy_user_generic(void *to, const void *from, unsigned len)
    28	{
    29		unsigned ret;
    30	
    31		/*
    32		 * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
    33		 * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
    34		 * Otherwise, use copy_user_generic_unrolled.
    35		 */
    36		alternative_call_2(copy_user_generic_unrolled,
    37				 copy_user_generic_string,
    38				 X86_FEATURE_REP_GOOD,
    39				 copy_user_enhanced_fast_string,
    40				 X86_FEATURE_ERMS,
    41				 ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
    42					     "=d" (len)),
    43				 "1" (to), "2" (from), "3" (len)
    44				 : "memory", "rcx", "r8", "r9", "r10", "r11");
    45		return ret;
    46	}

On Tue, Oct 1, 2013 at 6:04 PM, Thiago Fernandes Crepaldi <togn...@gmail.com> wrote:
> That is funny. Now that I replaced samba 4 and libc-2.13.so with debug
> symbols, the perf profile seems to have changed a bit after the same
> tests!
>
> Events: 54K cycles
> -   3.06%  smbd  [kernel.kallsyms]   [k] copy_user_generic_unrolled
>    - copy_user_generic_unrolled
>         52.63% __read_nocancel
>         36.20% __write_nocancel
>          2.70% __getdents64
>          2.44% __libc_readv
>       +  2.00% do_fcntl
>          0.87% __GI___libc_read
>       +  0.77% __fxstat64
> -   2.02%  smbd  libc-2.13.so        [.] _int_malloc
>    + _int_malloc
> -   1.62%  smbd  [kernel.kallsyms]   [k] kmem_cache_alloc
>    + kmem_cache_alloc
> -   1.22%  smbd  libtalloc.so.2.0.7  [.] _talloc_free
>    + _talloc_free
> -   0.99%  smbd  libtalloc.so.2.0.7  [.] _talloc_free_children_internal.isra.4
>    + _talloc_free_children_internal.isra.4
> -   0.86%  smbd  libc-2.13.so        [.] __memcpy_ssse3
>    + __memcpy_ssse3
> +   0.81%  smbd  [kernel.kallsyms]   [k] kmem_cache_free
> +   0.81%  smbd  libc-2.13.so        [.] _int_free
> +   0.79%  smbd  [kernel.kallsyms]   [k] __kmalloc
> +   0.66%  smbd  libtalloc.so.2.0.7  [.] _talloc_zero
> +   0.63%  smbd  [kernel.kallsyms]   [k] link_path_walk
> +   0.63%  smbd  [kernel.kallsyms]   [k] ext4_htree_store_dirent
> +   0.55%  smbd  libtalloc.so.2.0.7  [.] talloc_alloc_pool
> +   0.55%  smbd  libc-2.13.so        [.] __memset_sse2
> +   0.53%  smbd  libc-2.13.so        [.] malloc
> +   0.53%  smbd  [kernel.kallsyms]   [k] fcntl_setlk
> +   0.52%  smbd  [kernel.kallsyms]   [k] get_page_from_freelist
> +   0.50%  smbd  libtalloc.so.2.0.7  [.] talloc_get_name
> +   0.50%  smbd  [kernel.kallsyms]   [k] tg3_start_xmit
> +   0.48%  smbd  [kernel.kallsyms]   [k] memset
> +   0.47%  smbd  libc-2.13.so        [.] free
> +   0.47%  smbd  [kernel.kallsyms]   [k] _raw_spin_lock
> +   0.45%  smbd  [kernel.kallsyms]   [k] __d_lookup_rcu
> +   0.45%  smbd  libc-2.13.so        [.] __GI___strcmp_ssse3
> +   0.44%  smbd  libtalloc.so.2.0.7  [.] _talloc_get_type_abort
> +   0.43%  smbd  [kernel.kallsyms]   [k] system_call_after_swapgs
> +   0.43%  smbd  [kernel.kallsyms]   [k] ext4_mark_iloc_dirty
> +   0.42%  smbd  libtalloc.so.2.0.7  [.] talloc_is_parent
> +   0.41%  smbd  [kernel.kallsyms]   [k] __alloc_skb
> +   0.41%  smbd  [kernel.kallsyms]   [k] __posix_lock_file
> +   0.40%  smbd  [kernel.kallsyms]   [k] __ext4_get_inode_loc
> +   0.39%  smbd  libc-2.13.so        [.] __strlen_sse2
> +   0.39%  smbd  [kernel.kallsyms]   [k] kfree
> +   0.39%  smbd  [kernel.kallsyms]   [k] tcp_recvmsg
> +   0.38%  smbd  libtalloc.so.2.0.7  [.] talloc_named_const
> +   0.37%  smbd  libtalloc.so.2.0.7  [.] _talloc_array
>
>
> On Mon, Sep 30, 2013 at 6:19 PM, Thiago Fernandes Crepaldi <
> togn...@gmail.com> wrote:
>
>> Agreed. For some strange reason I thought perf would "follow" the newly
>> forked smbd and account for its data too =)
>>
>> Unfortunately, I don't have the libc symbols (at least for today) to see
>> what is going on there, but here is what I got for the child smbd process on
>> the server side. The client side is a Windows 7 virtual machine running
>> NASPT.
>>
>> Could this result mean that most of the performance drop I am
>> experiencing is due to libc?
>> I've never worked with perf before, but I will still try to resolve those
>> raw addresses.
>>
>> Events: 45K cycles
>> -   7.37%  smbd  libc-2.13.so        [.] 0x11e465
>>    - 0x7ffab9f2043c
>>         41.73% 0
>>          5.32% 0x1b3fbe0
>>          5.29% 0x2c4dab0
>>          3.60% 0x1b0b130
>>          3.37% 0x1b0b2a0
>>          2.94% 0x1b5af80
>>          2.70% 0x1b0d850
>>          2.64% 0x2825fb0
>>          1.86% 0x28e06d0
>>          1.83% 0x2afcc80
>>          1.71% 0x1b2ccb0
>>          1.64% 0x2a4deb0
>>          1.63% 0x1b56e00
>>          1.51% 0x1b6bd00
>>          1.16% 0x1b49eb0
>>          1.15% 0x1b506e0
>>          1.13% 0x1b4da00
>>          1.07% 0x1b35100
>>          0.93% 0x1af9050
>>          0.92% 0x2b03680
>>          0.91% 0x2ae21f0
>>          0.90% 0x1b21210
>>          0.89% 0x1b5de80
>>          0.89% 0x1b5aa80
>>          0.89% 0x1b2e0e0
>>          0.88% 0x1b59be0
>>          0.87% 0x1b4c600
>>          0.86% 0x1b2aa20
>>          0.85% 0x1b4a940
>>          0.85% 0x1b45f50
>>          0.84% 0x1b4a6d0
>>          0.84% 0x1b23940
>>          0.82% 0x1b37210
>>          0.82% 0x1b2cf30
>>          0.82% 0x1b33320
>>          0.77% 0x2c96d50
>>          0.76% 0x202f380
>>          0.75% 0x2bd0bd0
>>          0.66% 0x1b5e1d0
>>    - 0x7ffab9f27e10
>>         37.72% 0x2f62696c2f3365
>>       + 23.78% 0
>>       + 11.24% 0x7fffc9f76d40
>>       +  6.25% set_unix_security_ctx
>>          3.13% 0x645f6e656b6f74
>>          2.46% 0x1000900000000
>>       +  2.17% 0x11b9f22aac
>>          2.16% 0x1b53000
>>       +  2.12% 0x2a29850
>>          2.08% 0xbe70f000004c4c
>>          2.01% 0x1b0af00
>>          1.94% 0x1b07390
>>          1.51% 0x1b49b00
>>          1.41% 0x2010
>>    - 0x7ffab9fc6c10
>>       + 18.08% 0
>>       + 13.63% 0x2c5fc20
>>       + 11.62% 0x2be7b10
>>       +  7.90% 0x2be8560
>>       +  6.61% 0x2a29850
>>       +  6.30% 0x2b3d6c0
>>          5.67% 0x4e6f5479706f43
>>       +  5.64% 0x29d7110
>>       +  5.54% 0x2467130
>>       +  5.53% 0x2b3d5e0
>>       +  5.31% 0x28c81a0
>>       +  4.20% 0x2c5fa30
>>       +  3.98% 0x2a98990
>>    + 0x7ffab9f20438
>>    + 0x7ffab9f2045c
>>      0x7ffab9fc8e03
>>    + 0x7ffab9fc425e
>>    + 0x7ffab9f2a715
>>    + 0x7ffab9f2a6d0
>>      0x7ffab9f1f851
>>      0x7ffab9f1f2ac
>>    + 0x7ffab9f27e25
>>    + 0x7ffab9f2a648
>>    + 0x7ffab9fc4240
>>      0x7ffab9fc8654
>>      0x7ffab9f206bf
>>    + 0x7ffab9f20548
>>    + 0x7ffab9f20bc2
>>    + 0x7ffab9f1f130
>>    + 0x7ffab9f26310
>>    + 0x7ffab9f20422
>>      0x7ffab9f1e0db
>>      0x7ffab9f1f179
>>    + 0x7ffab9f2a6f2
>>    + 0x7ffab9f20572
>>    + 0x7ffab9f2054c
>>    + 0x7ffab9fc42c5
>> -   1.72%  smbd  [kernel.kallsyms]   [k] kmem_cache_alloc
>>    + kmem_cache_alloc
>> -   1.30%  smbd  libtalloc.so.2.0.7  [.] _talloc_free
>>    + _talloc_free
>> -   1.10%  smbd  libtalloc.so.2.0.7  [.] _talloc_free_children_internal.i
>>    + _talloc_free_children_internal.isra.4
>> -   1.07%  smbd  [kernel.kallsyms]   [k] copy_user_generic_unrolled
>>    + copy_user_generic_unrolled
>> -   0.95%  smbd  [kernel.kallsyms]   [k] __kmalloc
>>    + __kmalloc
>> -   0.78%  smbd  [kernel.kallsyms]   [k] ext4_htree_store_dirent
>>    + ext4_htree_store_dirent
>>    + 0x7ffab9f4f2f5
>> -   0.73%  smbd  [kernel.kallsyms]   [k] kmem_cache_free
>>    + kmem_cache_free
>> -   0.73%  smbd  [kernel.kallsyms]   [k] link_path_walk
>>    + link_path_walk
>> -   0.69%  smbd  libc-2.13.so        [.] malloc
>>    + malloc
>> -   0.69%  smbd  libtalloc.so.2.0.7  [.] _talloc_zero
>>    + _talloc_zero
>> -   0.62%  smbd  [kernel.kallsyms]   [k] fcntl_setlk
>>    + fcntl_setlk
>>    + 0x7ffabcf93238
>> -   0.59%  smbd  [kernel.kallsyms]   [k] __d_lookup_rcu
>>    + __d_lookup_rcu
>> -   0.57%  smbd  libtalloc.so.2.0.7  [.] talloc_alloc_pool
>>    + talloc_alloc_pool
>> -   0.55%  smbd  libtalloc.so.2.0.7  [.] talloc_get_name
>>    + talloc_get_name
>> -   0.55%  smbd  [kernel.kallsyms]   [k] __posix_lock_file
>>    + __posix_lock_file
>>    + 0x7ffabcf93238
>> -   0.50%  smbd  [kernel.kallsyms]   [k] _raw_spin_lock
>>    + _raw_spin_lock
>> +   0.49%  smbd  [kernel.kallsyms]   [k] tg3_start_xmit
>> +   0.48%  smbd  [kernel.kallsyms]   [k] system_call_after_swapgs
>> +   0.46%  smbd  libtalloc.so.2.0.7  [.] talloc_named_const
>> +   0.46%  smbd  [kernel.kallsyms]   [k] memset
>> +   0.46%  smbd  libtalloc.so.2.0.7  [.] _talloc_get_type_abort
>> +   0.45%  smbd  [kernel.kallsyms]   [k] str2hashbuf_signed
>> +   0.45%  smbd  [kernel.kallsyms]   [k] kfree
>> +   0.45%  smbd  libc-2.13.so        [.] free
>> +   0.44%  smbd  [kernel.kallsyms]   [k] __alloc_skb
>> +   0.42%  smbd  libtalloc.so.2.0.7  [.] talloc_is_parent
>> +   0.41%  smbd  libtalloc.so.2.0.7  [.] _talloc_array
>>
>>
>> On Mon, Sep 30, 2013 at 5:39 PM, Jeremy Allison <j...@samba.org> wrote:
>>
>>> On Mon, Sep 30, 2013 at 05:21:44PM -0300, Thiago Fernandes Crepaldi
>>> wrote:
>>> > Andrew, in my company we are also experiencing higher CPU usage from
>>> > Samba 4's smbd compared to Samba 3's.
>>> >
>>> > In fact, it almost reaches 100% CPU and uses all the memory during *dir
>>> > copies* (individual file copies are as fast as under Samba 3). I
>>> > strongly believe this CPU usage is responsible for Samba 4's worse
>>> > throughput compared to Samba 3 in our tests.
>>> >
>>> > Given that, I would like to contribute to this investigation and share
>>> > my perf profiling data for smbd (the parent process):
>>> >
>>> > Events: 7 cycles
>>> > -  90.01%  smbd  [kernel.kallsyms]  [k] copy_pte_range
>>> >      copy_pte_range
>>> >      __libc_fork
>>> >      smbd_accept_connection
>>> > -   9.77%  smbd  [kernel.kallsyms]  [k] handle_edge_irq
>>> >      handle_edge_irq
>>> >      smbd_accept_connection
>>> > -   0.22%  smbd  [kernel.kallsyms]  [k] perf_pmu_rotate_start.isra.57
>>> >      perf_pmu_rotate_start.isra.57
>>> >      __poll
>>> > -   0.00%  smbd  [kernel.kallsyms]  [k] native_write_msr_safe
>>> >      native_write_msr_safe
>>> >      __poll
>>>
>>> It's the client process that should have the interesting
>>> profile data, the parent is just going to sit there doing
>>> accept().
>>>
>>> Jeremy.
>>>
>>
>>
>> --
>> Thiago Crepaldi
>>
>
>
> --
> Thiago Crepaldi
>

--
Thiago Crepaldi

--
To unsubscribe from this list go to the following URL and read the
instructions:  https://lists.samba.org/mailman/options/samba
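[Editor's note] The comment quoted at the top of the thread describes a simple three-way selection. As a quick way to sanity-check which copy_user variant a given CPU would get, here is a minimal shell sketch. The `pick_copy_user` helper and the sample `flags` string are illustrative inventions, not kernel code; on a real system you would feed it the flags line from /proc/cpuinfo (e.g. `grep -m1 '^flags' /proc/cpuinfo`), which is exactly what the foxnetwork.ru link above shows for the Thecus N4800.

```shell
# Mirror of the selection order in copy_user_generic(): ERMS wins,
# then rep_good, then the unrolled fallback.
pick_copy_user() {
    # $1 is a space-separated CPU flags string, as in /proc/cpuinfo.
    # Pad with spaces so we match whole flag words only.
    case " $1 " in
        *" erms "*)     echo copy_user_enhanced_fast_string ;;
        *" rep_good "*) echo copy_user_generic_string ;;
        *)              echo copy_user_generic_unrolled ;;
    esac
}

# Made-up sample flags; substitute: flags=$(grep -m1 '^flags' /proc/cpuinfo)
flags="fpu vme de pse tsc msr pae mce cx8 apic"

pick_copy_user "$flags"                  # neither flag -> unrolled variant
pick_copy_user "$flags rep_good"         # rep_good     -> string variant
pick_copy_user "$flags rep_good erms"    # erms wins    -> enhanced variant
```

Note this only tells you what the CPU advertises; the kernel's alternatives patching at boot makes the actual substitution, so a CPU can advertise a flag that an old or misconfigured kernel never takes advantage of.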