Re: Memory allocation performance
Bruce Evans wrote:
> Try profiling it on another type of CPU, to get different performance
> counters but hopefully not very different stalls. If the other CPU
> doesn't stall at all, put another black mark against P4 and delete your
> copies of it :-).

I have tried to profile the same system with the same load on different hardware:
- before: Pentium4 2.8 on an ASUS motherboard based on the i875G chipset,
- now: PentiumD 3.0 on a Supermicro PDSMi board based on the E7230 chipset.

The results are completely different. The problem has gone:

                0.03    0.04   538550/2154375     ip_forward [11]
                0.03    0.04   538562/2154375     em_get_buf [32]
                0.07    0.08  1077100/2154375     ng_package_data [26]
[15]     1.8    0.14    0.15  2154375             uma_zalloc_arg [15]
                0.06    0.00  1077151/3232111     generic_bzero [22]
                0.03    0.00   538555/538555      mb_ctor_mbuf [60]
                0.03    0.00  2154375/4421407     critical_exit [63]

                0.02    0.01   538554/2154376     m_freem [42]
                0.02    0.01   538563/2154376     mb_free_ext [54]
                0.04    0.03  1077100/2154376     ng_free_item [48]
[30]     0.9    0.08    0.06  2154376             uma_zfree_arg [30]
                0.03    0.00  2154376/4421407     critical_exit [63]
                0.00    0.01   538563/538563      mb_dtor_pack [82]
                0.01    0.00  2154376/4421971     critical_enter [69]

So it was probably some hardware-related problem. The first motherboard has video integrated into the chipset without any dedicated memory; possibly it affected memory performance in some way. On the first system there were messages like these on boot:

Mar  3 23:01:20 swamp kernel: acpi0: reservation of 0, a (3) failed
Mar  3 23:01:20 swamp kernel: acpi0: reservation of 10, 3fdf (3) failed
Mar  3 23:01:20 swamp kernel: agp0: controller> on vgapci0
Mar  3 23:01:20 swamp kernel: agp0: detected 892k stolen memory
Mar  3 23:01:20 swamp kernel: agp0: aperture size is 128M

, can they be related?

--
Alexander Motin
Re: Memory allocation performance
Julian Elischer <[EMAIL PROTECTED]> writes:
> Dag-Erling Smørgrav <[EMAIL PROTECTED]> writes:
> > Julian Elischer <[EMAIL PROTECTED]> writes:
> > > you mean FILO or LIFO right?
> > Uh, no. You want to reuse the last-freed object, as it is most
> > likely to still be in cache.
> exactly.. FILO or LIFO (last in First out.)

Clearly, I must have written the above in an acute state of caffeine deprivation. You are perfectly correct.

DES
--
Dag-Erling Smørgrav - [EMAIL PROTECTED]
Re: Memory allocation performance
On Mon, 4 Feb 2008, Alexander Motin wrote:

> Kris Kennaway wrote:
>> You can look at the raw output from pmcstat, which is a collection of
>> instruction pointers that you can feed to e.g. addr2line to find out
>> exactly where in those functions the events are occurring. This will
>> often help to track down the precise causes.
>
> Thanks to the hint, it was interesting hunting, but it showed nothing.
> It hits very simple lines like:
>
> 	bucket = cache->uc_freebucket;
> 	cache->uc_allocs++;
> 	if (zone->uz_ctor != NULL) {
> 	cache->uc_frees++;
>
> and so on. There are no loops, no inlines and no macros. Nothing! And
> the only hint about it is a huge number of "p4-resource-stall"s in
> those lines. I have no idea what exactly it means, why it happens
> mostly here, and how to fight it.

Try profiling it on another type of CPU, to get different performance counters but hopefully not very different stalls. If the other CPU doesn't stall at all, put another black mark against P4 and delete your copies of it :-).

Bruce
Re: Memory allocation performance
Kris Kennaway wrote:
> You can look at the raw output from pmcstat, which is a collection of
> instruction pointers that you can feed to e.g. addr2line to find out
> exactly where in those functions the events are occurring. This will
> often help to track down the precise causes.

Thanks to the hint, it was interesting hunting, but it showed nothing. It hits very simple lines like:

	bucket = cache->uc_freebucket;
	cache->uc_allocs++;
	if (zone->uz_ctor != NULL) {
	cache->uc_frees++;

and so on. There are no loops, no inlines and no macros. Nothing! And the only hint about it is a huge number of "p4-resource-stall"s in those lines. I have no idea what exactly it means, why it happens mostly here, and how to fight it. I would probably have agreed that it might be some profiler fluctuation, but the performance benefits I have got from my self-made caching of UMA calls look very real. :(

Robert Watson wrote:
> There was, FYI, a report a few years ago that there was a measurable
> improvement from allocating off the free bucket rather than maintaining
> separate alloc and free buckets. It sounded good at the time but I was
> never able to reproduce the benefits in my test environment. Now might
> be a good time to try to revalidate that. Basically, the goal would be
> to make the pcpu cache FIFO as much as possible as that maximizes the
> chances that the newly allocated object already has lines in the cache.
> It's a fairly trivial tweak to the UMA allocation code.

I have tried this, but have not found a difference. Maybe it gives some benefits, but not in this situation. In this situation profiling shows delays in the allocator itself, and since the allocator does not touch the data objects themselves, it probably says more about caching of the management structures' memory than about caching of the objects.

I have got one more crazy idea: that the memory containing the zones may have some special hardware or configuration features, like "noncaching" or something similar. That could explain the slowdown in accessing it. But as I can't prove it, it is just one more crazy theory. :(

--
Alexander Motin
Re: Memory allocation performance
Dag-Erling Smørgrav wrote:
> Julian Elischer <[EMAIL PROTECTED]> writes:
>> Robert Watson <[EMAIL PROTECTED]> writes:
>>> be a good time to try to revalidate that. Basically, the goal would
>>> be to make the pcpu cache FIFO as much as possible as that maximizes
>>> the chances that the newly allocated object already has lines in the
>>> cache. It's a fairly trivial tweak to the UMA allocation code.
>> you mean FILO or LIFO right?
>
> Uh, no. You want to reuse the last-freed object, as it is most likely
> to still be in cache.
>
> DES

exactly.. FILO or LIFO (last in First out.)
Re: Memory allocation performance
Julian Elischer <[EMAIL PROTECTED]> writes:
> Robert Watson <[EMAIL PROTECTED]> writes:
> > be a good time to try to revalidate that. Basically, the goal would
> > be to make the pcpu cache FIFO as much as possible as that maximizes
> > the chances that the newly allocated object already has lines in the
> > cache. It's a fairly trivial tweak to the UMA allocation code.
> you mean FILO or LIFO right?

Uh, no. You want to reuse the last-freed object, as it is most likely to still be in cache.

DES
--
Dag-Erling Smørgrav - [EMAIL PROTECTED]
Re: Memory allocation performance
Robert Watson wrote:
> be a good time to try to revalidate that. Basically, the goal would be
> to make the pcpu cache FIFO as much as possible as that maximizes the

you mean FILO or LIFO right?

> chances that the newly allocated object already has lines in the cache.
> It's a fairly trivial tweak to the UMA allocation code.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
Re: Memory allocation performance
On Sun, 3 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> Basically, the goal would be to make the pcpu cache FIFO as much as
>> possible as that maximizes the chances that the newly allocated object
>> already has lines in the cache.
>
> Why FIFO? I think LIFO (stack) should be better for this goal, as the
> last freed object has the best chance of still being present in the
> cache.

Sorry, brain-o -- indeed, as I described, LIFO, rather than as I wrote. :-)

Robert N M Watson
Computer Laboratory
University of Cambridge
Re: Memory allocation performance
Robert Watson wrote:
> Basically, the goal would be to make the pcpu cache FIFO as much as
> possible as that maximizes the chances that the newly allocated object
> already has lines in the cache.

Why FIFO? I think LIFO (stack) should be better for this goal, as the last freed object has the best chance of still being present in the cache.

--
Alexander Motin
Re: Memory allocation performance
On Sat, 2 Feb 2008, 23:05, Alexander Motin wrote:
> Robert Watson wrote:
>> Hence my request for drilling down a bit on profiling -- the question
>> I'm asking is whether profiling shows things running or taking time
>> that shouldn't be.
>
> I have not yet understood why it happens, but hwpmc shows a huge amount
> of "p4-resource-stall"s in UMA functions:
>
>   %   cumulative    self              self     total
>  time   seconds    seconds    calls  ms/call  ms/call  name
>  45.2    2303.00    2303.00        0  100.00%          uma_zfree_arg [1]
>  41.2    4402.00    2099.00        0  100.00%          uma_zalloc_arg [2]
>   1.4    4472.00      70.00        0  100.00%          uma_zone_exhausted_nolock [3]
>   0.9    4520.00      48.00        0  100.00%          ng_snd_item [4]
>   0.8    4562.00      42.00        0  100.00%          __qdivrem [5]
>   0.8    4603.00      41.00        0  100.00%          ether_input [6]
>   0.6    4633.00      30.00        0  100.00%          ng_ppp_prepend [7]
>
> Probably it explains why "p4-global-power-events" shows many hits into
> them
>
>   %   cumulative    self              self     total
>  time   seconds    seconds    calls  ms/call  ms/call  name
>  20.0   37984.00   37984.00        0  100.00%          uma_zfree_arg [1]
>  17.8   71818.00   33834.00        0  100.00%          uma_zalloc_arg [2]
>   4.0   79483.00    7665.00        0  100.00%          ng_snd_item [3]
>   3.0   85256.00    5773.00        0  100.00%          __mcount [4]
>   2.3   89677.00    4421.00        0  100.00%          bcmp [5]
>   2.2   93853.00    4176.00        0  100.00%          generic_bcopy [6]
>
> , while "p4-instr-retired" does not.
>
>   %   cumulative    self              self     total
>  time   seconds    seconds    calls  ms/call  ms/call  name
>  11.1    5351.00    5351.00        0  100.00%          ng_apply_item [1]
>   7.9    9178.00    3827.00        0  100.00%          legacy_pcib_alloc_msi [2]
>   4.1   11182.00    2004.00        0  100.00%          init386 [3]
>   4.0   13108.00    1926.00        0  100.00%          rn_match [4]
>   3.5   14811.00    1703.00        0  100.00%          uma_zalloc_arg [5]
>   2.6   16046.00    1235.00        0  100.00%          SHA256_Transform [6]
>   2.2   17130.00    1084.00        0  100.00%          ng_add_hook [7]
>   2.0   18111.00     981.00        0  100.00%          ng_rmhook_self [8]
>   2.0   19054.00     943.00        0  100.00%          em_encap [9]
>
> So far I have invented two possible explanations. One is that due to
> UMA's cyclic block allocation order it does not fit CPU caches, and
> another is that it is somehow related to critical_exit(), which can
> possibly cause a context switch. Does anybody have a better explanation
> of how such a small and simple (in this part) function can cause such
> results?

I didn't see bzero accounted for in any of the traces in this thread - makes me wonder if that might mean that it's counted within uma_zalloc? Maybe we are calling it twice by accident? I wasn't quite able to figure out the logic of M_ZERO vs. UMA_ZONE_MALLOC etc. ... just a crazy idea.

--
/"\  Best regards,                      | [EMAIL PROTECTED]
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | [EMAIL PROTECTED]
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News
Re: Memory allocation performance
On Sat, 2 Feb 2008, Kris Kennaway wrote:

> Alexander Motin wrote:
>> Robert Watson wrote:
>>> Hence my request for drilling down a bit on profiling -- the question
>>> I'm asking is whether profiling shows things running or taking time
>>> that shouldn't be.
>>
>> I have not yet understood why it happens, but hwpmc shows a huge
>> amount of "p4-resource-stall"s in UMA functions:
>>
>> So far I have invented two possible explanations. One is that due to
>> UMA's cyclic block allocation order it does not fit CPU caches, and
>> another is that it is somehow related to critical_exit(), which can
>> possibly cause a context switch. Does anybody have a better
>> explanation of how such a small and simple (in this part) function
>> can cause such results?
>
> You can look at the raw output from pmcstat, which is a collection of
> instruction pointers that you can feed to e.g. addr2line to find out
> exactly where in those functions the events are occurring. This will
> often help to track down the precise causes.

There was, FYI, a report a few years ago that there was a measurable improvement from allocating off the free bucket rather than maintaining separate alloc and free buckets. It sounded good at the time but I was never able to reproduce the benefits in my test environment. Now might be a good time to try to revalidate that. Basically, the goal would be to make the pcpu cache FIFO as much as possible as that maximizes the chances that the newly allocated object already has lines in the cache. It's a fairly trivial tweak to the UMA allocation code.

Robert N M Watson
Computer Laboratory
University of Cambridge
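For concreteness, the "allocate off the free bucket" tweak described above would sit near the top of the per-CPU cache path in uma_zalloc_arg() and look roughly like this (a sketch only, not actual committed code; untested, and the bucket swap/refill and constructor handling are elided):

	/*
	 * Prefer the free bucket, so the most recently freed item (the one
	 * most likely to still have lines in the CPU cache) is handed out
	 * first instead of an older item from the alloc bucket.
	 */
	bucket = cache->uc_freebucket;
	if (bucket != NULL && bucket->ub_cnt > 0) {
		bucket->ub_cnt--;
		item = bucket->ub_bucket[bucket->ub_cnt];
		bucket->ub_bucket[bucket->ub_cnt] = NULL;
		cache->uc_allocs++;
		/* ... then the usual constructor / M_ZERO handling ... */
	} else {
		/* fall back to uc_allocbucket and the existing slow path */
	}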
Re: Memory allocation performance
Alexander Motin wrote:
> Robert Watson wrote:
>> Hence my request for drilling down a bit on profiling -- the question
>> I'm asking is whether profiling shows things running or taking time
>> that shouldn't be.
>
> I have not yet understood why it happens, but hwpmc shows a huge amount
> of "p4-resource-stall"s in UMA functions:
>
> So far I have invented two possible explanations. One is that due to
> UMA's cyclic block allocation order it does not fit CPU caches, and
> another is that it is somehow related to critical_exit(), which can
> possibly cause a context switch. Does anybody have a better explanation
> of how such a small and simple (in this part) function can cause such
> results?

You can look at the raw output from pmcstat, which is a collection of instruction pointers that you can feed to e.g. addr2line to find out exactly where in those functions the events are occurring. This will often help to track down the precise causes.

Kris
Re: Memory allocation performance
Robert Watson wrote:
> Hence my request for drilling down a bit on profiling -- the question
> I'm asking is whether profiling shows things running or taking time
> that shouldn't be.

I have not yet understood why it happens, but hwpmc shows a huge amount of "p4-resource-stall"s in UMA functions:

  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
 45.2    2303.00    2303.00        0  100.00%          uma_zfree_arg [1]
 41.2    4402.00    2099.00        0  100.00%          uma_zalloc_arg [2]
  1.4    4472.00      70.00        0  100.00%          uma_zone_exhausted_nolock [3]
  0.9    4520.00      48.00        0  100.00%          ng_snd_item [4]
  0.8    4562.00      42.00        0  100.00%          __qdivrem [5]
  0.8    4603.00      41.00        0  100.00%          ether_input [6]
  0.6    4633.00      30.00        0  100.00%          ng_ppp_prepend [7]

Probably it explains why "p4-global-power-events" shows many hits into them

  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
 20.0   37984.00   37984.00        0  100.00%          uma_zfree_arg [1]
 17.8   71818.00   33834.00        0  100.00%          uma_zalloc_arg [2]
  4.0   79483.00    7665.00        0  100.00%          ng_snd_item [3]
  3.0   85256.00    5773.00        0  100.00%          __mcount [4]
  2.3   89677.00    4421.00        0  100.00%          bcmp [5]
  2.2   93853.00    4176.00        0  100.00%          generic_bcopy [6]

, while "p4-instr-retired" does not.

  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
 11.1    5351.00    5351.00        0  100.00%          ng_apply_item [1]
  7.9    9178.00    3827.00        0  100.00%          legacy_pcib_alloc_msi [2]
  4.1   11182.00    2004.00        0  100.00%          init386 [3]
  4.0   13108.00    1926.00        0  100.00%          rn_match [4]
  3.5   14811.00    1703.00        0  100.00%          uma_zalloc_arg [5]
  2.6   16046.00    1235.00        0  100.00%          SHA256_Transform [6]
  2.2   17130.00    1084.00        0  100.00%          ng_add_hook [7]
  2.0   18111.00     981.00        0  100.00%          ng_rmhook_self [8]
  2.0   19054.00     943.00        0  100.00%          em_encap [9]

So far I have invented two possible explanations. One is that due to UMA's cyclic block allocation order it does not fit CPU caches, and another is that it is somehow related to critical_exit(), which can possibly cause a context switch. Does anybody have a better explanation of how such a small and simple (in this part) function can cause such results?

--
Alexander Motin
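For context on the critical_exit() theory: the per-CPU cache in uma_zalloc_arg()/uma_zfree_arg() is accessed inside a critical section, schematically along these lines (a paraphrase of the structure being discussed, not a verbatim excerpt from uma_core.c):

	critical_enter();
	cpu = curcpu;
	cache = &zone->uz_cpu[cpu];
	bucket = cache->uc_allocbucket;
	/* ... take an item off the bucket and bump cache->uc_allocs ... */
	critical_exit();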
Re: Memory allocation performance
On Sat, Feb 02, 2008 at 09:56:42PM +0200, Alexander Motin wrote:
>Peter Jeremy wrote:
>> On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>>> To check UMA dependency I have made a trivial one-element cache which
>>> in my test case allows avoiding two of the four allocations per packet.
>>
>> You should be able to implement this lockless using atomic(9). I
>> haven't verified it, but the following should work.
>
>I have tried this, but man 9 atomic says:
>
>     The atomic_readandclear() functions are not implemented for the types
>     ``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any
>     variants with memory barriers at this time.

Hmmm. This seems to be more a documentation bug than missing code: atomic_readandclear_ptr() seems to be implemented on most architectures (the only one where I can't find it is arm) and is already used in malloc(3).

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.
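Put together, the lockless variant of the one-element item cache discussed above would look roughly like this (a sketch only, not verified; depending on how atomic(9) declares these operations in a given tree, the casts between the pointer and uintptr_t may need adjusting):

	static volatile uintptr_t itemcache;	/* one spare item, or 0 if empty */

	/* allocation path: grab the cached item if there is one */
	item = (item_p)atomic_readandclear_ptr(&itemcache);
	if (item == NULL)
		item = uma_zalloc(ng_qzone, wait | M_ZERO);
	else
		bzero(item, sizeof(*item));

	/* free path: stash the item if the slot is empty, otherwise really free it */
	if (atomic_cmpset_ptr(&itemcache, (uintptr_t)NULL, (uintptr_t)item) == 0)
		uma_zfree(ng_qzone, item);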
Re: Memory allocation performance
Peter Jeremy wrote:
> On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>> To check UMA dependency I have made a trivial one-element cache which
>> in my test case allows avoiding two of the four allocations per packet.
>
> You should be able to implement this lockless using atomic(9). I
> haven't verified it, but the following should work.

I have tried this, but man 9 atomic says:

     The atomic_readandclear() functions are not implemented for the types
     ``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any
     variants with memory barriers at this time.

--
Alexander Motin
Re: Memory allocation performance
On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>To check UMA dependency I have made a trivial one-element cache which in
>my test case allows avoiding two of the four allocations per packet.

You should be able to implement this lockless using atomic(9). I haven't verified it, but the following should work.

>.alloc.
>-	item = uma_zalloc(ng_qzone, wait | M_ZERO);
>+	mtx_lock_spin(&itemcachemtx);
>+	item = itemcache;
>+	itemcache = NULL;
>+	mtx_unlock_spin(&itemcachemtx);

=	item = atomic_readandclear_ptr(&itemcache);

>+	if (item == NULL)
>+		item = uma_zalloc(ng_qzone, wait | M_ZERO);
>+	else
>+		bzero(item, sizeof(*item));
>.free.
>-	uma_zfree(ng_qzone, item);
>+	mtx_lock_spin(&itemcachemtx);
>+	if (itemcache == NULL) {
>+		itemcache = item;
>+		item = NULL;
>+	}
>+	mtx_unlock_spin(&itemcachemtx);
>+	if (item)
>+		uma_zfree(ng_qzone, item);

=	if (atomic_cmpset_ptr(&itemcache, NULL, item) == 0)
=		uma_zfree(ng_qzone, item);

--
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.
Re: Memory allocation performance
> Thanks, I have already found this. The only problem was that by default
> it counts cycles only when both logical cores are active, while one of
> my cores was halted.

Did you try the 'active' event modifier: "p4-global-power-events,active=any"?

> Sampling on this, the profiler showed results close to usual profiling,
> but looking more random:

Adding '-fno-omit-frame-pointer' to CFLAGS may help hwpmc to capture callchains better.

Regards,
Koshy
Re: Memory allocation performance
Joseph Koshy wrote:
> You cannot sample with the TSC since the TSC does not interrupt the
> CPU. For CPU cycles you would probably want to use
> "p4-global-power-events"; see pmc(3).

Thanks, I have already found this. The only problem was that by default it counts cycles only when both logical cores are active, while one of my cores was halted.

Sampling on this, the profiler showed results close to usual profiling, but looking more random:

                 175.97    1.49       1/64      ip_input [49]
                 175.97    1.49       1/64      g_alloc_bio [81]
                 175.97    1.49       1/64      ng_package_data [18]
                1055.81    8.93       6/64      em_handle_rxtx [4]
                2639.53   22.32      15/64      em_get_buf [19]
                3343.41   28.27      19/64      ng_getqblk [17]
                3695.34   31.25      21/64      ip_forward [14]
[9]     21.6   11262.00   95.23      64         uma_zalloc_arg [9]
                  35.45   13.03       5/22      critical_exit [75]
                  26.86    0.00      22/77      critical_enter [99]
                  19.89    0.00      18/19      mb_ctor_mbuf [141]

                  31.87    0.24       4/1324    ng_ether_rcvdata [13]
                  31.87    0.24       4/1324    ip_forward [14]
                  95.60    0.73      12/1324    ng_iface_rcvdata [16]
                 103.57    0.79      13/1324    m_freem [25]
                 876.34    6.71     110/1324    mb_free_ext [30]
                9408.75   72.01    1181/1324    ng_free_item [11]
[10]    20.2   10548.00   80.73    1324         uma_zfree_arg [10]
                  26.86    0.00      22/77      critical_enter [99]
                  15.00   11.59       7/7       mb_dtor_mbuf [134]
                  19.00    6.62       4/4       mb_dtor_pack [136]
                   1.66    0.00       1/32      m_tag_delete_chain [114]

 21.4  11262.00  11262.00     64  175968.75  177456.76  uma_zalloc_arg [9]
 20.1  21810.00  10548.00   1324    7966.77    8027.74  uma_zfree_arg [10]
  5.6  24773.00   2963.00   1591    1862.35    2640.07  ng_snd_item [15]
  3.5  26599.00   1826.00     33   55333.33   55333.33  ng_address_hook [20]
  2.4  27834.00   1235.00    319    3871.47    3871.47  ng_acquire_read [28]

To make the statistics better I need to record sampling data with a smaller period, but too much data creates additional overhead, including disk operations, and breaks the statistics. Is there any way to make it more precise? What sampling parameters should I use for better results?

--
Alexander Motin
Re: Memory allocation performance
> I have tried it for measuring the number of instructions. But I doubt
> that the instruction count is a correct metric for performance
> measurement, as different instructions may have very different
> execution times depending on many factors, like cache misses and
> current memory traffic. I have tried to use tsc to count CPU cycles,
> but got this error:
>
> # pmcstat -n 1 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification
> "tsc": Operation not supported
>
> What have I missed?

You cannot sample with the TSC since the TSC does not interrupt the CPU. For CPU cycles you would probably want to use "p4-global-power-events"; see pmc(3).

Regards,
Koshy
Re: Memory allocation performance
On Sat, 2 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> I guess the question is: where are the cycles going? Are we suffering
>> excessive cache misses in managing the slabs? Are you effectively
>> "cycling through" objects rather than using a smaller set that fits
>> better in the cache?
>
> In my test setup only several objects from the zone are usually
> allocated at the same time, but they are allocated two times per every
> packet. To check the UMA dependency I have made a trivial one-element
> cache which in my test case allows avoiding two of the four allocations
> per packet.

Avoiding unnecessary allocations is a good general principle, but duplicating cache logic is a bad idea. If you're able to structure the below without using locking, it strikes me you'd do much better, especially if it's in a single processing pass. Can you not use a per-thread/stack/session variable to avoid that?

> .alloc.
> -	item = uma_zalloc(ng_qzone, wait | M_ZERO);
> +	mtx_lock_spin(&itemcachemtx);
> +	item = itemcache;
> +	itemcache = NULL;
> +	mtx_unlock_spin(&itemcachemtx);

Why are you using spin locks? They are quite a bit more expensive on several hardware platforms, and any environment it's safe to call uma_zalloc() from will be equally safe to use regular mutexes from (i.e., mutex-sleepable).

> +	if (item == NULL)
> +		item = uma_zalloc(ng_qzone, wait | M_ZERO);
> +	else
> +		bzero(item, sizeof(*item));
> .free.
> -	uma_zfree(ng_qzone, item);
> +	mtx_lock_spin(&itemcachemtx);
> +	if (itemcache == NULL) {
> +		itemcache = item;
> +		item = NULL;
> +	}
> +	mtx_unlock_spin(&itemcachemtx);
> +	if (item)
> +		uma_zfree(ng_qzone, item);
> ...
>
> To be sure that the test system is CPU-bound I have throttled it with
> sysctl to 1044MHz. With this patch my test PPPoE-to-PPPoE router
> throughput has grown from 17 to 21Mbytes/s. The profiling results I
> have sent promised close results.
>
>> Is some bit of debugging enabled that shouldn't be, perhaps due to a
>> failure of ifdefs?
>
> I have commented out all INVARIANTS and WITNESS options from the
> GENERIC kernel config. What else should I check?

Hence my request for drilling down a bit on profiling -- the question I'm asking is whether profiling shows things running or taking time that shouldn't be.

Robert N M Watson
Computer Laboratory
University of Cambridge
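For illustration, replacing the spin mutex with an ordinary (default) mutex, as suggested above, only changes the lock initialization and the lock/unlock calls; roughly (a sketch only, untested; itemcachemtx and itemcache are the variables from the patch above):

	mtx_init(&itemcachemtx, "ng itemcache", NULL, MTX_DEF);  /* MTX_DEF, not MTX_SPIN */
	...
	mtx_lock(&itemcachemtx);
	item = itemcache;
	itemcache = NULL;
	mtx_unlock(&itemcachemtx);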
Re: Memory allocation performance
Robert Watson wrote:
> I guess the question is: where are the cycles going? Are we suffering
> excessive cache misses in managing the slabs? Are you effectively
> "cycling through" objects rather than using a smaller set that fits
> better in the cache?

In my test setup only several objects from the zone are usually allocated at the same time, but they are allocated two times per every packet. To check the UMA dependency I have made a trivial one-element cache which in my test case allows avoiding two of the four allocations per packet.

.alloc.
-	item = uma_zalloc(ng_qzone, wait | M_ZERO);
+	mtx_lock_spin(&itemcachemtx);
+	item = itemcache;
+	itemcache = NULL;
+	mtx_unlock_spin(&itemcachemtx);
+	if (item == NULL)
+		item = uma_zalloc(ng_qzone, wait | M_ZERO);
+	else
+		bzero(item, sizeof(*item));
.free.
-	uma_zfree(ng_qzone, item);
+	mtx_lock_spin(&itemcachemtx);
+	if (itemcache == NULL) {
+		itemcache = item;
+		item = NULL;
+	}
+	mtx_unlock_spin(&itemcachemtx);
+	if (item)
+		uma_zfree(ng_qzone, item);
...

To be sure that the test system is CPU-bound I have throttled it with sysctl to 1044MHz. With this patch my test PPPoE-to-PPPoE router throughput has grown from 17 to 21Mbytes/s. The profiling results I have sent promised close results.

> Is some bit of debugging enabled that shouldn't be, perhaps due to a
> failure of ifdefs?

I have commented out all INVARIANTS and WITNESS options from the GENERIC kernel config. What else should I check?

--
Alexander Motin
Re: Memory allocation performance
On Fri, 1 Feb 2008, Alexander Motin wrote:

> Robert Watson wrote:
>> It would be very helpful if you could try doing some analysis with
>> hwpmc -- "high resolution profiling" is of increasingly limited
>> utility with modern

You mean "of increasingly greater utility with modern CPUs". Low resolution kernel profiling stopped giving enough resolution in about 1990, and has become of increasingly limited utility since then, but high resolution kernel profiling uses the TSC or possibly a performance counter, so it has kept up with CPU speedups. Cache effects and out of order execution are larger now, but they affect all types of profiling and are still not too bad with high resolution kernel profiling. High resolution kernel profiling doesn't really work with SMP, but that is not Alexander's problem since he profiled under UP.

>> CPUs, where even a high frequency timer won't run very often. It's
>> also quite subject to cycle events that align with other timers in
>> the system.

No, it isn't affected by either of these. The TSC timer is incremented on every CPU cycle and the performance counters are incremented on every event. It is completely unaffected by other timers.

> I have tried hwpmc but I am still not completely comfortable with it.
> The whole picture is somewhat similar to kgmon's, but it looks very
> noisy. Is there some "know how" about how to use it better?

hwpmc doesn't work for me either. I can't see how it could work as well as high resolution kernel profiling for events at the single function level, since it is statistics-based. With the statistics clock interrupt rate fairly limited, it just cannot get enough resolution over short runs. Also, it works poorly for me (with a current kernel and ~5.2 userland except for some utilities like pmc*). Generation of profiles stopped working for me, and it often fails with allocation errors.

> I have tried it for measuring the number of instructions. But I doubt
> that the instruction count is a correct metric for performance
> measurement, as different instructions may have very different
> execution times depending on many factors, like cache misses and
> current memory traffic.

Cycle counts are more useful, but high resolution kernel profiling can do this too, with some fixes:

- update perfmon for newer CPUs. It is broken even for Athlons (takes a 2 line fix, or more lines with proper #defines and if()s).

- select the performance counter to be used for profiling using sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the performance counter for the instruction count (or any). I use hwpmc mainly to determine N :-). It may also be necessary to change the kernel variable cpu_clock_pmc_conf. Configuration of this is unfinished.

- use high resolution kernel profiling normally. Note that switching to a perfmon counter is only permitted if !SMP (since it is too unsupported under SMP to do more than panic if permitted under SMP), and that the switch loses the calibration of profiling. Profiling normally compensates for overheads of the profiling itself, and the compensation would work almost perfectly for event counters, unlike for time-related counters, since the extra events for profiling aren't much affected by caches.

> I have tried to use tsc to count CPU cycles, but got this error:
>
> # pmcstat -n 1 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification
> "tsc": Operation not supported
>
> What have I missed?

This might be just because the TSC really is not supported. Many things require an APIC for hwpmc to support them. I get allocation errors like this for operations that work a few times before failing.

> I am now using a Pentium4 Prescott CPU with HTT enabled in the BIOS,
> but the kernel is built without SMP to simplify profiling. What
> counters can you recommend for regular time profiling on it?

Try them all :-). From userland first with an overall count, since looking at the details in gprof output takes too long (and doesn't work for me with hwpmc anyway). I use scripts like the following to try them all from userland:

runpm:
%%%
c="ttcp -n10 -l5 -u -t epsplex"
ctr=0
while test $ctr -lt 256
do
	ctr1=$(printf "0x%02x\n" $ctr)
	case $ctr1 in
	0x00) src=k8-fp-dispatched-fpu-ops;;
	0x01) src=k8-fp-cycles-with-no-fpu-ops-retired;;
	0x02) src=k8-fp-dispatched-fpu-fast-flag-ops;;
	0x05) src=k8-fp-unknown-$ctr1;;
	0x09) src=k8-fp-unknown-$ctr1;;
	0x0d) src=k8-fp-unknown-$ctr1;;
	0x11) src=k8-fp-unknown-$ctr1;;
	0x15) src=k8-fp-unknown-$ctr1;;
	0x19) src=k8-fp-unknown-$ctr1;;
	0x1d) src=k8-fp-unknown-$ctr1;;
	0x20) src=k8-ls-segment-register-load;;	# XXX
	0x21) src=kx-ls-microarchitectural-resync-by-self-mod-code;;
	0x22) src=k8-ls-microarchitectural-resync-by-snoop;;
	0x23) src=kx-ls-buffer2-full;;
	0x24) src=k8-ls-locked-operation;;
Re: Memory allocation performance
Hi.

Robert Watson wrote:
> It would be very helpful if you could try doing some analysis with
> hwpmc -- "high resolution profiling" is of increasingly limited utility
> with modern CPUs, where even a high frequency timer won't run very
> often. It's also quite subject to cycle events that align with other
> timers in the system.

I have tried hwpmc but I am still not completely comfortable with it. The whole picture is somewhat similar to kgmon's, but it looks very noisy. Is there some "know how" about how to use it better?

I have tried it for measuring the number of instructions. But I doubt that the instruction count is a correct metric for performance measurement, as different instructions may have very different execution times depending on many factors, like cache misses and current memory traffic. I have tried to use tsc to count CPU cycles, but got this error:

# pmcstat -n 1 -S "tsc" -O sample.out
pmcstat: ERROR: Cannot allocate system-mode pmc with specification "tsc": Operation not supported

What have I missed?

I am now using a Pentium4 Prescott CPU with HTT enabled in the BIOS, but the kernel is built without SMP to simplify profiling. What counters can you recommend for regular time profiling on it?

Thanks for the reply.

--
Alexander Motin
Re: Memory allocation performance
On Fri, 1 Feb 2008, Alexander Motin wrote:

> That was actually my second question. As there are only 512 items by
> default and they are small in size, I can easily preallocate them all
> on boot. But is it a good way? Why can't UMA do just the same when I
> have created a zone with a specified element size and maximum number of
> objects? What is the principal difference?

Alexander,

I think we should drill down in the analysis a bit and see if we can figure out what's going on with UMA. What UMA essentially does is ask the VM for pages, and then pack objects into pages. It maintains some meta-data, and depending on the relative sizes of objects and pages, it may store it in the page or potentially elsewhere. Either way, it looks very much like an array of struct object. It has a few extra layers of wrapping in order to maintain stats, per-CPU caches, object life cycle, etc. When INVARIANTS is turned off, allocation from the per-CPU cache consists of pulling objects in and out of one of two per-CPU queues.

So I guess the question is: where are the cycles going? Are we suffering excessive cache misses in managing the slabs? Are you effectively "cycling through" objects rather than using a smaller set that fits better in the cache? Is some bit of debugging enabled that shouldn't be, perhaps due to a failure of ifdefs?

BTW, UMA does let you set the size of buckets, so you can try tuning the bucket size. For a start, try setting the zone flag UMA_ZONE_MAXBUCKET.

It would be very helpful if you could try doing some analysis with hwpmc -- "high resolution profiling" is of increasingly limited utility with modern CPUs, where even a high frequency timer won't run very often. It's also quite subject to cycle events that align with other timers in the system.

Robert N M Watson
Computer Laboratory
University of Cambridge
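For concreteness, that bucket-size experiment is a one-flag change at the place where the zone is created; for the netgraph item zone it would look roughly like this (a sketch only; the exact uma_zcreate() arguments and the 'maxalloc' limit variable in ng_base.c may differ from what is shown here):

	ng_qzone = uma_zcreate("NETGRAPH items", sizeof(struct ng_item),
	    NULL, NULL, NULL, NULL, UMA_ALIGN_CACHE, UMA_ZONE_MAXBUCKET);
	uma_zone_set_max(ng_qzone, maxalloc);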
Re: Memory allocation performance
Alexander Motin wrote:
> Kris Kennaway wrote:
>> Alexander Motin wrote:
>>> Alexander Motin wrote:
>>>> While profiling netgraph operation on a UP HEAD router I have found
>>>> that a huge amount of time is spent on memory allocation/deallocation:
>>>
>>> I forgot to mention that it was mostly a GENERIC kernel, just built
>>> without INVARIANTS, WITNESS and SMP but with 'profile 2'.
>>
>> What is 'profile 2'?
>
> I thought it was high-resolution profiling support. Isn't it?

OK. This is not commonly used so I don't know if it works. Try using hwpmc if possible to compare.

When you say that your own allocation routines show less time use under profiling, how do they affect the actual system performance?

Kris
Re: Memory allocation performance
Alexander Motin wrote:
> Julian Elischer wrote:
>> Alexander Motin wrote:
>>> Hi.
>>>
>>> While profiling netgraph operation on a UP HEAD router I have found
>>> that a huge amount of time is spent on memory allocation/deallocation:
>>>
>>>                 0.14    0.05  132119/545292     ip_forward [12]
>>>                 0.14    0.05  133127/545292     fxp_add_rfabuf [18]
>>>                 0.27    0.10  266236/545292     ng_package_data [17]
>>> [9]     14.1    0.56    0.21  545292            uma_zalloc_arg [9]
>>>                 0.17    0.00  545292/1733401    critical_exit [98]
>>>                 0.01    0.00  275941/679675     generic_bzero [68]
>>>                 0.01    0.00  133127/133127     mb_ctor_pack [103]
>>>
>>>                 0.15    0.06  133100/545266     mb_free_ext [22]
>>>                 0.15    0.06  133121/545266     m_freem [15]
>>>                 0.29    0.11  266236/545266     ng_free_item [16]
>>> [8]     15.2    0.60    0.23  545266            uma_zfree_arg [8]
>>>                 0.17    0.00  545266/1733401    critical_exit [98]
>>>                 0.00    0.04  133100/133100     mb_dtor_pack [57]
>>>                 0.00    0.00  134121/134121     mb_dtor_mbuf [111]
>>>
>>> I have already optimized all possible allocation calls, and those
>>> that are left are practically unavoidable. But even after this, kgmon
>>> says that 30% of CPU time is consumed by memory management. So I have
>>> some questions:
>>> 1) Is it a real situation or just a profiler mistake?
>>> 2) If it is real, then why is UMA so slow? I have tried to replace it
>>> in some places with a preallocated TAILQ of required memory blocks
>>> protected by a mutex, and according to the profiler I have got _much_
>>> better results. Would it be good practice to replace relatively small
>>> UMA zones with a preallocated queue to avoid part of the UMA calls?
>>> 3) I have seen that UMA does some kind of CPU cache affinity, but
>>> does it cost so much that it takes 30% of CPU time on a UP router?
>>
>> given this information, I would add an 'item cache' in ng_base.c
>> (hmm do I already have one?)
>
> That was actually my second question. As there are only 512 items by
> default and they are small in size, I can easily preallocate them all
> on boot. But is it a good way? Why can't UMA do just the same when I
> have created a zone with a specified element size and maximum number of
> objects? What is the principal difference?

who knows what uma does.. but if you do it yourself you know what the overhead is.. :-)
Re: Memory allocation performance
Julian Elischer wrote:
> Alexander Motin wrote:
>> Hi.
>>
>> While profiling netgraph operation on a UP HEAD router I have found
>> that a huge amount of time is spent on memory allocation/deallocation:
>>
>>                 0.14    0.05  132119/545292     ip_forward [12]
>>                 0.14    0.05  133127/545292     fxp_add_rfabuf [18]
>>                 0.27    0.10  266236/545292     ng_package_data [17]
>> [9]     14.1    0.56    0.21  545292            uma_zalloc_arg [9]
>>                 0.17    0.00  545292/1733401    critical_exit [98]
>>                 0.01    0.00  275941/679675     generic_bzero [68]
>>                 0.01    0.00  133127/133127     mb_ctor_pack [103]
>>
>>                 0.15    0.06  133100/545266     mb_free_ext [22]
>>                 0.15    0.06  133121/545266     m_freem [15]
>>                 0.29    0.11  266236/545266     ng_free_item [16]
>> [8]     15.2    0.60    0.23  545266            uma_zfree_arg [8]
>>                 0.17    0.00  545266/1733401    critical_exit [98]
>>                 0.00    0.04  133100/133100     mb_dtor_pack [57]
>>                 0.00    0.00  134121/134121     mb_dtor_mbuf [111]
>>
>> I have already optimized all possible allocation calls, and those that
>> are left are practically unavoidable. But even after this, kgmon says
>> that 30% of CPU time is consumed by memory management. So I have some
>> questions:
>> 1) Is it a real situation or just a profiler mistake?
>> 2) If it is real, then why is UMA so slow? I have tried to replace it
>> in some places with a preallocated TAILQ of required memory blocks
>> protected by a mutex, and according to the profiler I have got _much_
>> better results. Would it be good practice to replace relatively small
>> UMA zones with a preallocated queue to avoid part of the UMA calls?
>> 3) I have seen that UMA does some kind of CPU cache affinity, but does
>> it cost so much that it takes 30% of CPU time on a UP router?
>
> given this information, I would add an 'item cache' in ng_base.c
> (hmm do I already have one?)

That was actually my second question. As there are only 512 items by default and they are small in size, I can easily preallocate them all on boot. But is it a good way? Why can't UMA do just the same when I have created a zone with a specified element size and maximum number of objects? What is the principal difference?

--
Alexander Motin
Re: Memory allocation performance
Kris Kennaway wrote:
> Alexander Motin wrote:
>> Alexander Motin wrote:
>>> While profiling netgraph operation on a UP HEAD router I have found
>>> that a huge amount of time is spent on memory allocation/deallocation:
>>
>> I forgot to mention that it was mostly a GENERIC kernel, just built
>> without INVARIANTS, WITNESS and SMP but with 'profile 2'.
>
> What is 'profile 2'?

I thought it was high-resolution profiling support. Isn't it?

--
Alexander Motin
Re: Memory allocation performance
Alexander Motin wrote:
> Alexander Motin wrote:
>> While profiling netgraph operation on a UP HEAD router I have found
>> that a huge amount of time is spent on memory allocation/deallocation:
>
> I forgot to mention that it was mostly a GENERIC kernel, just built
> without INVARIANTS, WITNESS and SMP but with 'profile 2'.

What is 'profile 2'?

Kris
Re: Memory allocation performance
Alexander Motin wrote:
> Hi.
>
> While profiling netgraph operation on a UP HEAD router I have found
> that a huge amount of time is spent on memory allocation/deallocation:
>
>                 0.14    0.05  132119/545292     ip_forward [12]
>                 0.14    0.05  133127/545292     fxp_add_rfabuf [18]
>                 0.27    0.10  266236/545292     ng_package_data [17]
> [9]     14.1    0.56    0.21  545292            uma_zalloc_arg [9]
>                 0.17    0.00  545292/1733401    critical_exit [98]
>                 0.01    0.00  275941/679675     generic_bzero [68]
>                 0.01    0.00  133127/133127     mb_ctor_pack [103]
>
>                 0.15    0.06  133100/545266     mb_free_ext [22]
>                 0.15    0.06  133121/545266     m_freem [15]
>                 0.29    0.11  266236/545266     ng_free_item [16]
> [8]     15.2    0.60    0.23  545266            uma_zfree_arg [8]
>                 0.17    0.00  545266/1733401    critical_exit [98]
>                 0.00    0.04  133100/133100     mb_dtor_pack [57]
>                 0.00    0.00  134121/134121     mb_dtor_mbuf [111]
>
> I have already optimized all possible allocation calls, and those that
> are left are practically unavoidable. But even after this, kgmon says
> that 30% of CPU time is consumed by memory management. So I have some
> questions:
> 1) Is it a real situation or just a profiler mistake?
> 2) If it is real, then why is UMA so slow? I have tried to replace it
> in some places with a preallocated TAILQ of required memory blocks
> protected by a mutex, and according to the profiler I have got _much_
> better results. Would it be good practice to replace relatively small
> UMA zones with a preallocated queue to avoid part of the UMA calls?
> 3) I have seen that UMA does some kind of CPU cache affinity, but does
> it cost so much that it takes 30% of CPU time on a UP router?

given this information, I would add an 'item cache' in ng_base.c (hmm do I already have one?)

> Thanks!
Re: Memory allocation performance
Alexander Motin wrote:
> Hi.
>
> While profiling netgraph operation on a UP HEAD router I have found
> that a huge amount of time is spent on memory allocation/deallocation:
>
>                 0.14    0.05  132119/545292     ip_forward [12]
>                 0.14    0.05  133127/545292     fxp_add_rfabuf [18]
>                 0.27    0.10  266236/545292     ng_package_data [17]
> [9]     14.1    0.56    0.21  545292            uma_zalloc_arg [9]
>                 0.17    0.00  545292/1733401    critical_exit [98]
>                 0.01    0.00  275941/679675     generic_bzero [68]
>                 0.01    0.00  133127/133127     mb_ctor_pack [103]
>
>                 0.15    0.06  133100/545266     mb_free_ext [22]
>                 0.15    0.06  133121/545266     m_freem [15]
>                 0.29    0.11  266236/545266     ng_free_item [16]
> [8]     15.2    0.60    0.23  545266            uma_zfree_arg [8]
>                 0.17    0.00  545266/1733401    critical_exit [98]
>                 0.00    0.04  133100/133100     mb_dtor_pack [57]
>                 0.00    0.00  134121/134121     mb_dtor_mbuf [111]
>
> I have already optimized all possible allocation calls, and those that
> are left are practically unavoidable. But even after this, kgmon says
> that 30% of CPU time is consumed by memory management. So I have some
> questions:
> 1) Is it a real situation or just a profiler mistake?
> 2) If it is real, then why is UMA so slow? I have tried to replace it
> in some places with a preallocated TAILQ of required memory blocks
> protected by a mutex, and according to the profiler I have got _much_
> better results. Would it be good practice to replace relatively small
> UMA zones with a preallocated queue to avoid part of the UMA calls?
> 3) I have seen that UMA does some kind of CPU cache affinity, but does
> it cost so much that it takes 30% of CPU time on a UP router?

Make sure you have INVARIANTS disabled, it has a high performance cost in UMA.

Kris
Re: Memory allocation performance
Alexander Motin wrote:
> While profiling netgraph operation on a UP HEAD router I have found
> that a huge amount of time is spent on memory allocation/deallocation:

I forgot to mention that it was mostly a GENERIC kernel, just built without INVARIANTS, WITNESS and SMP but with 'profile 2'.

--
Alexander Motin
Memory allocation performance
Hi.

While profiling netgraph operation on a UP HEAD router I have found that a huge amount of time is spent on memory allocation/deallocation:

                0.14    0.05  132119/545292     ip_forward [12]
                0.14    0.05  133127/545292     fxp_add_rfabuf [18]
                0.27    0.10  266236/545292     ng_package_data [17]
[9]     14.1    0.56    0.21  545292            uma_zalloc_arg [9]
                0.17    0.00  545292/1733401    critical_exit [98]
                0.01    0.00  275941/679675     generic_bzero [68]
                0.01    0.00  133127/133127     mb_ctor_pack [103]

                0.15    0.06  133100/545266     mb_free_ext [22]
                0.15    0.06  133121/545266     m_freem [15]
                0.29    0.11  266236/545266     ng_free_item [16]
[8]     15.2    0.60    0.23  545266            uma_zfree_arg [8]
                0.17    0.00  545266/1733401    critical_exit [98]
                0.00    0.04  133100/133100     mb_dtor_pack [57]
                0.00    0.00  134121/134121     mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that are left are practically unavoidable. But even after this, kgmon says that 30% of CPU time is consumed by memory management. So I have some questions:

1) Is it a real situation or just a profiler mistake?

2) If it is real, then why is UMA so slow? I have tried to replace it in some places with a preallocated TAILQ of required memory blocks protected by a mutex, and according to the profiler I have got _much_ better results. Would it be good practice to replace relatively small UMA zones with a preallocated queue to avoid part of the UMA calls?

3) I have seen that UMA does some kind of CPU cache affinity, but does it cost so much that it takes 30% of CPU time on a UP router?

Thanks!

--
Alexander Motin
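For reference, the "preallocated TAILQ of required memory blocks protected by a mutex" mentioned in question 2 can be sketched roughly like this (illustrative only, not the actual code used in these tests; the names and the M_TEMP malloc type are placeholders):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/queue.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>

struct qent {
	TAILQ_ENTRY(qent)	link;
	/* ... the object itself lives here ... */
};

static TAILQ_HEAD(, qent)	freeq = TAILQ_HEAD_INITIALIZER(freeq);
static struct mtx		freeq_mtx;

/* Boot-time setup: preallocate a fixed number of entries. */
static void
freeq_init(int nitems)
{
	struct qent *qe;
	int i;

	mtx_init(&freeq_mtx, "freeq", NULL, MTX_DEF);
	for (i = 0; i < nitems; i++) {
		qe = malloc(sizeof(*qe), M_TEMP, M_WAITOK | M_ZERO);
		TAILQ_INSERT_HEAD(&freeq, qe, link);
	}
}

/* Allocation: pop an entry off the queue, NULL if it is empty. */
static struct qent *
freeq_get(void)
{
	struct qent *qe;

	mtx_lock(&freeq_mtx);
	qe = TAILQ_FIRST(&freeq);
	if (qe != NULL)
		TAILQ_REMOVE(&freeq, qe, link);
	mtx_unlock(&freeq_mtx);
	return (qe);
}

/* Free: push the entry back at the head, so the most recently used one is reused first. */
static void
freeq_put(struct qent *qe)
{
	mtx_lock(&freeq_mtx);
	TAILQ_INSERT_HEAD(&freeq, qe, link);
	mtx_unlock(&freeq_mtx);
}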