Re: Memory allocation performance

2008-03-05 Thread Alexander Motin

Bruce Evans wrote:

Try profiling it on another type of CPU, to get different performance
counters but hopefully not very different stalls.  If the other CPU doesn't
stall at all, put another black mark against P4 and delete your copies of
it :-).


I have tried to profile the same system with the same load on different 
hardware:

 - was Pentium4 2.8 at ASUS MB based on i875G chipset,
 - now PentiumD 3.0 at Supermicro PDSMi board based on E7230 chipset.

The results are completely different. The problem has gone:
0.03 0.04  538550/2154375 ip_forward  [11]
0.03 0.04  538562/2154375 em_get_buf [32]
0.07 0.08 1077100/2154375 ng_package_data [26]
[15]    1.8    0.14    0.15  2154375         uma_zalloc_arg [15]
0.06 0.00 1077151/3232111 generic_bzero [22]
0.03 0.00  538555/538555  mb_ctor_mbuf [60]
0.03 0.00 2154375/4421407 critical_exit  [63]

0.02 0.01  538554/2154376 m_freem [42]
0.02 0.01  538563/2154376 mb_free_ext [54]
0.04 0.03 1077100/2154376 ng_free_item [48]
[30]    0.9    0.08    0.06  2154376         uma_zfree_arg [30]
0.03 0.00 2154376/4421407 critical_exit  [63]
0.00 0.01  538563/538563  mb_dtor_pack [82]
0.01 0.00 2154376/4421971 critical_enter [69]

So it was probably some hardware-related problem. The first motherboard has
video integrated into the chipset without any dedicated memory, which may
have affected memory performance in some way. On the first system there
were messages like these on boot:

Mar  3 23:01:20 swamp kernel: acpi0: reservation of 0, a (3) failed
Mar  3 23:01:20 swamp kernel: acpi0: reservation of 10, 3fdf (3) failed
Mar  3 23:01:20 swamp kernel: agp0: controller> on vgapci0

Mar  3 23:01:20 swamp kernel: agp0: detected 892k stolen memory
Mar  3 23:01:20 swamp kernel: agp0: aperture size is 128M
Could they be related?

--
Alexander Motin


Re: Memory allocation performance

2008-02-04 Thread Dag-Erling Smørgrav
Julian Elischer <[EMAIL PROTECTED]> writes:
> Dag-Erling Smørgrav <[EMAIL PROTECTED]> writes:
> > Julian Elischer <[EMAIL PROTECTED]> writes:
> > > you mean FILO or LIFO right?
> > Uh, no.  You want to reuse the last-freed object, as it is most
> > likely to still be in cache.
> exactly.. FILO or LIFO (last in First out.)

Clearly, I must have written the above in an acute state of caffeine
deprivation.  You are perfectly correct.

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]


Re: Memory allocation performance

2008-02-03 Thread Bruce Evans

On Mon, 4 Feb 2008, Alexander Motin wrote:


Kris Kennaway wrote:
You can look at the raw output from pmcstat, which is a collection of 
instruction pointers that you can feed to e.g. addr2line to find out 
exactly where in those functions the events are occurring.  This will often 
help to track down the precise causes.


Thanks for the hint; it made for interesting hunting, but it showed nothing.
The events hit very simple lines like:

bucket = cache->uc_freebucket;
cache->uc_allocs++;
if (zone->uz_ctor != NULL) {
cache->uc_frees++;
and so on.
There are no loops, no inlines, no macros. Nothing! And the only hint is the
huge number of "p4-resource-stall" events on those lines. I have no idea what
exactly that means, why it happens mostly here, or how to fight it.


Try profiling it on another type of CPU, to get different performance
counters but hopefully not very different stalls.  If the other CPU doesn't
stall at all, put another black mark against P4 and delete your copies of
it :-).

Bruce


Re: Memory allocation performance

2008-02-03 Thread Alexander Motin

Kris Kennaway wrote:
You can look at the raw output from pmcstat, which is a collection of 
instruction pointers that you can feed to e.g. addr2line to find out 
exactly where in those functions the events are occurring.  This will 
often help to track down the precise causes.


Thanks for the hint; it made for interesting hunting, but it showed nothing.
The events hit very simple lines like:

bucket = cache->uc_freebucket;
cache->uc_allocs++;
if (zone->uz_ctor != NULL) {
cache->uc_frees++;
and so on.
There are no loops, no inlines, no macros. Nothing! And the only hint is the
huge number of "p4-resource-stall" events on those lines. I have no idea what
exactly that means, why it happens mostly here, or how to fight it.


I would probably agree that it might be some profiler fluctuation, but the
performance benefits I got from my self-made caching of UMA calls look
very real. :(


Robert Watson wrote:
> There was, FYI, a report a few years ago that there was a measurable
> improvement from allocating off the free bucket rather than maintaining
> separate alloc and free buckets.  It sounded good at the time but I was
> never able to reproduce the benefits in my test environment.  Now might
> be a good time to try to revalidate that.  Basically, the goal would be
> to make the pcpu cache FIFO as much as possible as that maximizes the
> chances that the newly allocated object already has lines in the cache.
> It's a fairly trivial tweak to the UMA allocation code.

I have tried this, but have not found a difference. Maybe it gives some
benefit, but not in this situation. Here profiling shows delays in the
allocator itself, and since the allocator does not touch the data objects,
this probably says more about caching of the management structures than
about caching of the objects themselves.


I have one more crazy idea: the memory containing the zones may have some
special hardware or configuration attribute, like "noncaching" or something
similar. That could explain the slowdown in accessing it. But as I can't
prove it, it is just one more crazy theory. :(


--
Alexander Motin


Re: Memory allocation performance

2008-02-03 Thread Julian Elischer

Dag-Erling Smørgrav wrote:

Julian Elischer <[EMAIL PROTECTED]> writes:

Robert Watson <[EMAIL PROTECTED]> writes:

be a good time to try to revalidate that.  Basically, the goal would
be to make the pcpu cache FIFO as much as possible as that maximizes
the chances that the newly allocated object already has lines in the
cache.  It's a fairly trivial tweak to the UMA allocation code.

you mean FILO or LIFO right?


Uh, no.  You want to reuse the last-freed object, as it is most likely
to still be in cache.

DES



exactly.. FILO or LIFO (last in First out.)



Re: Memory allocation performance

2008-02-03 Thread Dag-Erling Smørgrav
Julian Elischer <[EMAIL PROTECTED]> writes:
> Robert Watson <[EMAIL PROTECTED]> writes:
> > be a good time to try to revalidate that.  Basically, the goal would
> > be to make the pcpu cache FIFO as much as possible as that maximizes
> > the chances that the newly allocated object already has lines in the
> > cache.  It's a fairly trivial tweak to the UMA allocation code.
> you mean FILO or LIFO right?

Uh, no.  You want to reuse the last-freed object, as it is most likely
to still be in cache.

DES
-- 
Dag-Erling Smørgrav - [EMAIL PROTECTED]


Re: Memory allocation performance

2008-02-02 Thread Julian Elischer

Robert Watson wrote:

be a good time to try to revalidate that.  Basically, the goal would be 
to make the pcpu cache FIFO as much as possible as that maximizes the 


you mean FILO or LIFO right?

chances that the newly allocated object already has lines in the cache.  
It's a fairly trivial tweak to the UMA allocation code.


Robert N M Watson
Computer Laboratory
University of Cambridge




Re: Memory allocation performance

2008-02-02 Thread Robert Watson

On Sun, 3 Feb 2008, Alexander Motin wrote:


Robert Watson wrote:
Basically, the goal would be to make the pcpu cache FIFO as much as 
possible as that maximizes the chances that the newly allocated object 
already has lines in the cache.


Why FIFO? I think LIFO (a stack) should be better for this goal, as the last
freed object has a better chance of still being present in the cache.


Sorry, brain-o -- indeed, as I described: LIFO, rather than what I wrote. :-)

Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Robert Watson wrote:
Basically, the goal would be 
to make the pcpu cache FIFO as much as possible as that maximizes the 
chances that the newly allocated object already has lines in the cache.  


Why FIFO? I think LIFO (a stack) should be better for this goal, as the
last freed object has a better chance of still being present in the cache.


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Max Laier

On Sat, 2 Feb 2008, 23:05, Alexander Motin wrote:
> Robert Watson wrote:
>> Hence my request for drilling down a bit on profiling -- the question
>> I'm asking is whether profiling shows things running or taking time that
>> shouldn't be.
>
> I have not yet understood why it happens, but hwpmc shows a huge
> amount of "p4-resource-stall"s in UMA functions:
>   %   cumulative    self              self     total
>  time   seconds    seconds    calls  ms/call  ms/call  name
>  45.2    2303.00   2303.00        0  100.00%   uma_zfree_arg [1]
>  41.2    4402.00   2099.00        0  100.00%   uma_zalloc_arg [2]
>   1.4    4472.00     70.00        0  100.00%   uma_zone_exhausted_nolock [3]
>   0.9    4520.00     48.00        0  100.00%   ng_snd_item [4]
>   0.8    4562.00     42.00        0  100.00%   __qdivrem [5]
>   0.8    4603.00     41.00        0  100.00%   ether_input [6]
>   0.6    4633.00     30.00        0  100.00%   ng_ppp_prepend [7]
>
> This probably explains why "p4-global-power-events" shows many hits in
> them
>   %   cumulative    self              self     total
>  time   seconds    seconds    calls  ms/call  ms/call  name
>  20.0   37984.00  37984.00        0  100.00%   uma_zfree_arg [1]
>  17.8   71818.00  33834.00        0  100.00%   uma_zalloc_arg [2]
>   4.0   79483.00   7665.00        0  100.00%   ng_snd_item [3]
>   3.0   85256.00   5773.00        0  100.00%   __mcount [4]
>   2.3   89677.00   4421.00        0  100.00%   bcmp [5]
>   2.2   93853.00   4176.00        0  100.00%   generic_bcopy [6]
>
> , while "p4-instr-retired" does not.
>   %   cumulative    self              self     total
>  time   seconds    seconds    calls  ms/call  ms/call  name
>  11.1    5351.00   5351.00        0  100.00%   ng_apply_item [1]
>   7.9    9178.00   3827.00        0  100.00%   legacy_pcib_alloc_msi [2]
>   4.1   11182.00   2004.00        0  100.00%   init386 [3]
>   4.0   13108.00   1926.00        0  100.00%   rn_match [4]
>   3.5   14811.00   1703.00        0  100.00%   uma_zalloc_arg [5]
>   2.6   16046.00   1235.00        0  100.00%   SHA256_Transform [6]
>   2.2   17130.00   1084.00        0  100.00%   ng_add_hook [7]
>   2.0   18111.00    981.00        0  100.00%   ng_rmhook_self [8]
>   2.0   19054.00    943.00        0  100.00%   em_encap [9]
>
> For the moment I have invented two possible explanations. One is that
> due to UMA's cyclic block allocation order it does not fit the CPU
> caches, and the other is that it is somehow related to critical_exit(),
> which can possibly cause a context switch. Does anybody have a better
> explanation of how such a small and simple (in this part) function can
> cause such results?

I didn't see bzero accounted for in any of the traces in this thread -
makes me wonder if that might mean that it's counted within uma_zalloc? 
Maybe we are calling it twice by accident?  I wasn't quite able to figure
out the logic of M_ZERO vs. UMA_ZONE_MALLOC etc. ... just a crazy idea.
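
For what it's worth, when a caller passes M_ZERO the zeroing happens inside
the allocator, so its cost is charged to uma_zalloc_arg() in the call graph.
Paraphrasing from memory (this is an assumption about uma_core.c, not a
quote of it), the end of the allocation fast path amounts to:

	/*
	 * Assumed shape of the M_ZERO handling in uma_zalloc_arg():
	 * the allocator itself zeroes the object when the caller
	 * asked for M_ZERO, after running the constructor.
	 */
	if (flags & M_ZERO)
		bzero(item, zone->uz_size);
	return (item);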

-- 
/"\  Best regards,  | [EMAIL PROTECTED]
\ /  Max Laier  | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | [EMAIL PROTECTED]
/ \  ASCII Ribbon Campaign  | Against HTML Mail and News


Re: Memory allocation performance

2008-02-02 Thread Robert Watson

On Sat, 2 Feb 2008, Kris Kennaway wrote:


Alexander Motin wrote:

Robert Watson wrote:
Hence my request for drilling down a bit on profiling -- the question I'm 
asking is whether profiling shows things running or taking time that 
shouldn't be.


I have not yet understood why it happens, but hwpmc shows a huge amount
of "p4-resource-stall"s in UMA functions:


For the moment I have invented two possible explanations. One is that due to
UMA's cyclic block allocation order it does not fit the CPU caches, and the
other is that it is somehow related to critical_exit(), which can possibly
cause a context switch. Does anybody have a better explanation of how such a
small and simple (in this part) function can cause such results?


You can look at the raw output from pmcstat, which is a collection of 
instruction pointers that you can feed to e.g. addr2line to find out exactly 
where in those functions the events are occurring.  This will often help to 
track down the precise causes.


There was, FYI, a report a few years ago that there was a measurable 
improvement from allocating off the free bucket rather than maintaining 
separate alloc and free buckets.  It sounded good at the time but I was never 
able to reproduce the benefits in my test environment.  Now might be a good 
time to try to revalidate that.  Basically, the goal would be to make the pcpu 
cache FIFO as much as possible as that maximizes the chances that the newly 
allocated object already has lines in the cache.  It's a fairly trivial tweak 
to the UMA allocation code.
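
To illustrate the idea: a rough sketch of an allocation fast path that
prefers the per-CPU free bucket, based on the uc_freebucket/uc_allocbucket
fields visible in the snippets quoted earlier in this thread.  This is
illustrative only, not the actual uma_zalloc_arg() code; uma_cache_t and
uma_bucket_t are the private types from uma_int.h and may differ between
versions.

/*
 * Sketch: pop from the free bucket first, so the object handed out is the
 * one freed most recently (LIFO) and is most likely to still be cached.
 */
static void *
cache_alloc_lifo(uma_cache_t cache)
{
	uma_bucket_t bucket;

	bucket = cache->uc_freebucket;
	if (bucket != NULL && bucket->ub_cnt > 0) {
		cache->uc_allocs++;
		return (bucket->ub_bucket[--bucket->ub_cnt]);
	}

	/* Fall back to the ordinary alloc bucket. */
	bucket = cache->uc_allocbucket;
	if (bucket != NULL && bucket->ub_cnt > 0) {
		cache->uc_allocs++;
		return (bucket->ub_bucket[--bucket->ub_cnt]);
	}

	/* Both buckets empty: the caller takes the normal slow path. */
	return (NULL);
}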


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Memory allocation performance

2008-02-02 Thread Kris Kennaway

Alexander Motin wrote:

Robert Watson wrote:
Hence my request for drilling down a bit on profiling -- the question 
I'm asking is whether profiling shows things running or taking time 
that shouldn't be.


I have not yet understood why it happens, but hwpmc shows a huge
amount of "p4-resource-stall"s in UMA functions:


For the moment I have invented two possible explanations. One is that due
to UMA's cyclic block allocation order it does not fit the CPU caches, and
the other is that it is somehow related to critical_exit(), which can
possibly cause a context switch. Does anybody have a better explanation of
how such a small and simple (in this part) function can cause such results?


You can look at the raw output from pmcstat, which is a collection of 
instruction pointers that you can feed to e.g. addr2line to find out 
exactly where in those functions the events are occurring.  This will 
often help to track down the precise causes.


Kris




Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Robert Watson wrote:
Hence my request for drilling down a bit on profiling -- the question 
I'm asking is whether profiling shows things running or taking time that 
shouldn't be.


I have not yet understood why it happens, but hwpmc shows a huge
amount of "p4-resource-stall"s in UMA functions:

  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
 45.2    2303.00   2303.00        0  100.00%   uma_zfree_arg [1]
 41.2    4402.00   2099.00        0  100.00%   uma_zalloc_arg [2]
  1.4    4472.00     70.00        0  100.00%   uma_zone_exhausted_nolock [3]
  0.9    4520.00     48.00        0  100.00%   ng_snd_item [4]
  0.8    4562.00     42.00        0  100.00%   __qdivrem [5]
  0.8    4603.00     41.00        0  100.00%   ether_input [6]
  0.6    4633.00     30.00        0  100.00%   ng_ppp_prepend [7]

This probably explains why "p4-global-power-events" shows many hits in them
  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
 20.0   37984.00  37984.00        0  100.00%   uma_zfree_arg [1]
 17.8   71818.00  33834.00        0  100.00%   uma_zalloc_arg [2]
  4.0   79483.00   7665.00        0  100.00%   ng_snd_item [3]
  3.0   85256.00   5773.00        0  100.00%   __mcount [4]
  2.3   89677.00   4421.00        0  100.00%   bcmp [5]
  2.2   93853.00   4176.00        0  100.00%   generic_bcopy [6]

, while "p4-instr-retired" does not.
  %   cumulative    self              self     total
 time   seconds    seconds    calls  ms/call  ms/call  name
 11.1    5351.00   5351.00        0  100.00%   ng_apply_item [1]
  7.9    9178.00   3827.00        0  100.00%   legacy_pcib_alloc_msi [2]
  4.1   11182.00   2004.00        0  100.00%   init386 [3]
  4.0   13108.00   1926.00        0  100.00%   rn_match [4]
  3.5   14811.00   1703.00        0  100.00%   uma_zalloc_arg [5]
  2.6   16046.00   1235.00        0  100.00%   SHA256_Transform [6]
  2.2   17130.00   1084.00        0  100.00%   ng_add_hook [7]
  2.0   18111.00    981.00        0  100.00%   ng_rmhook_self [8]
  2.0   19054.00    943.00        0  100.00%   em_encap [9]

For the moment I have invented two possible explanations. One is that due
to UMA's cyclic block allocation order it does not fit the CPU caches, and
the other is that it is somehow related to critical_exit(), which can
possibly cause a context switch. Does anybody have a better explanation of
how such a small and simple (in this part) function can cause such results?


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Peter Jeremy
On Sat, Feb 02, 2008 at 09:56:42PM +0200, Alexander Motin wrote:
>Peter Jeremy wrote:
>> On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>>> To check the UMA dependency I have made a trivial one-element cache which
>>> in my test case allows avoiding two of the four allocations per packet.
>> 
>> You should be able to implement this lockless using atomic(9).  I haven't
>> verified it, but the following should work.
>
>I have tried this, but man 9 atomic says:
>
>The atomic_readandclear() functions are not implemented for the types
>``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any 
>variants with memory barriers at this time.

Hmmm.  This seems to be more a documentation bug than missing code:
atomic_readandclear_ptr() seems to be implemented on most
architectures (the only one where I can't find it is arm) and is
already used in malloc(3).

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.




Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Peter Jeremy wrote:

On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
To check the UMA dependency I have made a trivial one-element cache which in
my test case allows avoiding two of the four allocations per packet.


You should be able to implement this lockless using atomic(9).  I haven't
verified it, but the following should work.


I have tried this, but man 9 atomic says:

The atomic_readandclear() functions are not implemented for the types
``char'', ``short'', ``ptr'', ``8'', and ``16'' and do not have any 
variants with memory barriers at this time.


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Peter Jeremy
On Sat, Feb 02, 2008 at 11:31:31AM +0200, Alexander Motin wrote:
>To check the UMA dependency I have made a trivial one-element cache which in
>my test case allows avoiding two of the four allocations per packet.

You should be able to implement this lockless using atomic(9).  I haven't
verified it, but the following should work.

>.alloc.
>-   item = uma_zalloc(ng_qzone, wait | M_ZERO);

>+   mtx_lock_spin(&itemcachemtx);
>+   item = itemcache;
>+   itemcache = NULL;
>+   mtx_unlock_spin(&itemcachemtx);
 =   item = atomic_readandclear_ptr(&itemcache);

>+   if (item == NULL)
>+   item = uma_zalloc(ng_qzone, wait | M_ZERO);
>+   else
>+   bzero(item, sizeof(*item));

>.free.
>-   uma_zfree(ng_qzone, item);

>+   mtx_lock_spin(&itemcachemtx);
>+   if (itemcache == NULL) {
>+   itemcache = item;
>+   item = NULL;
>+   }
>+   mtx_unlock_spin(&itemcachemtx);
>+   if (item)
>+   uma_zfree(ng_qzone, item);
 =   if (atomic_cmpset_ptr(&itemcache, NULL, item) == 0)
 =   uma_zfree(ng_qzone, item);
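
Putting the two atomic pieces together, a consolidated sketch of the
lockless variant as it might look inside ng_base.c (the ng_item_cache,
ng_item_alloc() and ng_item_free() names are invented for illustration;
ng_qzone, item_p and the M_ZERO/bzero handling follow the patch above):

#include <machine/atomic.h>	/* in addition to ng_base.c's usual includes */

/* One cached item, or 0 when the slot is empty. */
static volatile uintptr_t ng_item_cache;

static item_p
ng_item_alloc(int wait)
{
	item_p item;

	/* Atomically take whatever is in the slot, leaving it empty. */
	item = (item_p)atomic_readandclear_ptr(&ng_item_cache);
	if (item == NULL)
		item = uma_zalloc(ng_qzone, wait | M_ZERO);
	else
		bzero(item, sizeof(*item));
	return (item);
}

static void
ng_item_free(item_p item)
{
	/* Stash the item if the slot is empty; otherwise really free it. */
	if (atomic_cmpset_ptr(&ng_item_cache, 0, (uintptr_t)item) == 0)
		uma_zfree(ng_qzone, item);
}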

-- 
Peter Jeremy
Please excuse any delays as the result of my ISP's inability to implement
an MTA that is either RFC2821-compliant or matches their claimed behaviour.




Re: Memory allocation performance

2008-02-02 Thread Joseph Koshy
> Thanks, I have already found this. The only problem was that by default
> it counts cycles only when both logical cores are active, while one of
> my cores was halted.

Did you try the 'active' event modifier: "p4-global-power-events,active=any"?

> Sampling on this event, the profiler showed results close to the usual
> profiling, but looking more random:

Adding '-fno-omit-frame-pointer' to CFLAGS may help hwpmc to capture
callchains better.

Regards,
Koshy


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Joseph Koshy wrote:

You cannot sample with the TSC since the TSC does not interrupt the CPU.
For CPU cycles you would probably want to use "p4-global-power-events";
see pmc(3).


Thanks, I have already found this. The only problem was that by default it
counts cycles only when both logical cores are active, while one of my
cores was halted.
Sampling on this event, the profiler showed results close to the usual
profiling, but looking more random:


 175.97    1.49       1/64          ip_input [49]
 175.97    1.49       1/64          g_alloc_bio [81]
 175.97    1.49       1/64          ng_package_data [18]
1055.81    8.93       6/64          em_handle_rxtx [4]
2639.53   22.32      15/64          em_get_buf [19]
3343.41   28.27      19/64          ng_getqblk [17]
3695.34   31.25      21/64          ip_forward [14]
[9]    21.6  11262.00    95.23      64         uma_zalloc_arg [9]
  35.45   13.03       5/22          critical_exit [75]
  26.86    0.00      22/77          critical_enter [99]
  19.89    0.00      18/19          mb_ctor_mbuf [141]

  31.87    0.24       4/1324        ng_ether_rcvdata [13]
  31.87    0.24       4/1324        ip_forward [14]
  95.60    0.73      12/1324        ng_iface_rcvdata [16]
 103.57    0.79      13/1324        m_freem [25]
 876.34    6.71     110/1324        mb_free_ext [30]
9408.75   72.01    1181/1324        ng_free_item [11]
[10]   20.2  10548.00    80.73    1324         uma_zfree_arg [10]
  26.86    0.00      22/77          critical_enter [99]
  15.00   11.59       7/7           mb_dtor_mbuf [134]
  19.00    6.62       4/4           mb_dtor_pack [136]
   1.66    0.00       1/32          m_tag_delete_chain [114]

 21.4   11262.00  11262.00    64  175968.75  177456.76  uma_zalloc_arg [9]
 20.1   21810.00  10548.00  1324    7966.77    8027.74  uma_zfree_arg [10]
  5.6   24773.00   2963.00  1591    1862.35    2640.07  ng_snd_item [15]
  3.5   26599.00   1826.00    33   55333.33   55333.33  ng_address_hook [20]
  2.4   27834.00   1235.00   319    3871.47    3871.47  ng_acquire_read [28]

To make the statistics better I need to record sampling data with a smaller
period, but too much data creates additional overhead, including disk
operations, and breaks the statistics. Is there any way to make it more
precise? What sampling parameters should I use for better results?


--
Alexander Motin


Re: Memory allocation performance

2008-02-02 Thread Joseph Koshy
> I have tried it for measuring the number of instructions. But I doubt
> that instruction count is a correct metric for performance measurement,
> as different instructions may have very different execution times for
> many reasons, like cache misses and current memory traffic. I have
> tried to use tsc to count CPU cycles, but got the error:
> # pmcstat -n 1 -S "tsc" -O sample.out
> pmcstat: ERROR: Cannot allocate system-mode pmc with specification
> "tsc": Operation not supported
> What have I missed?

You cannot sample with the TSC since the TSC does not interrupt the CPU.
For CPU cycles you would probably want to use "p4-global-power-events";
see pmc(3).

Regards,
Koshy


Re: Memory allocation performance

2008-02-02 Thread Robert Watson


On Sat, 2 Feb 2008, Alexander Motin wrote:


Robert Watson wrote:
I guess the question is: where are the cycles going?  Are we suffering 
excessive cache misses in managing the slabs?  Are you effectively "cycling 
through" objects rather than using a smaller set that fits better in the 
cache?


In my test setup usually only several objects from the zone are allocated at
the same time, but they are allocated twice per packet.


To check the UMA dependency I have made a trivial one-element cache which in
my test case allows avoiding two of the four allocations per packet.


Avoiding unnecessary allocations is a good general principle, but duplicating 
cache logic is a bad idea.  If you're able to structure the below without 
using locking, it strikes me you'd do much better, especially if it's in a 
single processing pass.  Can you not use a per-thread/stack/session variable 
to avoid that?



.alloc.
-   item = uma_zalloc(ng_qzone, wait | M_ZERO);
+   mtx_lock_spin(&itemcachemtx);
+   item = itemcache;
+   itemcache = NULL;
+   mtx_unlock_spin(&itemcachemtx);


Why are you using spin locks?  They are quite a bit more expensive on several
hardware platforms, and any environment it's safe to call uma_zalloc() from
will be equally safe to use regular mutexes from (i.e., mutex-sleepable).
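
For comparison, a sketch of the same one-element cache under a regular
MTX_DEF mutex rather than a spin mutex (this would live alongside the patch
above in ng_base.c, so only <sys/lock.h> and <sys/mutex.h> are extra; the
free side mirrors the patch with mtx_lock() in place of mtx_lock_spin()):

static struct mtx itemcachemtx;
static item_p itemcache;

/* A default (sleep) mutex is fine anywhere uma_zalloc() may be called. */
MTX_SYSINIT(ng_itemcache, &itemcachemtx, "ng_itemcache", MTX_DEF);

static item_p
ng_item_alloc(int wait)
{
	item_p item;

	mtx_lock(&itemcachemtx);
	item = itemcache;
	itemcache = NULL;
	mtx_unlock(&itemcachemtx);
	if (item == NULL)
		item = uma_zalloc(ng_qzone, wait | M_ZERO);
	else
		bzero(item, sizeof(*item));
	return (item);
}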



+   if (item == NULL)
+   item = uma_zalloc(ng_qzone, wait | M_ZERO);
+   else
+   bzero(item, sizeof(*item));
.free.
-   uma_zfree(ng_qzone, item);
+   mtx_lock_spin(&itemcachemtx);
+   if (itemcache == NULL) {
+   itemcache = item;
+   item = NULL;
+   }
+   mtx_unlock_spin(&itemcachemtx);
+   if (item)
+   uma_zfree(ng_qzone, item);
...

To be sure that the test system is CPU-bound I have throttled it with sysctl
to 1044MHz. With this patch my test PPPoE-to-PPPoE router throughput has
grown from 17 to 21 Mbytes/s. The profiling results I have sent promised
results close to this.


Is some bit of debugging enabled that shouldn't be, perhaps due to a 
failure of ifdefs?


I have commented out all INVARIANTS and WITNESS options from GENERIC kernel 
config. What else should I check?


Hence my request for drilling down a bit on profiling -- the question I'm 
asking is whether profiling shows things running or taking time that shouldn't 
be.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Memory allocation performance

2008-02-02 Thread Alexander Motin

Robert Watson wrote:
I guess the question is: where are the cycles going?  Are we suffering 
excessive cache misses in managing the slabs?  Are you effectively 
"cycling through" objects rather than using a smaller set that fits 
better in the cache?


In my test setup usually only several objects from the zone are allocated
at the same time, but they are allocated twice per packet.


To check the UMA dependency I have made a trivial one-element cache which
in my test case allows avoiding two of the four allocations per packet.

.alloc.
-   item = uma_zalloc(ng_qzone, wait | M_ZERO);
+   mtx_lock_spin(&itemcachemtx);
+   item = itemcache;
+   itemcache = NULL;
+   mtx_unlock_spin(&itemcachemtx);
+   if (item == NULL)
+   item = uma_zalloc(ng_qzone, wait | M_ZERO);
+   else
+   bzero(item, sizeof(*item));
.free.
-   uma_zfree(ng_qzone, item);
+   mtx_lock_spin(&itemcachemtx);
+   if (itemcache == NULL) {
+   itemcache = item;
+   item = NULL;
+   }
+   mtx_unlock_spin(&itemcachemtx);
+   if (item)
+   uma_zfree(ng_qzone, item);
...

To be sure that the test system is CPU-bound I have throttled it with sysctl
to 1044MHz. With this patch my test PPPoE-to-PPPoE router throughput has
grown from 17 to 21 Mbytes/s. The profiling results I have sent promised
results close to this.


Is some bit of debugging enabled that shouldn't 
be, perhaps due to a failure of ifdefs?


I have commented out all INVARIANTS and WITNESS options from GENERIC 
kernel config. What else should I check?


--
Alexander Motin


Re: Memory allocation performance

2008-02-01 Thread Bruce Evans

On Fri, 1 Feb 2008, Alexander Motin wrote:


Robert Watson wrote:
It would be very helpful if you could try doing some analysis with hwpmc -- 
"high resolution profiling" is of increasingly limited utility with modern


You mean "of increasingly greater utility with modern CPUs".  Low resolution
kernel profiling stopped giving enough resolution in about 1990, and has
become of increasingly limited utility since then, but high resolution
kernel profiling uses the TSC or possibly a performance counter so it
has kept up with CPU speedups.  Cache effects and out of order execution
are larger now, but they affect all types of profiling and still not too
bad with high resulotion kernel profiling.  High resolution kernel profiling
doesn't really work with SMP, but that is not Alexander's problem since he
profiled under UP.

CPUs, where even a high frequency timer won't run very often.  It's also 
quite subject to cycle events that align with other timers in the system.


No, it isn't affected by either of these.  The TSC is incremented on every
CPU cycle and the performance counters are incremented on every event.  It
is completely unaffected by other timers.

I have tried hwpmc but I am still not completely comfortable with it. The
whole picture is somewhat similar to kgmon's, but it looks very noisy. Is
there some know-how on how to use it better?


hwpmc doesn't work for me either.  I can't see how it could work as well
as high resolution kernel profiling for events at the single function
level, since it is statistics-based.  With the statistics clock interrupt
rate fairly limited, it just cannot get enough resolution over short runs.
Also, it works poorly for me (with a current kernel and ~5.2 userland
except for some utilities like pmc*).  Generation of profiles stopped
working for me, and it often fails with allocation errors.

I have tried it for measuring the number of instructions. But I doubt that
instruction count is a correct metric for performance measurement, as
different instructions may have very different execution times for many
reasons, like cache misses and current memory traffic.


Cycle counts are more useful, but high resolution kernel profiling can do
this too, with some fixes:
- update perfmon for newer CPUs.  It is broken even for Athlons (takes a
  2 line fix, or more lines with proper #defines and if()s).
- select the performance counter to be used for profiling using
  sysctl machdep.cputime_clock=$((5 + N)) where N is the number of the
  performance counter for the instruction count (or any).  I use hwpmc
  mainly to determine N :-).  It may also be necessary to change the
  kernel variable cpu_clock_pmc_conf.  Configuration of this is unfinished.
- use high resolution kernel profiling normally.  Note that switching to
  a perfmon counter is only permitted if !SMP (since it is too unsupported
  under SMP to do more than panic if permitted under SMP), and that the
  switch loses the calibration of profiling.  Profiling normally
  compensates for overheads of the profiling itself, and the compensation
  would work almost perfectly for event counters, unlike for time-related
  counters, since the extra events for profiling aren't much affected by
  caches.

I have tried to use 
tsc to count CPU cycles, but got the error:

# pmcstat -n 1 -S "tsc" -O sample.out
pmcstat: ERROR: Cannot allocate system-mode pmc with specification "tsc": 
Operation not supported

What have I missed?


This might be just because the TSC really is not supported.  Many things
require an APIC for hwpmc to support them.

I get allocation errors like this for operations that work a few times
before failing.

I am now using a Pentium 4 Prescott CPU with HTT enabled in the BIOS, but
with the kernel built without SMP to simplify profiling. What counters can
you recommend for regular time profiling on it?


Try them all :-).  From userland first with an overall count, since looking
at the details in gprof output takes too long (and doesn't work for me with
hwpmc anyway).  I use scripts like the following to try them all from
userland:

runpm:
%%%
c="ttcp -n10 -l5 -u -t epsplex"

ctr=0
while test $ctr -lt 256
do
ctr1=$(printf "0x%02x\n" $ctr)
case $ctr1 in
0x00)   src=k8-fp-dispatched-fpu-ops;;
0x01)   src=k8-fp-cycles-with-no-fpu-ops-retired;;
0x02)   src=k8-fp-dispatched-fpu-fast-flag-ops;;
0x05)   src=k8-fp-unknown-$ctr1;;
0x09)   src=k8-fp-unknown-$ctr1;;
0x0d)   src=k8-fp-unknown-$ctr1;;
0x11)   src=k8-fp-unknown-$ctr1;;
0x15)   src=k8-fp-unknown-$ctr1;;
0x19)   src=k8-fp-unknown-$ctr1;;
0x1d)   src=k8-fp-unknown-$ctr1;;
0x20)   src=k8-ls-segment-register-load;;   # XXX
0x21)   src=kx-ls-microarchitectural-resync-by-self-mod-code;;
0x22)   src=k8-ls-microarchitectural-resync-by-snoop;;
0x23)   src=kx-ls-buffer2-full;;
0x24)   src=k8-ls-locked-operation;;  

Re: Memory allocation performance

2008-02-01 Thread Alexander Motin

Hi.

Robert Watson wrote:
It would be very helpful if you could try doing some analysis with hwpmc 
-- "high resolution profiling" is of increasingly limited utility with 
modern CPUs, where even a high frequency timer won't run very often.  
It's also quite subject to cycle events that align with other timers in 
the system.


I have tried hwpmc but I am still not completely comfortable with it. The
whole picture is somewhat similar to kgmon's, but it looks very noisy. Is
there some know-how on how to use it better?


I have tried it for measuring the number of instructions. But I doubt that
instruction count is a correct metric for performance measurement, as
different instructions may have very different execution times for many
reasons, like cache misses and current memory traffic. I have tried to use
tsc to count CPU cycles, but got the error:

# pmcstat -n 1 -S "tsc" -O sample.out
pmcstat: ERROR: Cannot allocate system-mode pmc with specification 
"tsc": Operation not supported

What have I missed?

I am now using a Pentium 4 Prescott CPU with HTT enabled in the BIOS, but
with the kernel built without SMP to simplify profiling. What counters can
you recommend for regular time profiling on it?


Thanks for the reply.

--
Alexander Motin


Re: Memory allocation performance

2008-02-01 Thread Robert Watson


On Fri, 1 Feb 2008, Alexander Motin wrote:

That was actually my second question. As there are only 512 items by default
and they are small in size, I can easily preallocate them all on boot. But
is that a good way? Why can't UMA do just the same when I have created a
zone with a specified element size and maximum number of objects? What is
the principal difference?


Alexander,

I think we should drill down in the analysis a bit and see if we can figure 
out what's going on with UMA.  What UMA essentially does is ask the VM for 
pages, and then pack objects into pages.  It maintains some meta-data, and 
depending on the relative sizes of objects and pages, it may store it in the 
page or potentially elsewhere.  Either way, it looks very much like an array
of struct object.  It has a few extra layers of wrapping in order to maintain
stats, per-CPU caches, object life cycle, etc.  When INVARIANTS is turned off, 
allocation from the per-CPU cache consists of pulling objects in and out of 
one of two per-CPU queues.  So I guess the question is: where are the cycles 
going?  Are we suffering excessive cache misses in managing the slabs?  Are 
you effectively "cycling through" objects rather than using a smaller set that 
fits better in the cache?  Is some bit of debugging enabled that shouldn't be, 
perhaps due to a failure of ifdefs?


BTW, UMA does let you set the size of buckets, so you can try tuning the
bucket size.  For starters, try setting the zone flag UMA_ZONE_MAXBUCKET.
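
A sketch of where the flag would go for the netgraph item zone discussed in
this thread (the arguments here are illustrative and may not match ng_base.c
exactly):

/* Create the item zone asking UMA for its largest per-CPU buckets. */
ng_qzone = uma_zcreate("NetGraph items", sizeof(struct ng_item),
    NULL, NULL, NULL, NULL, UMA_ALIGN_CACHE, UMA_ZONE_MAXBUCKET);
uma_zone_set_max(ng_qzone, maxalloc);	/* existing item limit, if any */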


It would be very helpful if you could try doing some analysis with hwpmc -- 
"high resolution profiling" is of increasingly limited utility with modern 
CPUs, where even a high frequency timer won't run very often.  It's also quite 
subject to cycle events that align with other timers in the system.


Robert N M Watson
Computer Laboratory
University of Cambridge


Re: Memory allocation performance

2008-02-01 Thread Kris Kennaway

Alexander Motin wrote:

Kris Kennaway wrote:

Alexander Motin wrote:

Alexander Motin wrote:
While profiling netgraph operation on a UP HEAD router I have found
that a huge amount of time is spent on memory allocation/deallocation:


I forgot to mention that it was mostly a GENERIC kernel, just built
without INVARIANTS, WITNESS and SMP but with 'profile 2'.


What is 'profile 2'?


I thought it was the high resolution profiling support. Isn't it?



OK.  This is not commonly used so I don't know if it works.  Try using 
hwpmc if possible to compare.


When you say that your own allocation routines show less time use under 
profiling, how do they affect the actual system performance?


Kris



Re: Memory allocation performance

2008-01-31 Thread Julian Elischer

Alexander Motin wrote:

Julian Elischer wrote:

Alexander Motin wrote:

Hi.

While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]    14.1    0.56    0.21  545292         uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]    15.2    0.60    0.23  545266         uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that
are left are practically unavoidable. But even after this, kgmon tells
me that 30% of the CPU time is consumed by memory management.


So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, then why is UMA so slow? I have tried to replace it in
some places with a preallocated TAILQ of the required memory blocks
protected by a mutex, and according to the profiler I got _much_ better
results. Would it be good practice to replace relatively small UMA zones
with a preallocated queue to avoid part of the UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but does
it really cost so much that it takes 30% of the CPU time on a UP router?


given this information, I would add an 'item cache' in ng_base.c
(hmm do I already have one?)


That was actually my second question. As there are only 512 items by
default and they are small in size, I can easily preallocate them all on
boot. But is that a good way? Why can't UMA do just the same when I have
created a zone with a specified element size and maximum number of objects?
What is the principal difference?




who knows what uma does.. but if you do it yourself you know what the 
overhead is.. :-)




Re: Memory allocation performance

2008-01-31 Thread Alexander Motin

Julian Elischer wrote:

Alexander Motin wrote:

Hi.

While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]    14.1    0.56    0.21  545292         uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]    15.2    0.60    0.23  545266         uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that
are left are practically unavoidable. But even after this, kgmon tells
me that 30% of the CPU time is consumed by memory management.


So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, then why is UMA so slow? I have tried to replace it in
some places with a preallocated TAILQ of the required memory blocks
protected by a mutex, and according to the profiler I got _much_ better
results. Would it be good practice to replace relatively small UMA zones
with a preallocated queue to avoid part of the UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but does
it really cost so much that it takes 30% of the CPU time on a UP router?


given this information, I would add an 'item cache' in ng_base.c
(hmm do I already have one?)


That was actually my second question. As there are only 512 items by
default and they are small in size, I can easily preallocate them all on
boot. But is that a good way? Why can't UMA do just the same when I have
created a zone with a specified element size and maximum number of objects?
What is the principal difference?


--
Alexander Motin


Re: Memory allocation performance

2008-01-31 Thread Alexander Motin

Kris Kennaway wrote:

Alexander Motin wrote:

Alexander Motin wrote:
While profiling netgraph operation on a UP HEAD router I have found
that a huge amount of time is spent on memory allocation/deallocation:


I forgot to mention that it was mostly a GENERIC kernel, just built
without INVARIANTS, WITNESS and SMP but with 'profile 2'.


What is 'profile 2'?


I thought it was the high resolution profiling support. Isn't it?

--
Alexander Motin


Re: Memory allocation performance

2008-01-31 Thread Kris Kennaway

Alexander Motin wrote:

Alexander Motin wrote:
While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


I forgot to mention that it was mostly a GENERIC kernel, just built
without INVARIANTS, WITNESS and SMP but with 'profile 2'.




What is 'profile 2'?

Kris


Re: Memory allocation performance

2008-01-31 Thread Julian Elischer

Alexander Motin wrote:

Hi.

While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]    14.1    0.56    0.21  545292         uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]    15.2    0.60    0.23  545266         uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that
are left are practically unavoidable. But even after this, kgmon tells
me that 30% of the CPU time is consumed by memory management.


So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, then why is UMA so slow? I have tried to replace it in
some places with a preallocated TAILQ of the required memory blocks
protected by a mutex, and according to the profiler I got _much_ better
results. Would it be good practice to replace relatively small UMA zones
with a preallocated queue to avoid part of the UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but does
it really cost so much that it takes 30% of the CPU time on a UP router?


given this information, I would add an 'item cache' in ng_base.c
(hmm do I already have one?)




Thanks!





Re: Memory allocation performance

2008-01-31 Thread Kris Kennaway

Alexander Motin wrote:

Hi.

While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]    14.1    0.56    0.21  545292         uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]    15.2    0.60    0.23  545266         uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that
are left are practically unavoidable. But even after this, kgmon tells
me that 30% of the CPU time is consumed by memory management.


So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, then why is UMA so slow? I have tried to replace it in
some places with a preallocated TAILQ of the required memory blocks
protected by a mutex, and according to the profiler I got _much_ better
results. Would it be good practice to replace relatively small UMA zones
with a preallocated queue to avoid part of the UMA calls?
3) I have seen that UMA does some kind of CPU cache affinity, but does
it really cost so much that it takes 30% of the CPU time on a UP router?


Make sure you have INVARIANTS disabled, it has a high performance cost 
in UMA.


Kris


Re: Memory allocation performance

2008-01-31 Thread Alexander Motin

Alexander Motin wrote:
While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


I forgot to mention that it was mostly a GENERIC kernel, just built
without INVARIANTS, WITNESS and SMP but with 'profile 2'.


--
Alexander Motin


Memory allocation performance

2008-01-31 Thread Alexander Motin

Hi.

While profiling netgraph operation on a UP HEAD router I have found that
a huge amount of time is spent on memory allocation/deallocation:


0.14  0.05  132119/545292  ip_forward  [12]
0.14  0.05  133127/545292  fxp_add_rfabuf [18]
0.27  0.10  266236/545292  ng_package_data [17]
[9]    14.1    0.56    0.21  545292         uma_zalloc_arg [9]
0.17  0.00  545292/1733401 critical_exit  [98]
0.01  0.00  275941/679675  generic_bzero [68]
0.01  0.00  133127/133127  mb_ctor_pack [103]

0.15  0.06  133100/545266  mb_free_ext [22]
0.15  0.06  133121/545266  m_freem [15]
0.29  0.11  266236/545266  ng_free_item [16]
[8]    15.2    0.60    0.23  545266         uma_zfree_arg [8]
0.17  0.00  545266/1733401 critical_exit  [98]
0.00  0.04  133100/133100  mb_dtor_pack [57]
0.00  0.00  134121/134121  mb_dtor_mbuf [111]

I have already optimized all possible allocation calls, and those that
are left are practically unavoidable. But even after this, kgmon tells
me that 30% of the CPU time is consumed by memory management.


So I have some questions:
1) Is this a real situation or just a profiler mistake?
2) If it is real, then why is UMA so slow? I have tried to replace it in
some places with a preallocated TAILQ of the required memory blocks
protected by a mutex, and according to the profiler I got _much_ better
results. Would it be good practice to replace relatively small UMA zones
with a preallocated queue to avoid part of the UMA calls? (A rough sketch
of this approach is below.)
3) I have seen that UMA does some kind of CPU cache affinity, but does
it really cost so much that it takes 30% of the CPU time on a UP router?
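
A rough sketch of the preallocated-queue replacement mentioned in question 2
(the element type, the names and the use of M_TEMP are invented for
illustration):

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>
#include <sys/queue.h>

struct qitem {
	TAILQ_ENTRY(qitem) qi_link;
	/* ... payload ... */
};

static TAILQ_HEAD(, qitem) qi_freelist =
    TAILQ_HEAD_INITIALIZER(qi_freelist);
static struct mtx qi_mtx;
MTX_SYSINIT(qi_mtx, &qi_mtx, "qitem freelist", MTX_DEF);

/* Fill the free list once, at boot or module load time. */
static void
qi_preallocate(int n)
{
	struct qitem *it;

	while (n-- > 0) {
		it = malloc(sizeof(*it), M_TEMP, M_WAITOK | M_ZERO);
		mtx_lock(&qi_mtx);
		TAILQ_INSERT_HEAD(&qi_freelist, it, qi_link);
		mtx_unlock(&qi_mtx);
	}
}

/* Take a block from the preallocated pool; NULL if it is exhausted. */
static struct qitem *
qi_get(void)
{
	struct qitem *it;

	mtx_lock(&qi_mtx);
	it = TAILQ_FIRST(&qi_freelist);
	if (it != NULL)
		TAILQ_REMOVE(&qi_freelist, it, qi_link);
	mtx_unlock(&qi_mtx);
	return (it);
}

/* Return a block to the pool; inserting at the head keeps reuse LIFO. */
static void
qi_put(struct qitem *it)
{
	mtx_lock(&qi_mtx);
	TAILQ_INSERT_HEAD(&qi_freelist, it, qi_link);
	mtx_unlock(&qi_mtx);
}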


Thanks!

--
Alexander Motin