On Sat, 23 Dec 2006, I wrote:

The problem becomes smaller as the read block size approaches the file
system block size and vanishes when the sizes are identical.  Then
there is apparently a different (smaller) problem:

Read size 16K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 1.15 seconds

 %   cumulative   self              self     total
time   seconds   seconds    calls  ns/call  ns/call  name
49.1      0.565    0.565    25643    22037    22037  copyout [11]
12.6      0.710    0.145        0  100.00%           mcount [14]
 8.8      0.811    0.101    87831     1153     1153  vm_page_splay [17]
 7.0      0.892    0.081   112906      715      715  buf_splay [19]
 6.1      0.962    0.070        0  100.00%           mexitcount [20]
 3.4      1.000    0.039        0  100.00%           cputime [22]
 1.2      1.013    0.013    86883      153      181  vm_page_unwire [28]
 1.1      1.027    0.013        0  100.00%           user [29]
 1.1      1.040    0.013    21852      595     3725  getnewbuf [18]
%%%

Read size 16K, sequential:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 0.96 seconds

 %   cumulative   self              self     total
time   seconds   seconds    calls  ns/call  ns/call  name
57.1      0.550    0.550    25643    21464    21464  copyout [11]
14.2      0.687    0.137        0  100.00%           mcount [12]
 6.9      0.754    0.066        0  100.00%           mexitcount [15]
 4.2      0.794    0.040   102830      391      391  vm_page_splay [19]
 3.8      0.830    0.037        0  100.00%           cputime [20]
 1.4      0.844    0.013   102588      130      130  buf_splay [22]
 1.3      0.856    0.012    25603      488     1920  getnewbuf [17]
 1.0      0.866    0.009    25606      368      368  pmap_qremove [24]
%%%

Now the splay routines are called almost the same number of times, but
take much longer in the random case.  buf_splay() seems to be unrelated
to vm -- it is called from gbincore() even when the buffer is already
in the buffer cache.  It seems quite slow for that -- almost 1 us just
to look up a buffer, compared with 21 us to copyout a 16K buffer.
Linux-sized buffers would take only about 1.5 us to copy, so 1 us to
look them up would clearly be too much.  Another benchmark shows
gbincore() taking 501 ns each to look up 64 in-buffer-cache buffers for
a 1MB file -- this must be the best case for it (all these times are
for -current on an Athlon XP2700 overclocked to 2025 MHz).  The generic
hash function used in my compiler takes 40 ns to hash a 16-byte string
on this machine.
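For scale, here is a minimal userland sketch of that comparison.  The
hash is a plain FNV-1a over a 16-byte key, chosen purely for
illustration (it is not the compiler's actual hash function), and the
comments work through the copy-vs-lookup arithmetic:
%%%
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Rough arithmetic behind the comparison above:
 *   copyout of a 16K buffer            ~21000 ns  (~1.3 ns/byte)
 *   copyout of a 1K (Linux-sized) buf  ~ 1300-1500 ns
 *   buf_splay()-based lookup           ~ 1000 ns  (random case)
 *   generic hash of a 16-byte key      ~   40 ns
 * i.e., for small buffers the lookup alone costs nearly as much as
 * moving the data, while a cheap hash would be noise.
 */
static uint32_t
hash16(const unsigned char key[16])
{
	uint32_t h = 2166136261u;	/* FNV-1a offset basis */
	int i;

	for (i = 0; i < 16; i++) {
		h ^= key[i];
		h *= 16777619u;		/* FNV prime */
	}
	return (h);
}

int
main(void)
{
	unsigned char key[16];

	memset(key, 0xab, sizeof(key));
	printf("hash = %#x\n", (unsigned)hash16(key));
	return (0);
}
%%%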

FreeBSD-~4.10 is faster.  The difference is especially noticeable when
the read size is the same as the fs block size (16K, as above).  Then
I get the following speeds:

~4.10, random:     580 MB/s
~4.10, sequential: 580 MB/s
~5.2, random:      575 MB/s
~5.2, sequential:  466 MB/s

All with kernel profiling not configured, and no INVARIANTS etc.

~5.2 is quite different from -current, but it has buf_splay() and
vm_page_splay(), and behaves similarly in this benchmark.

With profiling, ~4.10, read size 16K, sequential (the parenthesized
values below are for the random case):
%%%
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 51.1      0.547    0.547    25643    21323    21323  generic_copyout [9]
 17.3      0.732    0.185        0  100.00%           mcount [10]
  7.9      0.817    0.085        0  100.00%           mexitcount [13]
  5.0      0.870    0.053        0  100.00%           cputime [16]
  1.9      0.891    0.020    51207      395      395  gbincore [20]
                                                (424 for random)
  1.4      0.906    0.015   102418      150      253  vm_page_wire [18]
                                                (322)
  1.3      0.920    0.014   231218       62       62  splvm [23]
  1.3      0.934    0.014    25603      541     2092  allocbuf [15]
                                               (2642)
  1.0      0.945    0.010   566947       18       18  splx <cycle 1> [25]
  1.0      0.955    0.010   102122      100      181  vm_page_unwire [21]
  0.9      0.964    0.009    25606      370      370  pmap_qremove [27]
  0.9      0.973    0.009    25603      359     2127  getnewbuf [14]
                                               (2261)
%%%

There is little difference from -current for the sequential case, but
the old gbincore() and buffer allocation routines are much faster for
the random case.
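The shape of the old lookup is easy to model.  Below is a
self-contained userland sketch of a hash-chain lookup keyed on (vnode
pointer, logical block number) -- a stand-in for the hash-based
gbincore(), not the actual 4.x source; the names and the hash are made
up for illustration.  Each lookup walks one short chain and modifies
nothing, whereas the splay-based lookup restructures the tree on every
call, which is exactly what a random block sequence punishes:
%%%
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Userland model only -- not the 4.x kernel source.  "bufent" stands
 * in for struct buf, and the hash is arbitrary.  The point is that a
 * lookup touches one short chain and does not modify the structure,
 * independent of the access pattern.
 */
#define NBUCKETS	1024			/* power of two */

struct bufent {
	struct bufent	*next;			/* hash chain link */
	uintptr_t	vp;			/* stand-in for b_vp */
	long		lblkno;			/* stand-in for b_lblkno */
};

static struct bufent *buckets[NBUCKETS];

static unsigned
bufhash(uintptr_t vp, long lblkno)
{
	return (((unsigned)(vp >> 7) + (unsigned long)lblkno) &
	    (NBUCKETS - 1));
}

static struct bufent *
lookup(uintptr_t vp, long lblkno)
{
	struct bufent *bp;

	for (bp = buckets[bufhash(vp, lblkno)]; bp != NULL; bp = bp->next)
		if (bp->vp == vp && bp->lblkno == lblkno)
			return (bp);
	return (NULL);
}

static void
insert(struct bufent *bp)
{
	unsigned h = bufhash(bp->vp, bp->lblkno);

	bp->next = buckets[h];
	buckets[h] = bp;
}

int
main(void)
{
	struct bufent *bp;
	long blk;

	/* 64 in-"cache" buffers for one fake vnode, as in the 1MB case. */
	for (blk = 0; blk < 64; blk++) {
		bp = calloc(1, sizeof(*bp));
		if (bp == NULL)
			return (1);
		bp->vp = 0x1000;
		bp->lblkno = blk;
		insert(bp);
	}
	/* Look them up in an arbitrary (here reversed) order. */
	for (blk = 63; blk >= 0; blk--)
		if (lookup(0x1000, blk) == NULL)
			printf("miss at block %ld\n", blk);
	return (0);
}
%%%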

With profiling, ~4.10, read size 4K, random:
%%%
granularity: each sample hit covers 16 byte(s) for 0.00% of 2.63 seconds

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ns/call  ns/call  name
 27.3      0.720    0.720        0  100.00%           mcount [8]
 22.5      1.312    0.592   102436     5784     5784  generic_copyout [10]
 12.6      1.643    0.331        0  100.00%           mexitcount [13]
  7.9      1.850    0.207        0  100.00%           cputime [15]
  2.9      1.926    0.076   189410      402      402  gbincore [20]
  2.3      1.988    0.061   348029      176      292  vm_page_wire [18]
  2.2      2.045    0.058    87010      662     2500  allocbuf [14]
  2.0      2.099    0.053   783280       68       68  splvm [22]
  1.6      2.142    0.043        0   99.33%           user [24]
  1.6      2.184    0.042  2041759       20       20  splx <cycle 3> [26]
  1.3      2.217    0.034   347298       97      186  vm_page_unwire [21]
  1.2      2.249    0.032    86895      370      370  pmap_qremove [28]
  1.1      2.279    0.029    87006      337     2144  getnewbuf [16]
  0.9      2.303    0.024    86891      280     1617  vfs_vmio_release [17]
%%%

Now the result is little different from -current -- the random case is
almost as slow as in -current according to the total time, although
this may be an artifact of profiling (allocbuf's total time is 2500
ns/call in ~4.10 vs 4025 ns/call in -current).

Bruce