Hi,

On 2020-05-03 09:12:59 -0400, Robert Haas wrote:
> On Sat, May 2, 2020 at 10:36 PM Andres Freund <and...@anarazel.de> wrote:
> > I changed Robert's test program to optionall fallocate,
> > sync_file_range(WRITE), posix_fadvise(DONTNEED), to avoid a large
> > footprint in the page cache. The performance
> > differences are quite substantial:
> >
> > gcc -Wall -ggdb ~/tmp/write_and_fsync.c -o /tmp/write_and_fsync && \
> >     rm -ff /srv/dev/bench/test* && echo 3 |sudo tee 
> > /proc/sys/vm/drop_caches && \
> >     /tmp/write_and_fsync --sync_file_range=0 --fallocate=0 --fadvise=0 
> > --filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1
> >
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 
> > fallocate=0 sfr=0 fadvise=0
> > [/srv/dev/bench/test1][11450] open: 0, fallocate: 0 write: 214, fsync: 6, 
> > close: 0, total: 220
> >
> > comparing that with --sync_file_range=1 --fallocate=1 --fadvise=1
> > running test with: numprocs=1 filesize=429496729600 blocksize=8192 
> > fallocate=1 sfr=1 fadvise=1
> > [/srv/dev/bench/test1][14098] open: 0, fallocate: 0 write: 161, fsync: 0, 
> > close: 0, total: 161
>
> Ah, nice.

Btw, I forgot to include the result for 0 / 0 / 0 in the results
(off-by-one error in a script :))

numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0     
220.210155081     1,569,524,602,961     178.188M/sec    1,363,686,761,705    
0.155GHz        833,345,334,408      0.68

> > The run-to-run variations between the runs without cache control are
> > pretty large. So this is probably not the end-all-be-all numbers. But I
> > think the trends are pretty clear.
>
> Could you be explicit about what you think those clear trends are?

Largely that concurrency can help a bit, but also hurt
tremendously. Below is some more detailed analysis, it'll be a bit
long...

Taking the no concurrency / cache management as a baseline:

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   220.210155081   1,569,524,602,961     178.188M/sec    1,363,686,761,705    
> 0.155GHz        833,345,334,408      0.68

and comparing cache management with using some concurrency:

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   164.175492485   913,991,290,231       139.183M/sec    762,359,320,428      
> 0.116GHz        678,451,556,273      0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   164.052510134   1,561,521,537,336     237.972M/sec    1,404,761,167,120    
> 0.214GHz        715,274,337,015      0.51

we can see very similar timing. Which makes sense, because that's
roughly the device's max speed. But then going to higher concurrency,
there's clearly regressions:

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   455.096916715   2,808,715,616,077     154.293M/sec    1,366,660,063,053    
> 0.075GHz        888,512,073,477      0.66

And I think it is instructive to look at the
ref_cycles_tot/cycles_tot/instructions_tot vs
ref_cycles_sec/cycles_sec/ipc. The units are confusing because they are
across all cores and most are idle. But it's pretty obvious that
numprocs=1 sfr=1 fadvise=1 has cores running for a lot shorter time
(reference cycles basically count the time cores were running on a
absolute time scale). Compared to numprocs=2 sfr=0 fadvise=0, which has
the same resulting performance, it's clear that cores were busier, but
less efficient (lower ipc).

With cache mangement there's very little benefit, and some risk (1->2
regression) in this workload with increasing concurrency:

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   164.175492485   913,991,290,231       139.183M/sec    762,359,320,428      
> 0.116GHz        678,451,556,273      0.84
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   188.772193248   1,418,274,870,697     187.803M/sec    923,133,958,500      
> 0.122GHz        799,212,291,243      0.92
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   158.262790654   1,720,443,307,097     271.769M/sec    1,004,079,045,479    
> 0.159GHz        826,905,592,751      0.84


And there's good benefit, but tremendous risk, of concurrency in the no
cache control case:

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   220.210155081   1,569,524,602,961     178.188M/sec    1,363,686,761,705    
> 0.155GHz        833,345,334,408      0.68
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   164.052510134   1,561,521,537,336     237.972M/sec    1,404,761,167,120    
> 0.214GHz        715,274,337,015      0.51
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   455.096916715   2,808,715,616,077     154.293M/sec    1,366,660,063,053    
> 0.075GHz        888,512,073,477      0.66


sync file range without fadvise isn't a benefit at low concurrency, but 
prevents bad regressions at high concurency:
> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   220.210155081   1,569,524,602,961     178.188M/sec    1,363,686,761,705    
> 0.155GHz        833,345,334,408      0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   248.430736196   1,497,048,950,014     150.653M/sec    1,226,822,167,960    
> 0.123GHz        705,950,461,166    0.54

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   164.052510134   1,561,521,537,336     237.972M/sec    1,404,761,167,120    
> 0.214GHz        715,274,337,015      0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   192.151682414   1,526,440,715,456     198.603M/sec    1,037,135,756,007    
> 0.135GHz        802,754,964,096      0.76

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   455.096916715   2,808,715,616,077     154.293M/sec    1,366,660,063,053    
> 0.075GHz        888,512,073,477      0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   256.156100686   2,407,922,637,215     235.003M/sec    1,133,311,037,956    
> 0.111GHz        748,666,206,805      0.65

fadvise alone is similar:
> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   220.210155081   1,569,524,602,961     178.188M/sec    1,363,686,761,705    
> 0.155GHz        833,345,334,408      0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   310.275952938   1,921,817,571,226     154.849M/sec    1,499,581,687,133    
> 0.121GHz        944,243,167,053      0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   164.052510134   1,561,521,537,336     237.972M/sec    1,404,761,167,120    
> 0.214GHz        715,274,337,015      0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   242.648245159   1,782,637,416,163     183.629M/sec    1,463,696,313,881    
> 0.151GHz        1,000,100,694,932    0.69

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   455.096916715   2,808,715,616,077     154.293M/sec    1,366,660,063,053    
> 0.075GHz        888,512,073,477      0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   215.255015340   1,977,578,120,924     229.676M/sec    1,461,504,758,029    
> 0.170GHz        1,005,270,838,642    0.68


There does not appear to be a huge of benefit in fallocate in this
workload, the OS's delayed allocation works well. Compare:

numprocs=1
> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   220.210155081   1,569,524,602,961     178.188M/sec    1,363,686,761,705    
> 0.155GHz        833,345,334,408      0.68
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0   
>   243.609959554   1,802,385,405,203     184.970M/sec    1,449,560,513,247    
> 0.149GHz        855,426,288,031      0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   248.430736196   1,497,048,950,014     150.653M/sec    1,226,822,167,960    
> 0.123GHz        705,950,461,166      0.54
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0   
>   230.880100449   1,328,417,418,799     143.846M/sec    1,148,924,667,393    
> 0.124GHz        723,158,246,628      0.63

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   310.275952938   1,921,817,571,226     154.849M/sec    1,499,581,687,133    
> 0.121GHz        944,243,167,053      0.59
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1   
>   253.591234992   1,548,485,571,798     152.658M/sec    1,229,926,994,613    
> 0.121GHz        1,117,352,436,324    0.95

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   164.175492485   913,991,290,231       139.183M/sec    762,359,320,428      
> 0.116GHz        678,451,556,273      0.84
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   164.488835158   911,974,902,254       138.611M/sec    760,756,011,483      
> 0.116GHz        672,105,046,261      0.84

numprocs=2
> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   164.052510134   1,561,521,537,336     237.972M/sec    1,404,761,167,120    
> 0.214GHz        715,274,337,015      0.51
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0   
>   421.580487642   2,756,486,952,728     163.449M/sec    1,387,708,033,752    
> 0.082GHz        990,478,650,874      0.72

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   192.151682414   1,526,440,715,456     198.603M/sec    1,037,135,756,007    
> 0.135GHz        802,754,964,096      0.76
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0   
>   169.854206542   1,333,619,626,854     196.282M/sec    1,036,261,531,134    
> 0.153GHz        666,052,333,591      0.64

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   242.648245159   1,782,637,416,163     183.629M/sec    1,463,696,313,881    
> 0.151GHz        1,000,100,694,932    0.69
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1   
>   305.078100578   1,970,042,289,192     161.445M/sec    1,505,706,462,812    
> 0.123GHz        954,963,240,648      0.62

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   188.772193248   1,418,274,870,697     187.803M/sec    923,133,958,500      
> 0.122GHz        799,212,291,243      0.92
> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   166.295223626   1,290,699,256,763     194.044M/sec    857,873,391,283      
> 0.129GHz        761,338,026,415      0.89

numprocs=4
> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   455.096916715   2,808,715,616,077     154.293M/sec    1,366,660,063,053    
> 0.075GHz        888,512,073,477      0.66
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0   
>   334.932246893   2,366,388,662,460     176.628M/sec    1,216,049,589,993    
> 0.091GHz        796,698,831,717      0.68

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   256.156100686   2,407,922,637,215     235.003M/sec    1,133,311,037,956    
> 0.111GHz        748,666,206,805      0.65
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0   
>   161.697270285   1,866,036,713,483     288.576M/sec    1,068,181,502,433    
> 0.165GHz        739,559,279,008      0.70

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   215.255015340   1,977,578,120,924     229.676M/sec    1,461,504,758,029    
> 0.170GHz        1,005,270,838,642    0.68
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1   
>   231.440889430   1,965,389,749,057     212.391M/sec    1,407,927,406,358    
> 0.152GHz        997,199,361,968      0.72

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   158.262790654   1,720,443,307,097     271.769M/sec    1,004,079,045,479    
> 0.159GHz        826,905,592,751      0.84
> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   214.433248700   2,232,198,239,769     260.300M/sec    1,073,334,918,389    
> 0.125GHz        861,540,079,120      0.80

I would say that it seems to help concurrent cases without cache
control, but not particularly reliably so. At higher concurrency it
seems to hurt with cache control, not sure I undstand why.


I was at first confused why 128kb write sizes hurt (128kb is probably on
the higher end of useful, but I wanted to have see a more extreme
difference):

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc
> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   220.210155081   1,569,524,602,961     178.188M/sec    1,363,686,761,705    
> 0.155GHz        833,345,334,408      0.68
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=0 
>   644.521613661   3,688,449,404,537     143.079M/sec    2,020,128,131,309    
> 0.078GHz        961,486,630,359      0.48

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   248.430736196   1,497,048,950,014     150.653M/sec    1,226,822,167,960    
> 0.123GHz        705,950,461,166      0.54
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=0 
>   243.830464632   1,499,608,983,445     153.756M/sec    1,227,468,439,403    
> 0.126GHz        691,534,661,654      0.59

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   310.275952938   1,921,817,571,226     154.849M/sec    1,499,581,687,133    
> 0.121GHz        944,243,167,053      0.59
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=0 fadvise=1 
>   292.866419420   1,753,376,415,877     149.677M/sec    1,483,169,463,392    
> 0.127GHz        860,035,914,148      0.56

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   164.175492485   913,991,290,231       139.183M/sec    762,359,320,428      
> 0.116GHz        678,451,556,273      0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 
>   162.152397194   925,643,754,128       142.719M/sec    743,208,501,601      
> 0.115GHz        554,462,585,110      0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=0   
>   243.609959554   1,802,385,405,203     184.970M/sec    1,449,560,513,247    
> 0.149GHz        855,426,288,031      0.56
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=0 
>   211.369510165   1,558,996,898,599     184.401M/sec    1,359,343,408,200    
> 0.161GHz        766,769,036,524      0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=0   
>   230.880100449   1,328,417,418,799     143.846M/sec    1,148,924,667,393    
> 0.124GHz        723,158,246,628      0.63
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=0 
>   233.315094908   1,427,133,080,540     152.927M/sec    1,166,000,868,597    
> 0.125GHz        743,027,329,074      0.64

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=0 fadvise=1   
>   253.591234992   1,548,485,571,798     152.658M/sec    1,229,926,994,613    
> 0.121GHz        1,117,352,436,324    0.95
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=0 fadvise=1 
>   290.698155820   1,732,849,079,701     149.032M/sec    1,441,508,612,326    
> 0.124GHz        835,039,426,282      0.57

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   164.488835158   911,974,902,254       138.611M/sec    760,756,011,483      
> 0.116GHz        672,105,046,261      0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 
>   159.945462440   850,162,390,626       132.892M/sec    724,286,281,548      
> 0.113GHz        670,069,573,150      0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   164.052510134   1,561,521,537,336     237.972M/sec    1,404,761,167,120    
> 0.214GHz        715,274,337,015      0.51
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=0 
>   163.244592275   1,524,807,507,173     233.531M/sec    1,398,319,581,978    
> 0.214GHz        689,514,058,243      0.46

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   192.151682414   1,526,440,715,456     198.603M/sec    1,037,135,756,007    
> 0.135GHz        802,754,964,096      0.76
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=0 
>   231.795934322   1,731,030,267,153     186.686M/sec    1,124,935,745,020    
> 0.121GHz        736,084,922,669      0.70

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   242.648245159   1,782,637,416,163     183.629M/sec    1,463,696,313,881    
> 0.151GHz        1,000,100,694,932    0.69
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=0 fadvise=1 
>   315.564163702   1,958,199,733,216     155.128M/sec    1,405,115,546,716    
> 0.111GHz        1,000,595,890,394    0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   188.772193248   1,418,274,870,697     187.803M/sec    923,133,958,500      
> 0.122GHz        799,212,291,243      0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 
>   210.945487961   1,527,169,148,899     180.990M/sec    906,023,518,692      
> 0.107GHz        700,166,552,207      0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=0   
>   421.580487642   2,756,486,952,728     163.449M/sec    1,387,708,033,752    
> 0.082GHz        990,478,650,874      0.72
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=0 
>   161.759094088   1,468,321,054,671     226.934M/sec    1,221,167,105,510    
> 0.189GHz        735,855,415,612      0.59

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=0   
>   169.854206542   1,333,619,626,854     196.282M/sec    1,036,261,531,134    
> 0.153GHz        666,052,333,591      0.64
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=0 
>   158.578248952   1,354,770,825,277     213.586M/sec    936,436,363,752      
> 0.148GHz        654,823,079,884      0.68

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=0 fadvise=1   
>   305.078100578   1,970,042,289,192     161.445M/sec    1,505,706,462,812    
> 0.123GHz        954,963,240,648      0.62
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=0 fadvise=1 
>   274.628500801   1,792,841,068,080     163.209M/sec    1,343,398,055,199    
> 0.122GHz        996,073,874,051      0.73

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   166.295223626   1,290,699,256,763     194.044M/sec    857,873,391,283      
> 0.129GHz        761,338,026,415      0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 
>   179.140070123   1,383,595,004,328     193.095M/sec    850,299,722,091      
> 0.119GHz        706,959,617,654      0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=0   
>   455.096916715   2,808,715,616,077     154.293M/sec    1,366,660,063,053    
> 0.075GHz        888,512,073,477      0.66
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=0 
>   445.496787199   2,663,914,572,687     149.495M/sec    1,267,340,496,930    
> 0.071GHz        787,469,552,454      0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=0   
>   256.156100686   2,407,922,637,215     235.003M/sec    1,133,311,037,956    
> 0.111GHz        748,666,206,805      0.65
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=0 
>   261.866083604   2,325,884,820,091     222.043M/sec    1,094,814,208,219    
> 0.105GHz        649,479,233,453      0.57

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=0 fadvise=1   
>   215.255015340   1,977,578,120,924     229.676M/sec    1,461,504,758,029    
> 0.170GHz        1,005,270,838,642    0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=0 fadvise=1 
>   172.963505544   1,717,387,683,260     248.228M/sec    1,356,381,335,831    
> 0.196GHz        822,256,638,370      0.58

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   158.262790654   1,720,443,307,097     271.769M/sec    1,004,079,045,479    
> 0.159GHz        826,905,592,751      0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 
>   157.934678897   1,650,503,807,778     261.266M/sec    970,705,561,971      
> 0.154GHz        637,953,927,131      0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=0   
>   334.932246893   2,366,388,662,460     176.628M/sec    1,216,049,589,993    
> 0.091GHz        796,698,831,717      0.68
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=0 
>   225.623143601   1,804,402,820,599     199.938M/sec    1,086,394,788,362    
> 0.120GHz        656,392,112,807      0.62

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=0   
>   161.697270285   1,866,036,713,483     288.576M/sec    1,068,181,502,433    
> 0.165GHz        739,559,279,008      0.70
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=0 
>   157.930900998   1,797,506,082,342     284.548M/sec    1,001,509,813,741    
> 0.159GHz        644,107,150,289      0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=0 fadvise=1   
>   231.440889430   1,965,389,749,057     212.391M/sec    1,407,927,406,358    
> 0.152GHz        997,199,361,968      0.72
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=0 fadvise=1 
>   165.772265335   1,805,895,001,689     272.353M/sec    1,514,173,918,970    
> 0.228GHz        823,435,044,810      0.54

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   214.433248700   2,232,198,239,769     260.300M/sec    1,073,334,918,389    
> 0.125GHz        861,540,079,120      0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 
>   187.664764448   1,964,118,348,429     261.660M/sec    978,060,510,880      
> 0.130GHz        668,316,194,988      0.67

It's pretty clear that the larger write block size can hurt quite
badly. I was somewhat confused by this at first, but after thinking
about it for a while longer it actually makes sense: For the OS to
finish an 8k write it needs to find two free pagecache pages. For an
128k write it needs to find 32. Which means that it's much more likely
that kernel threads and the writes are going to fight over locks /
cachelines: In the 8k page it's quite likely that ofen the kernel
threads will do so while the memcpy() from userland is happening, but
that's less the case with 32 pages that need to be acquired before the
memcpy() can happen.

With cache control that problem doesn't exist, which is why the larger
block size is beneficial:

> test                                                                          
>   time            ref_cycles_tot        ref_cycles_sec  cycles_tot           
> cycles_sec      instructions_tot      ipc

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   164.175492485   913,991,290,231       139.183M/sec    762,359,320,428      
> 0.116GHz        678,451,556,273      0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=0 sfr=1 fadvise=1 
>   162.152397194   925,643,754,128       142.719M/sec    743,208,501,601      
> 0.115GHz        554,462,585,110      0.70

> numprocs=1 filesize=429496729600 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   164.488835158   911,974,902,254       138.611M/sec    760,756,011,483      
> 0.116GHz        672,105,046,261      0.84
> numprocs=1 filesize=429496729600 blocksize=131072 fallocate=1 sfr=1 fadvise=1 
>   159.945462440   850,162,390,626       132.892M/sec    724,286,281,548      
> 0.113GHz        670,069,573,150      0.90

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   188.772193248   1,418,274,870,697     187.803M/sec    923,133,958,500      
> 0.122GHz        799,212,291,243      0.92
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=0 sfr=1 fadvise=1 
>   210.945487961   1,527,169,148,899     180.990M/sec    906,023,518,692      
> 0.107GHz        700,166,552,207      0.80

> numprocs=2 filesize=214748364800 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   166.295223626   1,290,699,256,763     194.044M/sec    857,873,391,283      
> 0.129GHz        761,338,026,415      0.89
> numprocs=2 filesize=214748364800 blocksize=131072 fallocate=1 sfr=1 fadvise=1 
>   179.140070123   1,383,595,004,328     193.095M/sec    850,299,722,091      
> 0.119GHz        706,959,617,654      0.83

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=0 sfr=1 fadvise=1   
>   158.262790654   1,720,443,307,097     271.769M/sec    1,004,079,045,479    
> 0.159GHz        826,905,592,751      0.84
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=0 sfr=1 fadvise=1 
>   157.934678897   1,650,503,807,778     261.266M/sec    970,705,561,971      
> 0.154GHz        637,953,927,131      0.66

> numprocs=4 filesize=107374182400 blocksize=8192 fallocate=1 sfr=1 fadvise=1   
>   214.433248700   2,232,198,239,769     260.300M/sec    1,073,334,918,389    
> 0.125GHz        861,540,079,120      0.80
> numprocs=4 filesize=107374182400 blocksize=131072 fallocate=1 sfr=1 fadvise=1 
>   187.664764448   1,964,118,348,429     261.660M/sec    978,060,510,880      
> 0.130GHz        668,316,194,988      0.67

Note how especially in the first few cases the total number of
instructions required is improved (although due to the way I did the
perf stat the sampling error is pretty large).


I haven't run that test yest, but after looking at all this I would bet
that reducing the block size to 4kb (i.e. a single os/hw page) would
help the no cache control case significantly, in particular in the
concurrent case.

And conversely, I'd expect that the CPU efficiency will be improved by
larger block size for the cache control case for just about any
realistic block size.


I'd love to have a faster storage available (faster NVMes, or multiple
ones I can use for benchmarking) to see what the cutoff point for
actually benefiting from concurrency is.


Also worthwhile to note that even the "best case" from a CPU usage point
here absolutely *pales* against using direct-IO. It's not an
apples/apples comparison, but comparing buffered io using
write_and_fsync, and unbuffered io using fio:

128KiB blocksize:

write_and_fsync:
echo 3 |sudo tee /proc/sys/vm/drop_caches && /usr/bin/time perf stat -a -e 
cpu-clock,ref-cycles,cycles,instructions /tmp/write_and_fsync --blocksize 
$((128*1024)) --sync_file_range=1 --fallocate=1 --fadvise=1 --sequential=0 
--filesize=$((400*1024*1024*1024)) /srv/dev/bench/test1

 Performance counter stats for 'system wide':

      6,377,903.65 msec cpu-clock                 #   39.999 CPUs utilized
   628,014,590,200      ref-cycles                #   98.467 M/sec
   634,468,623,514      cycles                    #    0.099 GHz
   795,771,756,320      instructions              #    1.25  insn per cycle

     159.451492209 seconds time elapsed

fio:
rm -f /srv/dev/bench/test* && echo 3 |sudo tee /proc/sys/vm/drop_caches && 
/usr/bin/time perf stat -a -e cpu-clock,ref-cycles,cycles,instructions fio 
--name=test --iodepth=512 --iodepth_low=8 --iodepth_batch_submit=8 
--iodepth_batch_complete_min=8 --iodepth_batch_complete_max=128 
--ioengine=libaio --rw=write --bs=128k --filesize=$((400*1024*1024*1024))  
--direct=1 --numjobs=1

 Performance counter stats for 'system wide':

      6,313,522.71 msec cpu-clock                 #   39.999 CPUs utilized
   458,476,185,800      ref-cycles                #   72.618 M/sec
   196,148,015,054      cycles                    #    0.031 GHz
   158,921,457,853      instructions              #    0.81  insn per cycle

     157.842080440 seconds time elapsed

CPU usage for fio most of the time was around 98% for write_and_fsync
and 40% for fio.

I.e. system-wide CPUs were active 0.73x the time, and 0.2x as many
instructions had to be executed in the DIO case.

Greetings,

Andres Freund


Reply via email to