Re: zram: per-cpu compression streams
On (04/27/16 17:54), Sergey Senozhatsky wrote:
> #jobs4
>          8 streams       per-cpu
> READ:    19948MB/s       20013MB/s
> READ:    17732MB/s       17479MB/s
> WRITE:   630690KB/s      495078KB/s
> WRITE:   1843.2MB/s      2226.9MB/s
> READ:    1603.4MB/s      1846.8MB/s
> WRITE:   1599.4MB/s      1842.2MB/s
> READ:    1547.7MB/s      1740.7MB/s
> WRITE:   1549.2MB/s      1742.4MB/s
>
> jobs4
> stalled-cycles-frontend  265,519,049,536 ( 64.46%)   221,049,841,649 ( 61.81%)
> stalled-cycles-backend   146,538,881,296 ( 35.57%)   113,774,053,039 ( 31.82%)
> instructions             298,241,854,695 (    0.72)  278,000,866,874 (    0.78)
> branches                  59,531,800,053 ( 400.919)   55,096,944,109 ( 427.816)
> branch-misses                285,108,083 (   0.48%)      260,972,185 (   0.47%)
> seconds elapsed           47.816933840                52.966896478

per-cpu in general looks better in this test (jobs4): fewer stalls, fewer
branches, fewer branch misses, and better fio speeds (except for
WRITE: 630690KB/s vs 495078KB/s). the system was under pressure, so it's
quite possible that it took more time to kill the process, which is why the
elapsed time is in favor of the 8 streams test.

	-ss
Re: zram: per-cpu compression streams
Hello,

more tests. I did only 8 streams vs per-cpu this time. the changes to the
test are:
-- mem-hogger now pre-faults pages in parallel with fio
-- mem-hogger alloc size increased from 3GB to 4GB. the system couldn't
   survive a 4GB/4GB zram(buffer_compress_percentage=11)/mem-hogger split
   (OOM), so I executed the 3GB/4GB test (close to the system's OOM edge).

-- 4 GB x86_64
-- 3 GB zram, lzo

first, the mm_stat. 8 streams (base kernel):

3221225472 3221225472 32212254720 322122956800     < 2752460/     0>
3221225472 3221225472 32212254720 322123366400     < 5504124/     0>
3221225472 2912157607 29528023040 29528268800 81   < 8253369/     0>
3221225472 2893479936 28991201280 28991365120 147  <11003056/     0>
3221217280 2886040814 28991037440 28991283200 26   <13748450/     0>
3221225472 2880045056 28856934400 28857180160 180  <16503120/     0>
3221213184 2877431364 28837560320 28838092800 132  <19259891/     0>
3221225472 2873229312 28760965120 28761333760 16   <22016512/     0>
3221213184 2870728008 28716933120 28717260800 24   <24768909/     0>
2899095552 2899095552 28990955520 2899132416786430 <27523600/     0>

per-cpu:

3221225472 3221225472 32212254720 322122956800     < 2752460/  8180>
3221225472 3221225472 32212254720 322123366400     < 5504124/ 10523>
3221225472 2912157607 29528023040 29528145920 117  < 8253369/  9451>
3221225472 2893479936 28991201280 28991365120 129  <11003056/  9395>
3221217280 2886040814 28991037440 28991283200 51   <13748450/ 10879>
3221225472 2880045056 28856934400 28857180160 126  <16503120/ 10300>
3221213184 2877431364 28837724160 28838010880 252  <19259891/ 10509>
3221225472 2873229312 28761006080 28761333760 14   <22016512/ 11081>
3221213184 2870728008 28716933120 28717301760 54   <24768909/ 10770>
2899095552 2899095552 28990955520 2899136512786430 <27523600/ 10231>

mem-hogger pre-fault times, 8 streams (base kernel):

[431] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f3f5d38a010 <+ 6.031550428>
[470] single-alloc: INFO: Allocated 0x1 bytes at address 0x7fa29d414010 <+ 5.242295692>
[514] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f4a7eac8010 <+ 5.485469454>
[563] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f07da76b010 <+ 5.563647658>
[619] single-alloc: INFO: Allocated 0x1 bytes at address 0x7ff5efc26010 <+ 5.516866208>
[681] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f8fb896d010 <+ 5.535275748>
[751] single-alloc: INFO: Allocated 0x1 bytes at address 0x7fb2ac6fa010 <+ 4.594626366>
[825] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f355f9a0010 <+ 5.075849029>
[905] single-alloc: INFO: Allocated 0x1 bytes at address 0x7feb16715010 <+ 4.696363680>
[991] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f3a1b9f4010 <+ 5.292365453>

per-cpu:

[413] single-alloc: INFO: Allocated 0x1 bytes at address 0x7fe8058f5010 <+ 5.513944292>
[451] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f65fe753010 <+ 4.742384977>
[494] single-alloc: INFO: Allocated 0x1 bytes at address 0x7fb99a05c010 <+ 5.394711696>
[542] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f0d61c81010 <+ 5.021011664>
[598] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f9abdeb6010 <+ 5.094722019>
[660] single-alloc: INFO: Allocated 0x1 bytes at address 0x7fb192ae9010 <+ 4.943961060>
[728] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f7313aeb010 <+ 5.437872456>
[802] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f25ffdeb010 <+ 5.422829590>
[881] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f60daa8e010 <+ 4.806425351>
[970] single-alloc: INFO: Allocated 0x1 bytes at address 0x7f384cf04010 <+ 4.982513395>

so, the pre-fault time range is somewhat big: for example, from 4.696363680
to 6.031550428 seconds.

fio

         8 streams       per-cpu
===
#jobs1
READ:    2507.8MB/s      2526.4MB/s
READ:    2043.1MB/s      1970.6MB/s
WRITE:   127100KB/s      139160KB/s
WRITE:   724488KB/s      733440KB/s
READ:    534624KB/s      540967KB/s
WRITE:   534569KB/s      540912KB/s
READ:    471165KB/s      477459KB/s
WRITE:   471233KB/s      477527KB/s
#jobs2
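The trailing "< writes/re-compressions>" debug field in the mm_stat dumps
above can be summarized mechanically. A small sketch, under an assumption
the thread does not state explicitly: the first number is a cumulative write
count across the numjobs runs, and the second is the per-run re-compression
count from the test-only debug patch.

```python
# Compare the "< writes/re-compressions>" debug fields of the per-cpu
# mm_stat dump above (the 8-streams rows are all zero).  Assumption: first
# number is cumulative writes, second is re-compressions in that run.
import re

percpu = """\
< 2752460/8180> < 5504124/ 10523> < 8253369/9451> <11003056/9395>
<13748450/ 10879> <16503120/ 10300> <19259891/ 10509> <22016512/ 11081>
<24768909/ 10770> <27523600/ 10231>
"""

pairs = [(int(a), int(b))
         for a, b in re.findall(r'<\s*(\d+)\s*/\s*(\d+)\s*>', percpu)]

prev = 0
for writes, recomp in pairs:
    delta = writes - prev      # writes performed in this run alone
    print(f"{100.0 * recomp / delta:.2f}% re-compressed")
    prev = writes
# under this reading, every run re-compresses well under 1% of its writes
```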
Re: zram: per-cpu compression streams
On (04/27/16 16:55), Minchan Kim wrote:
[..]
> > > Could you test a concurrent mem hogger with fio, rather than a
> > > pre-fault before the fio test, in the next submit?
> >
> > this test will not prove anything, unfortunately. I performed it;
> > and it's impossible to guarantee even remotely stable results.
> > the mem-hogger process can spend from 41 to 81 seconds on the pre-fault,
> > so I'm quite sceptical about the actual value of this test.
> >
> > > > considering buffer_compress_percentage=11, the box was under somewhat
> > > > heavy pressure.
> > > >
> > > > now, the results
> > >
> > > Yep, even the recompression case is faster than the old code, but I
> > > want to see a heavier memory pressure case and the ratio I mentioned
> > > above.
> >
> > I did quite heavy testing over the last 7 days, with numerous OOM kills
> > and OOM panics.
>
> Okay, I think it has been tested enough that it's worth merging to see the
> result. Please send a formal patch which has the recompression stat. ;-)

correction: those 41-81s spikes in mem-hogger were observed under a
different scenario: 10GB zram with a 6GB mem-hogger on a 4GB system.

I'll do another round of tests (with parallel mem-hogger pre-fault and a
4GB/4GB zram/mem-hogger split) and collect the numbers that you asked for.

thanks!

	-ss
Re: zram: per-cpu compression streams
On Wed, Apr 27, 2016 at 04:43:35PM +0900, Sergey Senozhatsky wrote:
> Hello,
>
> On (04/27/16 16:29), Minchan Kim wrote:
> [..]
> > > the test:
> > >
> > > -- 4 GB x86_64 box
> > > -- zram 3GB, lzo
> > > -- mem-hogger pre-faults 3GB of pages before the fio test
> > > -- fio test has been modified to have an 11% compression ratio (to
> > >    increase the chances of re-compressions)
> >
> > Could you test a concurrent mem hogger with fio, rather than a pre-fault
> > before the fio test, in the next submit?
>
> this test will not prove anything, unfortunately. I performed it;
> and it's impossible to guarantee even remotely stable results.
> the mem-hogger process can spend from 41 to 81 seconds on the pre-fault,
> so I'm quite sceptical about the actual value of this test.
>
> > > considering buffer_compress_percentage=11, the box was under somewhat
> > > heavy pressure.
> > >
> > > now, the results
> >
> > Yep, even the recompression case is faster than the old code, but I want
> > to see a heavier memory pressure case and the ratio I mentioned above.
>
> I did quite heavy testing over the last 7 days, with numerous OOM kills
> and OOM panics.

Okay, I think it has been tested enough that it's worth merging to see the
result. Please send a formal patch which has the recompression stat. ;-)

Thanks.
Re: zram: per-cpu compression streams
Hello,

On (04/27/16 16:29), Minchan Kim wrote:
[..]
> > the test:
> >
> > -- 4 GB x86_64 box
> > -- zram 3GB, lzo
> > -- mem-hogger pre-faults 3GB of pages before the fio test
> > -- fio test has been modified to have an 11% compression ratio (to
> >    increase the chances of re-compressions)
>
> Could you test a concurrent mem hogger with fio, rather than a pre-fault
> before the fio test, in the next submit?

this test will not prove anything, unfortunately. I performed it;
and it's impossible to guarantee even remotely stable results.
the mem-hogger process can spend from 41 to 81 seconds on the pre-fault,
so I'm quite sceptical about the actual value of this test.

> > considering buffer_compress_percentage=11, the box was under somewhat
> > heavy pressure.
> >
> > now, the results
>
> Yep, even the recompression case is faster than the old code, but I want
> to see a heavier memory pressure case and the ratio I mentioned above.

I did quite heavy testing over the last 7 days, with numerous OOM kills
and OOM panics.

	-ss
Re: zram: per-cpu compression streams
Hello Sergey,

On Tue, Apr 26, 2016 at 08:23:05PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
>
> On (04/19/16 17:00), Minchan Kim wrote:
> [..]
> > I'm convinced now with your data. Super thanks!
> > However, as you know, we need data on how bad it is under heavy memory
> > pressure. Maybe you can test it with fio and a background memory hogger,
>
> it's really hard to produce stable test results when the system
> is under mem pressure.
>
> first, I modified zram to export the re-compression number
> (put the cpu stream and re-try the handle allocation)
>
> mm_stat for numjobs{1..10}. the number of re-compressions is in "< NUM>"
> format
>
> 3221225472 3221225472 32212254720 322122956800     < 6421>
> 3221225472 3221225472 32212254720 322123366400     < 6998>
> 3221225472 2912157607 29528023040 29528145920 84   < 7271>
> 3221225472 2893479936 28991201280 28991365120 156  < 8260>
> 3221217280 2886040814 28990996480 28991283200 78   < 8297>
> 3221225472 2880045056 28856934400 28857180160 54   < 7794>
> 3221213184 2877431364 28837560320 28838010880 144  < 7336>
> 3221225472 2873229312 28760965120 28761333760 28   < 8699>
> 3221213184 2870728008 28716933120 28717301760 30   < 8189>
> 2899095552 2899095552 28990955520 2899136512786430 < 7485>

It would be great to see the below ratio for each test:

    1-compression : 2(re)-compression

> as we can see, the number of re-compressions can vary from 6421 to 8699.
>
> the test:
>
> -- 4 GB x86_64 box
> -- zram 3GB, lzo
> -- mem-hogger pre-faults 3GB of pages before the fio test
> -- fio test has been modified to have an 11% compression ratio (to
>    increase the chances of re-compressions)

Could you test a concurrent mem hogger with fio, rather than a pre-fault
before the fio test, in the next submit?

> -- buffer_compress_percentage=11
> -- scramble_buffers=0
>
> considering buffer_compress_percentage=11, the box was under somewhat
> heavy pressure.
> now, the results

Yep, even the recompression case is faster than the old code, but I want to
see a heavier memory pressure case and the ratio I mentioned above.
If the result is still good, please send a public patch with the numbers.

Thanks for looking into this, Sergey!

> fio stats
>
>          4 streams       8 streams       per cpu
> ===
> #jobs1
> READ:    2411.4MB/s      2430.4MB/s      2440.4MB/s
> READ:    2094.8MB/s      2002.7MB/s      2034.5MB/s
> WRITE:   141571KB/s      140334KB/s      143542KB/s
> WRITE:   712025KB/s      706111KB/s      745256KB/s
> READ:    531014KB/s      525250KB/s      537547KB/s
> WRITE:   530960KB/s      525197KB/s      537492KB/s
> READ:    473577KB/s      470320KB/s      476880KB/s
> WRITE:   473645KB/s      470387KB/s      476948KB/s
> #jobs2
> READ:    7897.2MB/s      8031.4MB/s      7968.9MB/s
> READ:    6864.9MB/s      6803.2MB/s      6903.4MB/s
> WRITE:   321386KB/s      314227KB/s      313101KB/s
> WRITE:   1275.3MB/s      1245.6MB/s      1383.5MB/s
> READ:    1035.5MB/s      1021.9MB/s      1098.4MB/s
> WRITE:   1035.6MB/s      1021.1MB/s      1098.6MB/s
> READ:    972014KB/s      952321KB/s      987.66MB/s
> WRITE:   969792KB/s      950144KB/s      985.40MB/s
> #jobs3
> READ:    13260MB/s       13260MB/s       13222MB/s
> READ:    11636MB/s       11636MB/s       11755MB/s
> WRITE:   511500KB/s      507730KB/s      504959KB/s
> WRITE:   1646.1MB/s      1673.9MB/s      1755.5MB/s
> READ:    1389.5MB/s      1387.2MB/s      1479.6MB/s
> WRITE:   1387.6MB/s      1385.3MB/s      1477.4MB/s
> READ:    1286.8MB/s      1289.1MB/s      1377.3MB/s
> WRITE:   1284.8MB/s      1287.1MB/s      1374.9MB/s
> #jobs4
> READ:    19851MB/s       20244MB/s       20344MB/s
> READ:    17732MB/s       17835MB/s       18097MB/s
> WRITE:   667776KB/s      655599KB/s      693464KB/s
> WRITE:   2041.2MB/s      2072.6MB/s      2474.1MB/s
> READ:    1770.1MB/s      1781.7MB/s      2035.5MB/s
> WRITE:   1765.8MB/s      1777.3MB/s      2030.5MB/s
> READ:    1641.6MB/s      1672.4MB/s      1892.5MB/s
> WRITE:   1643.2MB/s      1674.2MB/s      1894.4MB/s
> #jobs5
> READ:    19468MB/s       1848
Re: zram: per-cpu compression streams
Hello Minchan,

On (04/19/16 17:00), Minchan Kim wrote:
[..]
> I'm convinced now with your data. Super thanks!
> However, as you know, we need data on how bad it is under heavy memory
> pressure. Maybe you can test it with fio and a background memory hogger,

it's really hard to produce stable test results when the system
is under mem pressure.

first, I modified zram to export the re-compression number
(put the cpu stream and re-try the handle allocation)

mm_stat for numjobs{1..10}. the number of re-compressions is in "< NUM>"
format

3221225472 3221225472 32212254720 322122956800     < 6421>
3221225472 3221225472 32212254720 322123366400     < 6998>
3221225472 2912157607 29528023040 29528145920 84   < 7271>
3221225472 2893479936 28991201280 28991365120 156  < 8260>
3221217280 2886040814 28990996480 28991283200 78   < 8297>
3221225472 2880045056 28856934400 28857180160 54   < 7794>
3221213184 2877431364 28837560320 28838010880 144  < 7336>
3221225472 2873229312 28760965120 28761333760 28   < 8699>
3221213184 2870728008 28716933120 28717301760 30   < 8189>
2899095552 2899095552 28990955520 2899136512786430 < 7485>

as we can see, the number of re-compressions can vary from 6421 to 8699.

the test:

-- 4 GB x86_64 box
-- zram 3GB, lzo
-- mem-hogger pre-faults 3GB of pages before the fio test
-- fio test has been modified to have an 11% compression ratio (to increase
   the chances of re-compressions)
-- buffer_compress_percentage=11
-- scramble_buffers=0

considering buffer_compress_percentage=11, the box was under somewhat
heavy pressure.
now, the results

fio stats

         4 streams       8 streams       per cpu
===
#jobs1
READ:    2411.4MB/s      2430.4MB/s      2440.4MB/s
READ:    2094.8MB/s      2002.7MB/s      2034.5MB/s
WRITE:   141571KB/s      140334KB/s      143542KB/s
WRITE:   712025KB/s      706111KB/s      745256KB/s
READ:    531014KB/s      525250KB/s      537547KB/s
WRITE:   530960KB/s      525197KB/s      537492KB/s
READ:    473577KB/s      470320KB/s      476880KB/s
WRITE:   473645KB/s      470387KB/s      476948KB/s
#jobs2
READ:    7897.2MB/s      8031.4MB/s      7968.9MB/s
READ:    6864.9MB/s      6803.2MB/s      6903.4MB/s
WRITE:   321386KB/s      314227KB/s      313101KB/s
WRITE:   1275.3MB/s      1245.6MB/s      1383.5MB/s
READ:    1035.5MB/s      1021.9MB/s      1098.4MB/s
WRITE:   1035.6MB/s      1021.1MB/s      1098.6MB/s
READ:    972014KB/s      952321KB/s      987.66MB/s
WRITE:   969792KB/s      950144KB/s      985.40MB/s
#jobs3
READ:    13260MB/s       13260MB/s       13222MB/s
READ:    11636MB/s       11636MB/s       11755MB/s
WRITE:   511500KB/s      507730KB/s      504959KB/s
WRITE:   1646.1MB/s      1673.9MB/s      1755.5MB/s
READ:    1389.5MB/s      1387.2MB/s      1479.6MB/s
WRITE:   1387.6MB/s      1385.3MB/s      1477.4MB/s
READ:    1286.8MB/s      1289.1MB/s      1377.3MB/s
WRITE:   1284.8MB/s      1287.1MB/s      1374.9MB/s
#jobs4
READ:    19851MB/s       20244MB/s       20344MB/s
READ:    17732MB/s       17835MB/s       18097MB/s
WRITE:   667776KB/s      655599KB/s      693464KB/s
WRITE:   2041.2MB/s      2072.6MB/s      2474.1MB/s
READ:    1770.1MB/s      1781.7MB/s      2035.5MB/s
WRITE:   1765.8MB/s      1777.3MB/s      2030.5MB/s
READ:    1641.6MB/s      1672.4MB/s      1892.5MB/s
WRITE:   1643.2MB/s      1674.2MB/s      1894.4MB/s
#jobs5
READ:    19468MB/s       18484MB/s       18439MB/s
READ:    17594MB/s       17757MB/s       17716MB/s
WRITE:   843266KB/s      859627KB/s      867928KB/s
WRITE:   1927.1MB/s      2041.8MB/s      2168.9MB/s
READ:    1718.6MB/s      1771.7MB/s      1963.5MB/s
WRITE:   1712.7MB/s      1765.6MB/s      1956.8MB/s
READ:    1705.3MB/s      1663.6MB/s      1767.3MB/s
WRITE:   1704.3MB/s      1662.6MB/s      1766.2MB/s
#jobs6
READ:    21583MB/s       21685MB/s       21483MB/s
READ:    19160MB/s       18432MB/s       18618MB/s
WRITE:   986276KB/s      1004.2MB/s      981.11MB/
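The "6421 to 8699" spread quoted above can be extracted mechanically from
the mm_stat dump; a small helper (the mm_stat lines are copied from the
dump above, and the "< NUM>" field is the test-only re-compression counter):

```python
# Extract and summarize the "< NUM>" re-compression counters from the
# mm_stat dump above, reproducing the 6421..8699 range stated in the text.
import re

dump = """\
3221225472 3221225472 32212254720 322122956800 < 6421>
3221225472 3221225472 32212254720 322123366400 < 6998>
3221225472 2912157607 29528023040 29528145920 84 < 7271>
3221225472 2893479936 28991201280 28991365120 156 < 8260>
3221217280 2886040814 28990996480 28991283200 78 < 8297>
3221225472 2880045056 28856934400 28857180160 54 < 7794>
3221213184 2877431364 28837560320 28838010880 144 < 7336>
3221225472 2873229312 28760965120 28761333760 28 < 8699>
3221213184 2870728008 28716933120 28717301760 30 < 8189>
2899095552 2899095552 28990955520 2899136512786430 < 7485>
"""

recompressions = [int(m) for m in re.findall(r'<\s*(\d+)\s*>', dump)]
print(min(recompressions), max(recompressions))  # 6421 8699
```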
Re: zram: per-cpu compression streams
Hello Minchan,

On (04/19/16 17:00), Minchan Kim wrote:
> Great!
>
> So, based on your experiment, the reason I couldn't see such a huge win
> in my machine is the cache size difference (i.e., yours is twice mine,
> IIRC) and my perf stat didn't show such a big difference.
> If I have time, I will test it on a bigger machine.

quite possible it's due to the cache size.

[..]
> > NOTE:
> > -- fio does not seem to attempt to write more than the disk size to the
> >    device, so the test doesn't include the 're-compression path'.
>
> I'm convinced now with your data. Super thanks!
> However, as you know, we need data on how bad it is under heavy memory
> pressure. Maybe you can test it with fio and a background memory hogger,

yeah, sure, will work on it.

> Thanks for the test, Sergey!

thanks!

	-ss
Re: zram: per-cpu compression streams
On Mon, Apr 18, 2016 at 04:57:58PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> sorry, it took me so long to return back to testing.
>
> I collected extended stats (perf), just like you requested.
> - 3G zram, lzo; 4 CPU x86_64 box.
> - fio with perf stat
>
>          4 streams       8 streams       per-cpu
> ===
> #jobs1
> READ:    2520.1MB/s      2566.5MB/s      2491.5MB/s
> READ:    2102.7MB/s      2104.2MB/s      2091.3MB/s
> WRITE:   1355.1MB/s      1320.2MB/s      1378.9MB/s
> WRITE:   1103.5MB/s      1097.2MB/s      1122.5MB/s
> READ:    434013KB/s      435153KB/s      439961KB/s
> WRITE:   433969KB/s      435109KB/s      439917KB/s
> READ:    403166KB/s      405139KB/s      403373KB/s
> WRITE:   403223KB/s      405197KB/s      403430KB/s
> #jobs2
> READ:    7958.6MB/s      8105.6MB/s      8073.7MB/s
> READ:    6864.9MB/s      6989.8MB/s      7021.8MB/s
> WRITE:   2438.1MB/s      2346.9MB/s      3400.2MB/s
> WRITE:   1994.2MB/s      1990.3MB/s      2941.2MB/s
> READ:    981504KB/s      973906KB/s      1018.8MB/s
> WRITE:   981659KB/s      974060KB/s      1018.1MB/s
> READ:    937021KB/s      938976KB/s      987250KB/s
> WRITE:   934878KB/s      936830KB/s      984993KB/s
> #jobs3
> READ:    13280MB/s       13553MB/s       13553MB/s
> READ:    11534MB/s       11785MB/s       11755MB/s
> WRITE:   3456.9MB/s      3469.9MB/s      4810.3MB/s
> WRITE:   3029.6MB/s      3031.6MB/s      4264.8MB/s
> READ:    1363.8MB/s      1362.6MB/s      1448.9MB/s
> WRITE:   1361.9MB/s      1360.7MB/s      1446.9MB/s
> READ:    1309.4MB/s      1310.6MB/s      1397.5MB/s
> WRITE:   1307.4MB/s      1308.5MB/s      1395.3MB/s
> #jobs4
> READ:    20244MB/s       20177MB/s       20344MB/s
> READ:    17886MB/s       17913MB/s       17835MB/s
> WRITE:   4071.6MB/s      4046.1MB/s      6370.2MB/s
> WRITE:   3608.9MB/s      3576.3MB/s      5785.4MB/s
> READ:    1824.3MB/s      1821.6MB/s      1997.5MB/s
> WRITE:   1819.8MB/s      1817.4MB/s      1992.5MB/s
> READ:    1765.7MB/s      1768.3MB/s      1937.3MB/s
> WRITE:   1767.5MB/s      1769.1MB/s      1939.2MB/s
> #jobs5
> READ:    18663MB/s       18986MB/s       18823MB/s
> READ:    16659MB/s       16605MB/s       16954MB/s
> WRITE:   3912.4MB/s      3888.7MB/s      6126.9MB/s
> WRITE:   3506.4MB/s      3442.5MB/s      5519.3MB/s
> READ:    1798.2MB/s      1746.5MB/s      1935.8MB/s
> WRITE:   1792.7MB/s      1740.7MB/s      1929.1MB/s
> READ:    1727.6MB/s      1658.2MB/s      1917.3MB/s
> WRITE:   1726.5MB/s      1657.2MB/s      1916.6MB/s
> #jobs6
> READ:    21017MB/s       20922MB/s       21162MB/s
> READ:    19022MB/s       19140MB/s       18770MB/s
> WRITE:   3968.2MB/s      4037.7MB/s      6620.8MB/s
> WRITE:   3643.5MB/s      3590.2MB/s      6027.5MB/s
> READ:    1871.8MB/s      1880.5MB/s      2049.9MB/s
> WRITE:   1867.8MB/s      1877.2MB/s      2046.2MB/s
> READ:    1755.8MB/s      1710.3MB/s      1964.7MB/s
> WRITE:   1750.5MB/s      1705.9MB/s      1958.8MB/s
> #jobs7
> READ:    21103MB/s       20677MB/s       21482MB/s
> READ:    18522MB/s       18379MB/s       19443MB/s
> WRITE:   4022.5MB/s      4067.4MB/s      6755.9MB/s
> WRITE:   3691.7MB/s      3695.5MB/s      5925.6MB/s
> READ:    1841.5MB/s      1933.9MB/s      2090.5MB/s
> WRITE:   1842.7MB/s      1935.3MB/s      2091.9MB/s
> READ:    1832.4MB/s      1856.4MB/s      1971.5MB/s
> WRITE:   1822.3MB/s      1846.2MB/s      1960.6MB/s
> #jobs8
> READ:    20463MB/s       20194MB/s       20862MB/s
> READ:    18178MB/s       17978MB/s       18299MB/s
> WRITE:   4085.9MB/s      4060.2MB/s      7023.8MB/s
> WRITE:   3776.3MB/s      3737.9MB/s      6278.2MB/s
> READ:    1957.6MB/s      1944.4MB/s      2109.5MB/s
> WRITE:   1959.2MB/s      1946.2MB/s      2111.4MB/s
> READ:    1900.6MB/s      1885.7MB/s      2082.1MB/s
> WRITE:   1896.2MB/s      1881.4MB/s      2078.3MB/s
> #jobs9
> READ:    19692MB/s       19734MB/s       19334MB
Re: zram: per-cpu compression streams
Hello Minchan,

sorry, it took me so long to return back to testing.

I collected extended stats (perf), just like you requested.
- 3G zram, lzo; 4 CPU x86_64 box.
- fio with perf stat

         4 streams       8 streams       per-cpu
===
#jobs1
READ:    2520.1MB/s      2566.5MB/s      2491.5MB/s
READ:    2102.7MB/s      2104.2MB/s      2091.3MB/s
WRITE:   1355.1MB/s      1320.2MB/s      1378.9MB/s
WRITE:   1103.5MB/s      1097.2MB/s      1122.5MB/s
READ:    434013KB/s      435153KB/s      439961KB/s
WRITE:   433969KB/s      435109KB/s      439917KB/s
READ:    403166KB/s      405139KB/s      403373KB/s
WRITE:   403223KB/s      405197KB/s      403430KB/s
#jobs2
READ:    7958.6MB/s      8105.6MB/s      8073.7MB/s
READ:    6864.9MB/s      6989.8MB/s      7021.8MB/s
WRITE:   2438.1MB/s      2346.9MB/s      3400.2MB/s
WRITE:   1994.2MB/s      1990.3MB/s      2941.2MB/s
READ:    981504KB/s      973906KB/s      1018.8MB/s
WRITE:   981659KB/s      974060KB/s      1018.1MB/s
READ:    937021KB/s      938976KB/s      987250KB/s
WRITE:   934878KB/s      936830KB/s      984993KB/s
#jobs3
READ:    13280MB/s       13553MB/s       13553MB/s
READ:    11534MB/s       11785MB/s       11755MB/s
WRITE:   3456.9MB/s      3469.9MB/s      4810.3MB/s
WRITE:   3029.6MB/s      3031.6MB/s      4264.8MB/s
READ:    1363.8MB/s      1362.6MB/s      1448.9MB/s
WRITE:   1361.9MB/s      1360.7MB/s      1446.9MB/s
READ:    1309.4MB/s      1310.6MB/s      1397.5MB/s
WRITE:   1307.4MB/s      1308.5MB/s      1395.3MB/s
#jobs4
READ:    20244MB/s       20177MB/s       20344MB/s
READ:    17886MB/s       17913MB/s       17835MB/s
WRITE:   4071.6MB/s      4046.1MB/s      6370.2MB/s
WRITE:   3608.9MB/s      3576.3MB/s      5785.4MB/s
READ:    1824.3MB/s      1821.6MB/s      1997.5MB/s
WRITE:   1819.8MB/s      1817.4MB/s      1992.5MB/s
READ:    1765.7MB/s      1768.3MB/s      1937.3MB/s
WRITE:   1767.5MB/s      1769.1MB/s      1939.2MB/s
#jobs5
READ:    18663MB/s       18986MB/s       18823MB/s
READ:    16659MB/s       16605MB/s       16954MB/s
WRITE:   3912.4MB/s      3888.7MB/s      6126.9MB/s
WRITE:   3506.4MB/s      3442.5MB/s      5519.3MB/s
READ:    1798.2MB/s      1746.5MB/s      1935.8MB/s
WRITE:   1792.7MB/s      1740.7MB/s      1929.1MB/s
READ:    1727.6MB/s      1658.2MB/s      1917.3MB/s
WRITE:   1726.5MB/s      1657.2MB/s      1916.6MB/s
#jobs6
READ:    21017MB/s       20922MB/s       21162MB/s
READ:    19022MB/s       19140MB/s       18770MB/s
WRITE:   3968.2MB/s      4037.7MB/s      6620.8MB/s
WRITE:   3643.5MB/s      3590.2MB/s      6027.5MB/s
READ:    1871.8MB/s      1880.5MB/s      2049.9MB/s
WRITE:   1867.8MB/s      1877.2MB/s      2046.2MB/s
READ:    1755.8MB/s      1710.3MB/s      1964.7MB/s
WRITE:   1750.5MB/s      1705.9MB/s      1958.8MB/s
#jobs7
READ:    21103MB/s       20677MB/s       21482MB/s
READ:    18522MB/s       18379MB/s       19443MB/s
WRITE:   4022.5MB/s      4067.4MB/s      6755.9MB/s
WRITE:   3691.7MB/s      3695.5MB/s      5925.6MB/s
READ:    1841.5MB/s      1933.9MB/s      2090.5MB/s
WRITE:   1842.7MB/s      1935.3MB/s      2091.9MB/s
READ:    1832.4MB/s      1856.4MB/s      1971.5MB/s
WRITE:   1822.3MB/s      1846.2MB/s      1960.6MB/s
#jobs8
READ:    20463MB/s       20194MB/s       20862MB/s
READ:    18178MB/s       17978MB/s       18299MB/s
WRITE:   4085.9MB/s      4060.2MB/s      7023.8MB/s
WRITE:   3776.3MB/s      3737.9MB/s      6278.2MB/s
READ:    1957.6MB/s      1944.4MB/s      2109.5MB/s
WRITE:   1959.2MB/s      1946.2MB/s      2111.4MB/s
READ:    1900.6MB/s      1885.7MB/s      2082.1MB/s
WRITE:   1896.2MB/s      1881.4MB/s      2078.3MB/s
#jobs9
READ:    19692MB/s       19734MB/s       19334MB/s
READ:    17678MB/s       18249MB/s       17666MB/s
WRITE:   4004.7MB/s      4064.8MB/s      6990.7MB/s
WRITE:   3724.7MB/s      3
Re: zram: per-cpu compression streams
Hello Minchan,

On (04/04/16 09:27), Minchan Kim wrote:
> Hello Sergey,
>
> On Sat, Apr 02, 2016 at 12:38:29AM +0900, Sergey Senozhatsky wrote:
> > Hello Minchan,
> >
> > On (03/31/16 15:34), Sergey Senozhatsky wrote:
> > > > I tested with your suggested parameters.
> > > > In my side, the win is better compared to my previous test, but it
> > > > seems your test is so fast. IOW, the filesize is small and loops is
> > > > just 1. Please test filesize=500m loops=10 or 20.
> >
> > fio
> > - loops=10
> > - buffer_pattern=0xbadc0ffee
> >
> > zram 6G. no intel p-state, deadline IO scheduler, no lockdep (no lock
> > debugging).
>
> We are using rw_page, so the I/O scheduler is not related.

yes, agree. added it just in case.

> Anyway, I configured my machine as you said but still see a 10~20%
> enhancement. :(
> Hmm, could you post your .config?

oh, sorry, completely forgot about it. attached.

> I want to investigate why such a difference happens between our machines.
>
> The reason I want to see such a *big enhancement* in my machine is that,
> as you know, with per-cpu, zram's write path will lose its blockable
> section, and that would make upcoming features' implementation hard.

I see. well, depending on what new features are about to come in, we can
utilize the same per-cpu mechanism if we are talking about some sort of
buffers, streams, etc.

> We also should test it in a very low memory situation, so that every
> write path retries (i.e., double compression). With it, I want to see
> how much performance can drop.

one of the boxen I use has only 4G of memory, so "re-compressions" do
happen there. I can add a simple counter (just for testing purposes) to
see how often.

> If both tests (normal: huge win, low memory: small regression) are fine,
> we can go with the per-cpu approach at the cost of giving up the
> blockable section. :)

yep.

	-ss

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.6.0-rc1 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-dbg"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="swordfish"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accountin
Re: zram: per-cpu compression streams
Hello Sergey,

On Sat, Apr 02, 2016 at 12:38:29AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
>
> On (03/31/16 15:34), Sergey Senozhatsky wrote:
> > > I tested with your suggested parameters.
> > > In my side, the win is better compared to my previous test, but it
> > > seems your test is so fast. IOW, the filesize is small and loops is
> > > just 1. Please test filesize=500m loops=10 or 20.
>
> fio
> - loops=10
> - buffer_pattern=0xbadc0ffee
>
> zram 6G. no intel p-state, deadline IO scheduler, no lockdep (no lock
> debugging).

We are using rw_page, so the I/O scheduler is not related.

Anyway, I configured my machine as you said but still see a 10~20%
enhancement. :(
Hmm, could you post your .config?

I want to investigate why such a difference happens between our machines.

The reason I want to see such a *big enhancement* in my machine is that,
as you know, with per-cpu, zram's write path will lose its blockable
section, and that would make upcoming features' implementation hard.

We also should test it in a very low memory situation, so that every write
path retries (i.e., double compression). With it, I want to see how much
performance can drop.

If both tests (normal: huge win, low memory: small regression) are fine,
we can go with the per-cpu approach at the cost of giving up the blockable
section. :)

Thanks.
> test     8 streams       per-cpu
>
> #jobs1
> READ:    4118.2MB/s      4105.3MB/s
> READ:    3487.7MB/s      3624.9MB/s
> WRITE:   2197.8MB/s      2305.1MB/s
> WRITE:   1776.2MB/s      1887.5MB/s
> READ:    736589KB/s      745648KB/s
> WRITE:   736353KB/s      745409KB/s
> READ:    679279KB/s      686559KB/s
> WRITE:   679093KB/s      686371KB/s
> #jobs2
> READ:    6924.6MB/s      7160.2MB/s
> READ:    6213.2MB/s      6247.1MB/s
> WRITE:   2510.3MB/s      3680.1MB/s
> WRITE:   2286.2MB/s      3153.9MB/s
> READ:    1163.1MB/s      1333.7MB/s
> WRITE:   1163.4MB/s      1332.2MB/s
> READ:    1122.9MB/s      1240.3MB/s
> WRITE:   1121.9MB/s      1239.2MB/s
> #jobs3
> READ:    10304MB/s       10424MB/s
> READ:    9014.5MB/s      9014.5MB/s
> WRITE:   3883.9MB/s      5373.8MB/s
> WRITE:   3549.1MB/s      4576.4MB/s
> READ:    1704.4MB/s      1916.8MB/s
> WRITE:   1704.9MB/s      1915.9MB/s
> READ:    1603.5MB/s      1806.8MB/s
> WRITE:   1598.8MB/s      1800.8MB/s
> #jobs4
> READ:    13509MB/s       12792MB/s
> READ:    10899MB/s       11434MB/s
> WRITE:   4027.2MB/s      6272.8MB/s
> WRITE:   3902.1MB/s      5389.2MB/s
> READ:    2090.9MB/s      2344.4MB/s
> WRITE:   2085.2MB/s      2337.1MB/s
> READ:    1968.1MB/s      2185.9MB/s
> WRITE:   1969.5MB/s      2186.4MB/s
> #jobs5
> READ:    12634MB/s       11607MB/s
> READ:    9932.7MB/s      9980.6MB/s
> WRITE:   4275.8MB/s      5844.3MB/s
> WRITE:   4210.1MB/s      5262.3MB/s
> READ:    1995.6MB/s      2211.4MB/s
> WRITE:   1988.4MB/s      2203.4MB/s
> READ:    1930.1MB/s      2191.8MB/s
> WRITE:   1929.8MB/s      2190.3MB/s
> #jobs6
> READ:    12270MB/s       13012MB/s
> READ:    11221MB/s       10815MB/s
> WRITE:   4643.4MB/s      6090.9MB/s
> WRITE:   4373.6MB/s      5772.8MB/s
> READ:    2232.6MB/s      2358.4MB/s
> WRITE:   2233.4MB/s      2359.2MB/s
> READ:    2082.6MB/s      2285.8MB/s
> WRITE:   2075.9MB/s      2278.1MB/s
> #jobs7
> READ:    13617MB/s       14172MB/s
> READ:    12290MB/s       11734MB/s
> WRITE:   5077.3MB/s      6315.7MB/s
> WRITE:   4719.4MB/s      5825.1MB/s
> READ:    2379.8MB/s      2523.7MB/s
> WRITE:   2373.7MB/s      2516.7MB/s
> READ:    2287.9MB/s      2362.4MB/s
> WRITE:   2283.9MB/s      2358.2MB/s
> #jobs8
> READ:    15130MB/s       15533MB/s
> READ:    12952MB/s       13077MB/s
> WRITE:   5586.6MB/s      7108.2MB/s
> WRITE:   5233.5MB/s      6591.3MB/s
> READ:    2541.2MB/s      2709.2MB/s
> WRITE:   2544.6MB/s      2713.2MB/s
> READ:    2450.6MB/s      2590.7MB/s
> WRITE:   2449.4MB/s      2589.3MB/s
> #jobs9
> READ:    13480MB/s       13909MB/s
> READ:    12389MB/s       12000MB/s
> WRITE:   5266.8MB/s      6594.9MB/s
> WRITE:   4971.6MB/s      6442.2MB/s
> READ:    2464.9MB/s      2470.9MB/s
> WRITE:   2482.7MB/s      2488.8MB/s
> READ:    2171.9MB/s      2402.2MB/s
> WRITE:   2174.9MB/s
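The "double compression" slow path under discussion -- the atomic handle
allocation fails, the per-cpu stream has to be given up before sleeping,
and the page is compressed a second time -- can be modeled in userspace.
A minimal sketch; all names and the injected allocators are illustrative,
not zram's actual API:

```python
# Toy model of a per-cpu-stream write path under memory pressure: the fast
# path compresses while holding the (non-blockable) per-cpu stream and tries
# a non-sleeping allocation; on failure the stream is released, a blocking
# allocation is done, and the page is compressed again -- the re-compression
# being counted in the thread's mm_stat debug field.

def zram_write(page, compress, alloc_atomic, alloc_blocking, stats):
    size = compress(page)               # first compression, stream held
    handle = alloc_atomic(size)
    if handle is None:                  # low memory: atomic alloc failed
        stats["recompressions"] += 1
        handle = alloc_blocking(size)   # may sleep; stream was given up,
        size = compress(page)           # so the page is compressed again
    return handle, size

# tiny demo with stub allocators simulating an allocation failure:
stats = {"recompressions": 0}
handle, size = zram_write(b"x" * 4096,
                          compress=lambda p: len(p) // 2,
                          alloc_atomic=lambda sz: None,
                          alloc_blocking=lambda sz: object(),
                          stats=stats)
print(stats["recompressions"])  # 1
```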
Re: zram: per-cpu compression streams
Hello Minchan,

On (03/31/16 15:34), Sergey Senozhatsky wrote:
> > I tested with your suggested parameters.
> > In my side, the win is better compared to my previous test, but it seems
> > your test is so fast. IOW, the filesize is small and loops is just 1.
> > Please test filesize=500m loops=10 or 20.

fio
- loops=10
- buffer_pattern=0xbadc0ffee

zram 6G. no intel p-state, deadline IO scheduler, no lockdep (no lock
debugging).

test     8 streams       per-cpu

#jobs1
READ:    4118.2MB/s      4105.3MB/s
READ:    3487.7MB/s      3624.9MB/s
WRITE:   2197.8MB/s      2305.1MB/s
WRITE:   1776.2MB/s      1887.5MB/s
READ:    736589KB/s      745648KB/s
WRITE:   736353KB/s      745409KB/s
READ:    679279KB/s      686559KB/s
WRITE:   679093KB/s      686371KB/s
#jobs2
READ:    6924.6MB/s      7160.2MB/s
READ:    6213.2MB/s      6247.1MB/s
WRITE:   2510.3MB/s      3680.1MB/s
WRITE:   2286.2MB/s      3153.9MB/s
READ:    1163.1MB/s      1333.7MB/s
WRITE:   1163.4MB/s      1332.2MB/s
READ:    1122.9MB/s      1240.3MB/s
WRITE:   1121.9MB/s      1239.2MB/s
#jobs3
READ:    10304MB/s       10424MB/s
READ:    9014.5MB/s      9014.5MB/s
WRITE:   3883.9MB/s      5373.8MB/s
WRITE:   3549.1MB/s      4576.4MB/s
READ:    1704.4MB/s      1916.8MB/s
WRITE:   1704.9MB/s      1915.9MB/s
READ:    1603.5MB/s      1806.8MB/s
WRITE:   1598.8MB/s      1800.8MB/s
#jobs4
READ:    13509MB/s       12792MB/s
READ:    10899MB/s       11434MB/s
WRITE:   4027.2MB/s      6272.8MB/s
WRITE:   3902.1MB/s      5389.2MB/s
READ:    2090.9MB/s      2344.4MB/s
WRITE:   2085.2MB/s      2337.1MB/s
READ:    1968.1MB/s      2185.9MB/s
WRITE:   1969.5MB/s      2186.4MB/s
#jobs5
READ:    12634MB/s       11607MB/s
READ:    9932.7MB/s      9980.6MB/s
WRITE:   4275.8MB/s      5844.3MB/s
WRITE:   4210.1MB/s      5262.3MB/s
READ:    1995.6MB/s      2211.4MB/s
WRITE:   1988.4MB/s      2203.4MB/s
READ:    1930.1MB/s      2191.8MB/s
WRITE:   1929.8MB/s      2190.3MB/s
#jobs6
READ:    12270MB/s       13012MB/s
READ:    11221MB/s       10815MB/s
WRITE:   4643.4MB/s      6090.9MB/s
WRITE:   4373.6MB/s      5772.8MB/s
READ:    2232.6MB/s      2358.4MB/s
WRITE:   2233.4MB/s      2359.2MB/s
READ:    2082.6MB/s      2285.8MB/s
WRITE:   2075.9MB/s      2278.1MB/s
#jobs7
READ:    13617MB/s       14172MB/s
READ:    12290MB/s       11734MB/s
WRITE:   5077.3MB/s      6315.7MB/s
WRITE:   4719.4MB/s      5825.1MB/s
READ:    2379.8MB/s      2523.7MB/s
WRITE:   2373.7MB/s      2516.7MB/s
READ:    2287.9MB/s      2362.4MB/s
WRITE:   2283.9MB/s      2358.2MB/s
#jobs8
READ:    15130MB/s       15533MB/s
READ:    12952MB/s       13077MB/s
WRITE:   5586.6MB/s      7108.2MB/s
WRITE:   5233.5MB/s      6591.3MB/s
READ:    2541.2MB/s      2709.2MB/s
WRITE:   2544.6MB/s      2713.2MB/s
READ:    2450.6MB/s      2590.7MB/s
WRITE:   2449.4MB/s      2589.3MB/s
#jobs9
READ:    13480MB/s       13909MB/s
READ:    12389MB/s       12000MB/s
WRITE:   5266.8MB/s      6594.9MB/s
WRITE:   4971.6MB/s      6442.2MB/s
READ:    2464.9MB/s      2470.9MB/s
WRITE:   2482.7MB/s      2488.8MB/s
READ:    2171.9MB/s      2402.2MB/s
WRITE:   2174.9MB/s      2405.5MB/s
#jobs10
READ:    14647MB/s       14667MB/s
READ:    11765MB/s       12032MB/s
WRITE:   5248.7MB/s      6740.4MB/s
WRITE:   4779.8MB/s      5822.8MB/s
READ:    2448.8MB/s      2585.3MB/s
WRITE:   2449.4MB/s      2585.9MB/s
READ:    2290.5MB/s      2409.1MB/s
WRITE:   2290.2MB/s      2409.7MB/s

	-ss
Re: zram: per-cpu compression streams
Hello Minchan,

On (03/31/16 14:53), Minchan Kim wrote:
> Hello Sergey,
>
> > that's a good question. I quickly looked into the fio source code,
> > we need to use the "buffer_pattern=str" option, I think. so the buffers
> > will be filled with the same data.
> >
> > I don't mind having buffer_compress_percentage as a separate test (set
> > as a local test option), but I think that using a common buffer pattern
> > adds more confidence when we compare test results.
>
> If we both use the same "buffer_compress_percentage=something", it's
> good to compare. The benefit of buffer_compress_percentage is that we can
> change the compression ratio easily in zram testing and run various
> tests to see how compression ratio or speed affects the system.

let's start with "common data" (buffer_pattern=str), not a common compression
ratio. buffer_compress_percentage=something is calculated for which compression
algorithm? deflate (zlib)? or is it something else? we use lzo/lz4, so common
data is more predictable.

[..]
> > sure.
>
> I tested with your suggested parameters.
> In my side, the win is better compared to my previous test, but it seems
> your test is so fast. IOW, filesize is small and loops is just 1.
> Please test filesize=500m loops=10 or 20.

that will require a 5G zram device; I don't have that much RAM on the box, so
I'll test later today on another box.

I split the device size between jobs: if I have 10 jobs, then the file size of
each job is DISK_SIZE/10, but in total the jobs write/read DEVICE_SZ bytes.
jobs start with one large DEVICE_SZ/1 file and go down to 10 DEVICE_SZ/10
files.

> It can make your test more stable and the enhancement is 10~20% in my side.
> Let's discuss further once the test results between us are consistent.

	-ss
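for reference, the per-job file-size split described above boils down to this
little piece of arithmetic (a stand-alone sketch using the same DEVICE_SZ and
10% FREE_SPACE values as the test scripts in this thread; JOBS=4 is just an
example value):

```shell
# Per-job fio file size: split the device between jobs, then subtract the
# reserved free space. DEVICE_SZ/FREE_SPACE follow the thread's test script;
# JOBS=4 is an arbitrary example.
DEVICE_SZ=$((3 * 1024 * 1024 * 1024))   # 3G zram device
FREE_SPACE=$((DEVICE_SZ / 10))          # keep 10% of the device free
JOBS=4

PER_JOB_MB=$(( (DEVICE_SZ / JOBS - FREE_SPACE) / (1024 * 1024) ))
echo "size=${PER_JOB_MB}M numjobs=${JOBS}"
```

so with 4 jobs each job gets a 460M file, and the jobs together stay just
under the device size minus the reserved slack.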
Re: zram: per-cpu compression streams
Hello Sergey,

On Thu, Mar 31, 2016 at 10:26:26AM +0900, Sergey Senozhatsky wrote:
> Hello,
>
> On (03/31/16 07:12), Minchan Kim wrote:
> [..]
> > > I used a bit different script. no `buffer_compress_percentage' option,
> > > because it provides "a mix of random data and zeroes"
> >
> > Normally, zram's compression ratio is 3 or 2, so I used it.
> > Hmm, isn't it closer to the real usecase?
>
> does this option guarantee that the data supplied to zram will have
> the requested compression ratio? hm, but we never do that in real
> life, zram sees random data.

I agree it's hard to create such random read data with a benchmark.
One option is that we share swap dump data of a real product, for
example, android or webOS, and feed it to the benchmark. But as you
know, it cannot cover all of the workloads, either.

So, for an easy test, I wanted to make representative compression ratio
data, and fio provides an option for it via buffer_compress_percentage.
It would be better than feeding random data, which could add a lot of
noise to each test cycle.

> > If we don't use buffer_compress_percentage, what's the content in the
> > buffer?
>
> that's a good question. I quickly looked into the fio source code,
> we need to use the "buffer_pattern=str" option, I think. so the buffers
> will be filled with the same data.
>
> I don't mind having buffer_compress_percentage as a separate test (set
> as a local test option), but I think that using a common buffer pattern
> adds more confidence when we compare test results.

If we both use the same "buffer_compress_percentage=something", it's
good to compare. The benefit of buffer_compress_percentage is that we
can change the compression ratio easily in zram testing and run various
tests to see how compression ratio or speed affects the system.

> [..]
> > > hm, but I guess it's not enough; fio probably will have different
> > > data (well, only if we didn't ask it to zero-fill the buffers) for
> > > different tests, causing different zram->zsmalloc behaviour.
> > > need to check it.
> [..]
> > > #jobs4
> > > READ:      8720.4MB/s     7301.7MB/s     7896.2MB/s
> > > READ:      7510.3MB/s     6690.1MB/s     6456.2MB/s
> > > WRITE:     2211.6MB/s     1930.8MB/s     2713.9MB/s
> > > WRITE:     2002.2MB/s     1629.8MB/s     2227.7MB/s
> >
> > Your case is a 40% win. It's huge, Nice!
> > I tested with your guidelines (i.e., no buffer_compress_percentage,
> > scramble_buffers=0) but still a 10% enhancement in my machine.
> > Hmm,,,
> >
> > How about if you test my fio job.file in your machine?
> > Still, it's a 40% win?
>
> I'll retest with the new config.
>
> > Also, I want to test again in your exactly same configuration.
> > Could you tell me the zram environment (ie, disksize, compression
> > algorithm) and share your job.file of fio?
>
> sure.

I tested with your suggested parameters. In my side, the win is better
compared to my previous test, but it seems your test is so fast. IOW,
filesize is small and loops is just 1. Please test filesize=500m
loops=10 or 20. It can make your test more stable and the enhancement
is 10~20% in my side. Let's discuss further once the test results
between us are consistent.

Thanks.

> > 3G, lzo
>
> --- my fio-template is
>
> [global]
> bs=4k
> ioengine=sync
> direct=1
> size=__SIZE__
> numjobs=__JOBS__
> group_reporting
> filename=/dev/zram0
> loops=1
> buffer_pattern=0xbadc0ffee
> scramble_buffers=0
>
> [seq-read]
> rw=read
> stonewall
>
> [rand-read]
> rw=randread
> stonewall
>
> [seq-write]
> rw=write
> stonewall
>
> [rand-write]
> rw=randwrite
> stonewall
>
> [mixed-seq]
> rw=rw
> stonewall
>
> [mixed-rand]
> rw=randrw
> stonewall
>
> #separate test with
> #buffer_compress_percentage=50
>
> --- my create-zram script is as follows.
>
> #!/bin/sh
>
> rmmod zram
> modprobe zram
>
> if [ -e /sys/block/zram0/initstate ]; then
> 	initdone=`cat /sys/block/zram0/initstate`
> 	if [ $initdone = 1 ]; then
> 		echo "init done"
> 		exit 1
> 	fi
> fi
>
> echo 8 > /sys/block/zram0/max_comp_streams
> echo lzo > /sys/block/zram0/comp_algorithm
> cat /sys/block/zram0/comp_algorithm
> cat /sys/block/zram0/max_comp_streams
> echo $1 > /sys/block/zram0/disksize
>
> --- and I use it as
>
> #!/bin/sh
>
> DEVICE_SZ=$((3 * 1024 * 1024 * 1024))
> FREE_SPACE=$(($DEVICE_SZ / 10))
> LOG=/tmp/fio-zram-test
> LOG_SUFFIX=$1
>
> function reset_zram
> {
> 	rmmod zram
> }
>
> function create_zram
> {
> 	./create-zram $DEVICE_SZ
> }
>
> function main
> {
> 	local j
> 	local i
>
> 	if [ "z$LOG_SUFFIX" = "z" ]; then
> 		LOG_SUFFIX="UNSET"
> 	fi
>
> 	LOG=$LOG-$LOG_SUFFIX
>
> 	for i in {1..10}; do
> 		reset_zram
> 		create_zram
> 		cat fio-test-template | sed s/__JOBS__/$i/ | sed s/__SIZE__/$((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M/ > fio-test
Re: zram: per-cpu compression streams
Hello,

On (03/31/16 07:12), Minchan Kim wrote:
[..]
> > I used a bit different script. no `buffer_compress_percentage' option,
> > because it provides "a mix of random data and zeroes"
>
> Normally, zram's compression ratio is 3 or 2, so I used it.
> Hmm, isn't it closer to the real usecase?

does this option guarantee that the data supplied to zram will have the
requested compression ratio? hm, but we never do that in real life, zram
sees random data.

> If we don't use buffer_compress_percentage, what's the content in the
> buffer?

that's a good question. I quickly looked into the fio source code, we
need to use the "buffer_pattern=str" option, I think. so the buffers
will be filled with the same data.

I don't mind having buffer_compress_percentage as a separate test (set
as a local test option), but I think that using a common buffer pattern
adds more confidence when we compare test results.

[..]
> > hm, but I guess it's not enough; fio probably will have different
> > data (well, only if we didn't ask it to zero-fill the buffers) for
> > different tests, causing different zram->zsmalloc behaviour. need
> > to check it.
[..]
> > #jobs4
> > READ:      8720.4MB/s     7301.7MB/s     7896.2MB/s
> > READ:      7510.3MB/s     6690.1MB/s     6456.2MB/s
> > WRITE:     2211.6MB/s     1930.8MB/s     2713.9MB/s
> > WRITE:     2002.2MB/s     1629.8MB/s     2227.7MB/s
>
> Your case is a 40% win. It's huge, Nice!
> I tested with your guidelines (i.e., no buffer_compress_percentage,
> scramble_buffers=0) but still a 10% enhancement in my machine.
> Hmm,,,
>
> How about if you test my fio job.file in your machine?
> Still, it's a 40% win?

I'll retest with the new config.

> Also, I want to test again in your exactly same configuration.
> Could you tell me the zram environment (ie, disksize, compression
> algorithm) and share your job.file of fio?

sure.
3G, lzo

--- my fio-template is

[global]
bs=4k
ioengine=sync
direct=1
size=__SIZE__
numjobs=__JOBS__
group_reporting
filename=/dev/zram0
loops=1
buffer_pattern=0xbadc0ffee
scramble_buffers=0

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

[mixed-seq]
rw=rw
stonewall

[mixed-rand]
rw=randrw
stonewall

#separate test with
#buffer_compress_percentage=50

--- my create-zram script is as follows.

#!/bin/sh

rmmod zram
modprobe zram

if [ -e /sys/block/zram0/initstate ]; then
	initdone=`cat /sys/block/zram0/initstate`
	if [ $initdone = 1 ]; then
		echo "init done"
		exit 1
	fi
fi

echo 8 > /sys/block/zram0/max_comp_streams
echo lzo > /sys/block/zram0/comp_algorithm
cat /sys/block/zram0/comp_algorithm
cat /sys/block/zram0/max_comp_streams
echo $1 > /sys/block/zram0/disksize

--- and I use it as

#!/bin/sh

DEVICE_SZ=$((3 * 1024 * 1024 * 1024))
FREE_SPACE=$(($DEVICE_SZ / 10))
LOG=/tmp/fio-zram-test
LOG_SUFFIX=$1

function reset_zram
{
	rmmod zram
}

function create_zram
{
	./create-zram $DEVICE_SZ
}

function main
{
	local j
	local i

	if [ "z$LOG_SUFFIX" = "z" ]; then
		LOG_SUFFIX="UNSET"
	fi

	LOG=$LOG-$LOG_SUFFIX

	for i in {1..10}; do
		reset_zram
		create_zram
		cat fio-test-template | sed s/__JOBS__/$i/ | sed s/__SIZE__/$((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M/ > fio-test
		echo "#jobs$i" >> $LOG
		time fio ./fio-test >> $LOG
	done

	reset_zram
}

main

-- then I use this simple script

#!/bin/sh

if [ "z$2" = "z" ]; then
	cat $1 | egrep "#jobs|READ|WRITE" | awk '{printf "%-15s %15s\n", $1, $3}' | sed s/aggrb=// | sed s/,//
else
	cat $1 | egrep "#jobs|READ|WRITE" | awk '{printf " %-15s\n", $3}' | sed s/aggrb=// | sed s/\#jobs[0-9]*// | sed s/,//
fi

as

./squeeze.sh fio-zram-test-4-stream > 4s
./squeeze.sh fio-zram-test-8-stream A > 8s
./squeeze.sh fio-zram-test-per-cpu A > pc

and paste 4s 8s pc > result

	-ss
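on the buffer_compress_percentage question raised in this thread: fio's
documentation says the option is approximated by mixing zeroes and random
data. a rough stand-alone illustration of that mixing (my own sketch, not
fio's code; it uses gzip only because it is commonly available, while fio
itself does not use gzip for this):

```shell
# Build a 4K buffer that is 50% zeroes and 50% random bytes, roughly the way
# fio approximates buffer_compress_percentage=50, then see how well it
# actually compresses with one particular compressor (gzip/deflate).
tmp=$(mktemp)
{ head -c 2048 /dev/zero; head -c 2048 /dev/urandom; } > "$tmp"

orig=$(wc -c < "$tmp")
comp=$(gzip -c "$tmp" | wc -c)
echo "original: $orig bytes, gzip'ed: $comp bytes"
rm -f "$tmp"
```

the achieved size depends on the compressor, which is exactly the point
above: the ratio such a mix compresses to under deflate is not the ratio
lzo/lz4 will see.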
Re: zram: per-cpu compression streams
On Wed, Mar 30, 2016 at 05:34:19PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> sorry for the long reply.
>
> On (03/28/16 12:21), Minchan Kim wrote:
> [..]
> > group_reporting
> > buffer_compress_percentage=50
> > filename=/dev/zram0
> > loops=10
>
> I used a bit different script. no `buffer_compress_percentage' option,
> because it provides "a mix of random data and zeroes"

Normally, zram's compression ratio is 3 or 2, so I used it.
Hmm, isn't it closer to the real usecase?

If we don't use buffer_compress_percentage, what's the content in the
buffer?

> buffer_compress_percentage=int
> 	If this is set, then fio will attempt to provide IO buffer content
> 	(on WRITEs) that compresses to the specified level. Fio does this by
> 	providing a mix of random data and zeroes
>
> and I also used scramble_buffers=0. but the default scramble_buffers is
> true, so
>
> scramble_buffers=bool
> 	If refill_buffers is too costly and the target is using data
> 	deduplication, then setting this option will slightly modify the IO
> 	buffer contents to defeat normal de-dupe attempts. This is not
> 	enough to defeat more clever block compression attempts, but it will
> 	stop naive dedupe of blocks. Default: true.
>
> hm, but I guess it's not enough; fio probably will have different
> data (well, only if we didn't ask it to zero-fill the buffers) for
> different tests, causing different zram->zsmalloc behaviour. need
> to check it.
>
> > Hmm, Could you retest to show how big the benefit is?
>
> sure.
> the results are:
>
> - seq-read
> - rand-read
> - seq-write
> - rand-write          (READ + WRITE)
> - mixed-seq
> - mixed-rand          (READ + WRITE)
>
> TEST            4 streams       8 streams       per-cpu
>
> #jobs1
>  READ:          2665.4MB/s      2515.2MB/s      2632.4MB/s
>  READ:          2258.2MB/s      2055.2MB/s      2166.2MB/s
>  WRITE:         933180KB/s      894260KB/s      898234KB/s
>  WRITE:         765576KB/s      728154KB/s      746396KB/s
>  READ:          563169KB/s      541004KB/s      551541KB/s
>  WRITE:         562660KB/s      540515KB/s      551043KB/s
>  READ:          493656KB/s      477990KB/s      488041KB/s
>  WRITE:         493210KB/s      477558KB/s      487600KB/s
> #jobs2
>  READ:          5116.7MB/s      4607.1MB/s      4401.5MB/s
>  READ:          4401.5MB/s      3993.6MB/s      3831.6MB/s
>  WRITE:         1539.9MB/s      1425.5MB/s      1600.0MB/s
>  WRITE:         1311.1MB/s      1228.7MB/s      1380.6MB/s
>  READ:          1001.8MB/s      960799KB/s      989.63MB/s
>  WRITE:         998.31MB/s      957540KB/s      986.26MB/s
>  READ:          921439KB/s      860387KB/s      899720KB/s
>  WRITE:         918314KB/s      857469KB/s      896668KB/s
> #jobs3
>  READ:          6670.9MB/s      6469.9MB/s      6548.8MB/s
>  READ:          5743.4MB/s      5507.8MB/s      5608.4MB/s
>  WRITE:         1923.8MB/s      1885.9MB/s      2191.9MB/s
>  WRITE:         1622.4MB/s      1605.4MB/s      1842.2MB/s
>  READ:          1277.3MB/s      1295.8MB/s      1395.2MB/s
>  WRITE:         1276.9MB/s      1295.4MB/s      1394.7MB/s
>  READ:          1152.6MB/s      1137.1MB/s      1216.6MB/s
>  WRITE:         1152.2MB/s      1137.6MB/s      1216.2MB/s
> #jobs4
>  READ:          8720.4MB/s      7301.7MB/s      7896.2MB/s
>  READ:          7510.3MB/s      6690.1MB/s      6456.2MB/s
>  WRITE:         2211.6MB/s      1930.8MB/s      2713.9MB/s
>  WRITE:         2002.2MB/s      1629.8MB/s      2227.7MB/s

Your case is a 40% win. It's huge, Nice!
I tested with your guidelines (i.e., no buffer_compress_percentage,
scramble_buffers=0) but still a 10% enhancement in my machine.
Hmm,,,

How about if you test my fio job.file in your machine?
Still, it's a 40% win?

Also, I want to test again in your exactly same configuration.
Could you tell me the zram environment (ie, disksize, compression
algorithm) and share your job.file of fio?

Thanks.
Re: zram: per-cpu compression streams
Hello Minchan,
sorry for the long reply.

On (03/28/16 12:21), Minchan Kim wrote:
[..]
> group_reporting
> buffer_compress_percentage=50
> filename=/dev/zram0
> loops=10

I used a bit different script. no `buffer_compress_percentage' option,
because it provides "a mix of random data and zeroes"

	buffer_compress_percentage=int
		If this is set, then fio will attempt to provide IO buffer
		content (on WRITEs) that compresses to the specified level.
		Fio does this by providing a mix of random data and zeroes

and I also used scramble_buffers=0. but the default scramble_buffers is
true, so

	scramble_buffers=bool
		If refill_buffers is too costly and the target is using data
		deduplication, then setting this option will slightly modify
		the IO buffer contents to defeat normal de-dupe attempts. This
		is not enough to defeat more clever block compression attempts,
		but it will stop naive dedupe of blocks. Default: true.

hm, but I guess it's not enough; fio probably will have different
data (well, only if we didn't ask it to zero-fill the buffers) for
different tests, causing different zram->zsmalloc behaviour. need
to check it.

> Hmm, Could you retest to show how big the benefit is?

sure.
the results are:

- seq-read
- rand-read
- seq-write
- rand-write          (READ + WRITE)
- mixed-seq
- mixed-rand          (READ + WRITE)

TEST            4 streams       8 streams       per-cpu

#jobs1
 READ:          2665.4MB/s      2515.2MB/s      2632.4MB/s
 READ:          2258.2MB/s      2055.2MB/s      2166.2MB/s
 WRITE:         933180KB/s      894260KB/s      898234KB/s
 WRITE:         765576KB/s      728154KB/s      746396KB/s
 READ:          563169KB/s      541004KB/s      551541KB/s
 WRITE:         562660KB/s      540515KB/s      551043KB/s
 READ:          493656KB/s      477990KB/s      488041KB/s
 WRITE:         493210KB/s      477558KB/s      487600KB/s
#jobs2
 READ:          5116.7MB/s      4607.1MB/s      4401.5MB/s
 READ:          4401.5MB/s      3993.6MB/s      3831.6MB/s
 WRITE:         1539.9MB/s      1425.5MB/s      1600.0MB/s
 WRITE:         1311.1MB/s      1228.7MB/s      1380.6MB/s
 READ:          1001.8MB/s      960799KB/s      989.63MB/s
 WRITE:         998.31MB/s      957540KB/s      986.26MB/s
 READ:          921439KB/s      860387KB/s      899720KB/s
 WRITE:         918314KB/s      857469KB/s      896668KB/s
#jobs3
 READ:          6670.9MB/s      6469.9MB/s      6548.8MB/s
 READ:          5743.4MB/s      5507.8MB/s      5608.4MB/s
 WRITE:         1923.8MB/s      1885.9MB/s      2191.9MB/s
 WRITE:         1622.4MB/s      1605.4MB/s      1842.2MB/s
 READ:          1277.3MB/s      1295.8MB/s      1395.2MB/s
 WRITE:         1276.9MB/s      1295.4MB/s      1394.7MB/s
 READ:          1152.6MB/s      1137.1MB/s      1216.6MB/s
 WRITE:         1152.2MB/s      1137.6MB/s      1216.2MB/s
#jobs4
 READ:          8720.4MB/s      7301.7MB/s      7896.2MB/s
 READ:          7510.3MB/s      6690.1MB/s      6456.2MB/s
 WRITE:         2211.6MB/s      1930.8MB/s      2713.9MB/s
 WRITE:         2002.2MB/s      1629.8MB/s      2227.7MB/s
 READ:          1657.8MB/s      1437.1MB/s      1765.8MB/s
 WRITE:         1651.7MB/s      1432.7MB/s      1759.3MB/s
 READ:          1467.7MB/s      1201.7MB/s      1523.5MB/s
 WRITE:         1462.3MB/s      1197.3MB/s      1517.9MB/s
#jobs5
 READ:          7791.9MB/s      6852.7MB/s      7487.9MB/s
 READ:          6214.6MB/s      6449.6MB/s      7106.5MB/s
 WRITE:         2017.9MB/s      1978.1MB/s      2221.5MB/s
 WRITE:         1913.1MB/s      1664.9MB/s      1985.8MB/s
 READ:          1417.6MB/s      1447.7MB/s      1558.8MB/s
 WRITE:         1419.8MB/s      1449.3MB/s      1561.2MB/s
 READ:          1336.9MB/s      1234.1MB/s      1404.7MB/s
 WRITE:         1338.2MB/s      1236.9MB/s      1406.8MB/s
#jobs6
 READ:          8680.9MB/s      8500.0MB/s      7116.3MB/s
 READ:          7329.4MB/s      6580.7MB/s      6476.2MB/s
 WRITE:         2121.4MB/s      1918.6MB/s      2472.8MB/s
 WRITE:         1936.8MB/s      1826.9MB/s      2106.8MB/s
 READ:          1559.9MB/s      1506.3MB/s      1643.6MB/s
 WRITE:         1554.7MB/s      1501.2MB/s      1637.2MB/s
 READ:          1459.7MB/s      1258.9MB/s      1502.6MB/s
 WRITE:         1454.8MB/s      1254.6MB/s      1497.5MB/s
#jobs7
 READ:          9170.0MB/s      7905.2MB/s      8043.9MB/s
 READ:          6412.7MB/s      6792.7MB/s      6457.8MB/s
 WRITE:         2042.4MB/s      1972.5MB/s      2400.6MB/s
 WRITE:         1938.8MB/s      1808.7MB/s      2152.6MB/s
 READ:          1634.9MB/s      1505.8MB/s      1746.4MB/s
 WRITE:         1640.1MB/s      1511.4MB/s      1753.7MB/s
 READ:          1407.9MB/s      1239.1MB/s      1480.8MB/s
 WRITE:         1413.8MB/s      1245.2MB/s      1486.1MB/s
#jobs8
 READ:          8563.4MB/s      8106.7MB/s      7696.3MB/s
 READ:          6909.1MB/s      5790.5MB/s      6537.7MB/s
 WRITE:         2040.3MB/s      2061.2MB/s      2481.7MB/s
 WRITE:         1993.5MB/s      1859.4MB/s      2171.5MB/s
 READ:          1691.6MB/s      1585.8MB/s      1749.9MB/s
 WRITE:         1686.3MB/s
Re: zram: per-cpu compression streams
Hi Sergey,

On Fri, Mar 25, 2016 at 10:47:06AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
>
> On (03/25/16 08:41), Minchan Kim wrote:
> [..]
> > > Test #10 iozone -t 10 -R -r 80K -s 0M -I +Z
> > >  Initial write    3213973.56     2731512.62     4416466.25*
> > >  Rewrite          3066956.44*    2693819.50      332671.94
> > >  Read             7769523.25*    2681473.75      462840.44
> > >  Re-read          5244861.75     5473037.00*     382183.03
> > >  Reverse Read     7479397.25*    4869597.75      374714.06
> > >  Stride read      5403282.50*    5385083.75      382473.44
> > >  Random read      5131997.25     5176799.75*     380593.56
> > >  Mixed workload   3998043.25     4219049.00*    1645850.45
> > >  Random write     3452832.88     3290861.69     3588531.75*
> > >  Pwrite           3757435.81     2711756.47     4561807.88*
> > >  Pread            2743595.25*    2635835.00      412947.98
> > >  Fwrite          16076549.00    16741977.25*   14797209.38
> > >  Fread           23581812.62*   21664184.25     5064296.97
> > > = real  0m44.490s   0m44.444s   0m44.609s
> > > = user  0m0.054s    0m0.049s    0m0.055s
> > > = sys   0m0.037s    0m0.046s    0m0.148s
> > >
> > > so when the number of active tasks becomes larger than the number
> > > of online CPUs, iozone reports a bit hard to understand data. I
> > > can assume that since now we keep preemption disabled longer in
> > > the write path, a concurrent operation (READ or WRITE) cannot preempt
> > > current anymore... slightly suspicious.
> > >
> > > the other hard to understand thing is why the READ-only tests have
> > > such a huge jitter. READ-only tests don't depend on streams, they
> > > don't even use them, we supply compressed data directly to the
> > > decompression api.
> > >
> > > maybe better to retire iozone and never use it again.
> > >
> > > "118 insertions(+), 238 deletions(-)" the patches remove a big
> > > pile of code.
> >
> > First of all, I appreciate you very much!
>
> thanks!
>
> > At a glance, on the write workload, a huge win, but it's worth
> > investigating how such fluctuation/regression happens on the
> > read-related tests (read and mixed workload).
> yes, was going to investigate in more detail but got interrupted,
> will return back to it today/tomorrow.
>
> > Could you send your patchset? I will test it.
>
> oh, sorry, sure! attached (because it's not a real patch submission
> yet, but they look more or less ready I guess).
>
> patches are against next-20160324.

Thanks, I tested your patch with fio. My laptop is 8G ram, 4 CPU.

job file is here.

=
[global]
bs=4k
ioengine=sync
direct=1
size=100m
numjobs=${NUMJOBS}
group_reporting
buffer_compress_percentage=50
filename=/dev/zram0
loops=10

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

[mixed-seq]
rw=rw
stonewall

[mixed-rand]
rw=randrw
stonewall
=

= old (ie, spinlock) version =

1) NR_PROCESS:8 NR_STREAM: 1

seq-read: (groupid=0, jobs=8): err= 0: pid=23148: Mon Mar 28 12:07:15 2016
  read : io=8000.0MB, bw=5925.1MB/s, iops=1517.4K, runt=  1350msec
rand-read: (groupid=1, jobs=8): err= 0: pid=23156: Mon Mar 28 12:07:15 2016
  read : io=8000.0MB, bw=4889.1MB/s, iops=1251.9K, runt=  1636msec
seq-write: (groupid=2, jobs=8): err= 0: pid=23164: Mon Mar 28 12:07:15 2016
  write: io=8000.0MB, bw=914898KB/s, iops=228724, runt=  8954msec
rand-write: (groupid=3, jobs=8): err= 0: pid=23172: Mon Mar 28 12:07:15 2016
  write: io=8000.0MB, bw=913368KB/s, iops=228342, runt=  8969msec
mixed-seq: (groupid=4, jobs=8): err= 0: pid=23180: Mon Mar 28 12:07:15 2016
  read : io=4003.1MB, bw=881152KB/s, iops=220287, runt=  4653msec
mixed-rand: (groupid=5, jobs=8): err= 0: pid=23189: Mon Mar 28 12:07:15 2016
  read : io=4003.5MB, bw=837491KB/s, iops=209372, runt=  4895msec

2) NR_PROCESS:8 NR_STREAM: 8

seq-read: (groupid=0, jobs=8): err= 0: pid=23248: Mon Mar 28 12:07:57 2016
  read : io=8000.0MB, bw=5847.1MB/s, iops=1497.8K, runt=  1368msec
rand-read: (groupid=1, jobs=8): err= 0: pid=23256: Mon Mar 28 12:07:57 2016
  read : io=8000.0MB, bw=4778.1MB/s, iops=1223.5K, runt=  1674msec
seq-write: (groupid=2, jobs=8): err= 0: pid=23264: Mon Mar 28 12:07:57 2016
  write: io=8000.0MB, bw=1644.7MB/s, iops=420879, runt=  4866msec
rand-write: (groupid=3, jobs=8): err= 0: pid=23272: Mon Mar 28 12:07:57 2016
  write: io=8000.0MB, bw=1507.5MB/s, iops=385905, runt=  5307msec
mixed-seq: (groupid=4, jobs=8): err= 0: pid=23280: Mon Mar 28 12:07:57 2016
  read : io=4003.1MB, bw=1225.1MB/s, iops=313839, runt=  3266msec
mixed-rand: (groupid=5, jobs=8): err= 0: pid=23288: Mon Mar 28 12:07:57 2016
  read : io=4003.5MB, bw=1098.4MB/s, iops=281097, runt=  3646msec

3) NR_PROCESS:8 NR_STREAM: 16

seq-read: (groupid=0, job
Re: zram: per-cpu compression streams
Hello Minchan,

On (03/25/16 08:41), Minchan Kim wrote:
[..]
> > Test #10 iozone -t 10 -R -r 80K -s 0M -I +Z
> >  Initial write    3213973.56     2731512.62     4416466.25*
> >  Rewrite          3066956.44*    2693819.50      332671.94
> >  Read             7769523.25*    2681473.75      462840.44
> >  Re-read          5244861.75     5473037.00*     382183.03
> >  Reverse Read     7479397.25*    4869597.75      374714.06
> >  Stride read      5403282.50*    5385083.75      382473.44
> >  Random read      5131997.25     5176799.75*     380593.56
> >  Mixed workload   3998043.25     4219049.00*    1645850.45
> >  Random write     3452832.88     3290861.69     3588531.75*
> >  Pwrite           3757435.81     2711756.47     4561807.88*
> >  Pread            2743595.25*    2635835.00      412947.98
> >  Fwrite          16076549.00    16741977.25*   14797209.38
> >  Fread           23581812.62*   21664184.25     5064296.97
> > = real  0m44.490s   0m44.444s   0m44.609s
> > = user  0m0.054s    0m0.049s    0m0.055s
> > = sys   0m0.037s    0m0.046s    0m0.148s
> >
> > so when the number of active tasks becomes larger than the number
> > of online CPUs, iozone reports a bit hard to understand data. I
> > can assume that since now we keep preemption disabled longer in
> > the write path, a concurrent operation (READ or WRITE) cannot preempt
> > current anymore... slightly suspicious.
> >
> > the other hard to understand thing is why the READ-only tests have
> > such a huge jitter. READ-only tests don't depend on streams, they
> > don't even use them, we supply compressed data directly to the
> > decompression api.
> >
> > maybe better to retire iozone and never use it again.
> >
> > "118 insertions(+), 238 deletions(-)" the patches remove a big
> > pile of code.
>
> First of all, I appreciate you very much!

thanks!

> At a glance, on the write workload, a huge win, but it's worth
> investigating how such fluctuation/regression happens on the
> read-related tests (read and mixed workload).

yes, was going to investigate in more detail but got interrupted,
will return back to it today/tomorrow.

> Could you send your patchset? I will test it.

oh, sorry, sure!
attached (because it's not a real patch submission yet, but they look
more or less ready I guess).

patches are against next-20160324.

	-ss

>From 6bf369f2180dad1c8013a4847ec09d3b9056e910 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky
Subject: [PATCH 1/2] zsmalloc: require GFP in zs_malloc()

Pass GFP flags to zs_malloc() instead of using fixed ones
(set during pool creation), so we can be more flexible.
Apart from that, this also aligns the zs_malloc() interface
with zspool/zbud.

Signed-off-by: Sergey Senozhatsky
---
 drivers/block/zram/zram_drv.c |  2 +-
 include/linux/zsmalloc.h      |  2 +-
 mm/zsmalloc.c                 | 15 ++-
 3 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 370c2f7..9030992 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -717,7 +717,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 		src = uncmem;
 	}

-	handle = zs_malloc(meta->mem_pool, clen);
+	handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM);
 	if (!handle) {
 		pr_err("Error allocating memory for compressed page: %u, size=%zu\n",
 			index, clen);
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 34eb160..6d89f8b 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -44,7 +44,7 @@ struct zs_pool;
 struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
 void zs_destroy_pool(struct zs_pool *pool);

-unsigned long zs_malloc(struct zs_pool *pool, size_t size);
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
 void zs_free(struct zs_pool *pool, unsigned long obj);

 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index e72efb1..19027a1 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -247,7 +247,6 @@ struct zs_pool {
 	struct size_class **size_class;
 	struct kmem_cache *handle_cachep;

-	gfp_t flags;	/* allocation flags used when growing pool */
 	atomic_long_t pages_allocated;

 	struct zs_pool_stats stats;
@@ -295,10 +294,10 @@ static void destroy_handle_cache(struct zs_pool *pool)
 	kmem_cache_destroy(pool->handle_cachep);
 }

-static unsigned long alloc_handle(struct zs_pool *pool)
+static unsigned long alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-			pool->flags & ~__GFP_HIGHMEM);
+			gfp & ~__GFP_HIGHMEM);
 }

 static void free_handle(struct zs_pool *pool, unsigned long handle)
@@ -335,7 +334,7 @@ static v
Re: zram: per-cpu compression streams
Hi Sergey,

On Wed, Mar 23, 2016 at 05:18:27PM +0900, Sergey Senozhatsky wrote:
> ( was "[PATCH] zram: export the number of available comp streams"
>   forked from http://marc.info/?l=linux-kernel&m=145860707516861 )
>
> d'oh sorry, now actually forked.
>
> Hello Minchan,
> forked into a separate thread.
>
> On (03/22/16 09:39), Minchan Kim wrote:
> > > zram_bvec_write()
> > > {
> > > 	get_cpu_ptr(comp->stream);
> > > 	zcomp_compress();
> > > 	zs_malloc();
> > > 	put_cpu_ptr(comp->stream);
> > > }
> > >
> > > this, however, makes zsmalloc unhappy. the pool has GFP_NOIO | __GFP_HIGHMEM
> > > gfp, and GFP_NOIO is ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM. this
> > > __GFP_DIRECT_RECLAIM is in conflict with per-cpu streams, because
> > > per-cpu streams require disabled preemption (up until we copy the stream
> > > buffer to the zspage). so what options do we have here... from the top of
> > > my head (w/o a lot of thinking)...
> >
> > Indeed.
> ...
> > How about this?
> >
> > zram_bvec_write()
> > {
> > retry:
> > 	get_cpu_ptr(comp->stream);
> > 	zcomp_compress();
> > 	handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
> > 	if (!handle) {
> > 		put_cpu_ptr(comp->stream);
> > 		handle = zs_malloc(gfp);
> > 		goto retry;
> > 	}
> > 	put_cpu_ptr(comp->stream);
> > }
>
> interesting. the retry jump should go higher, we have "user_mem = kmap_atomic(page)"
> which we unmap right after compression, because a) we don't need the
> uncompressed memory anymore b) zs_malloc() can sleep and we can't have an
> atomic mapping around. the nasty thing here is is_partial_io(). we need to re-do
>
> 	if (is_partial_io(bvec))
> 		memcpy(uncmem + offset, user_mem + bvec->bv_offset,
> 			bvec->bv_len);
>
> once again in the worst case.
>
> so zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) so far can cause
> a double memcpy() and double compression. just to outline this.
>
> the test.
>
> I executed a number of iozone tests, on each iteration re-creating the zram
> device (3GB, LZO, EXT4; the box has 4 x86_64 CPUs).
>
> $DEVICE_SZ=3G
> $FREE_SPACE is 10% of $DEVICE_SZ
> time ./iozone -t $i -R -r $((8*$i))K -s $((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M -I +Z
>
> columns:
> TEST               MAX_STREAMS 4   MAX_STREAMS 8   PER_CPU STREAMS
>
> Test #1 iozone -t 1 -R -r 8K -s 2764M -I +Z
>  Initial write      853492.31*      835868.50       839789.56
>  Rewrite           1642073.88      1657255.75      1693011.50*
>  Read              3384044.00*     3218727.25      3269109.50
>  Re-read           3389794.50*     3243187.00      3267422.25
>  Reverse Read      3209805.75*     3082040.00      3107957.25
>  Stride read       3100144.50*     2972280.25      2923155.25
>  Random read       2992249.75*     2874605.00      2854824.25
>  Mixed workload    2992274.75*     2878212.25      2883840.00
>  Random write      1471800.00      1452346.50      1515678.75*
>  Pwrite             802083.00       801627.31       820251.69*
>  Pread             3443495.00*     3308659.25      3302089.00
>  Fwrite            1880446.88      1838607.50      1909490.00*
>  Fread             3479614.75      3091634.75      6442964.50*
> = real  1m4.170s    1m4.513s    1m4.123s
> = user  0m0.559s    0m0.518s    0m0.511s
> = sys   0m18.766s   0m19.264s   0m18.641s
>
> Test #2 iozone -t 2 -R -r 16K -s 1228M -I +Z
>  Initial write     2102532.12      2051809.19      2419072.50*
>  Rewrite           2217024.25      2250930.00      3681559.00*
>  Read              7716933.25      7898759.00      8345507.75*
>  Re-read           7748487.75      7765282.25      8342367.50*
>  Reverse Read      7415254.25      7552637.25      7822691.75*
>  Stride read       7041909.50      7091049.25      7401273.00*
>  Random read       6205044.25      673.50          7232104.25*
>  Mixed workload    4582990.00      5271651.50      5361002.88*
>  Random write      2591893.62      2513729.88      3660774.38*
>  Pwrite            1873876.75      1909758.69      2087238.81*
>  Pread             4669850.00      4651121.56      4919588.44*
>  Fwrite            1937947.25      1940628.06      2034251.25*
>  Fread             9930319.00      9970078.00*     9831422.50
> = real  0m53.844s   0m53.607s   0m52.528s
> = user  0m0.273s    0m0.289s    0m0.280s
> = sys   0m16.595s   0m16.478s   0m14.072s
>
> Test #3 iozone -t 3 -R -r 24K -s 716M -I +Z
>  Initial write
Re: zram: per-cpu compression streams
( was "[PATCH] zram: export the number of available comp streams"
  forked from http://marc.info/?l=linux-kernel&m=145860707516861 )

d'oh sorry, now actually forked.

Hello Minchan,
forked into a separate thread.

> On (03/22/16 09:39), Minchan Kim wrote:
> > zram_bvec_write()
> > {
> > 	get_cpu_ptr(comp->stream);
> > 	zcomp_compress();
> > 	zs_malloc();
> > 	put_cpu_ptr(comp->stream);
> > }
> >
> > this, however, makes zsmalloc unhappy. the pool has GFP_NOIO | __GFP_HIGHMEM
> > gfp, and GFP_NOIO is ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM. this
> > __GFP_DIRECT_RECLAIM is in conflict with per-cpu streams, because
> > per-cpu streams require disabled preemption (up until we copy the stream
> > buffer to the zspage). so what options do we have here... from the top of
> > my head (w/o a lot of thinking)...
>
> Indeed.
...
> How about this?
>
> zram_bvec_write()
> {
> retry:
> 	get_cpu_ptr(comp->stream);
> 	zcomp_compress();
> 	handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
> 	if (!handle) {
> 		put_cpu_ptr(comp->stream);
> 		handle = zs_malloc(gfp);
> 		goto retry;
> 	}
> 	put_cpu_ptr(comp->stream);
> }

interesting. the retry jump should go higher, we have "user_mem = kmap_atomic(page)"
which we unmap right after compression, because a) we don't need the
uncompressed memory anymore b) zs_malloc() can sleep and we can't have an
atomic mapping around. the nasty thing here is is_partial_io(). we need to re-do

	if (is_partial_io(bvec))
		memcpy(uncmem + offset, user_mem + bvec->bv_offset,
			bvec->bv_len);

once again in the worst case.

so zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) so far can cause
a double memcpy() and double compression. just to outline this.


the test.

I executed a number of iozone tests, on each iteration re-creating the zram
device (3GB, LZO, EXT4; the box has 4 x86_64 CPUs).
$DEVICE_SZ=3G
$FREE_SPACE is 10% of $DEVICE_SZ
time ./iozone -t $i -R -r $((8*$i))K -s $((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M -I +Z

columns:
TEST               MAX_STREAMS 4   MAX_STREAMS 8   PER_CPU STREAMS

Test #1 iozone -t 1 -R -r 8K -s 2764M -I +Z
 Initial write      853492.31*      835868.50       839789.56
 Rewrite           1642073.88      1657255.75      1693011.50*
 Read              3384044.00*     3218727.25      3269109.50
 Re-read           3389794.50*     3243187.00      3267422.25
 Reverse Read      3209805.75*     3082040.00      3107957.25
 Stride read       3100144.50*     2972280.25      2923155.25
 Random read       2992249.75*     2874605.00      2854824.25
 Mixed workload    2992274.75*     2878212.25      2883840.00
 Random write      1471800.00      1452346.50      1515678.75*
 Pwrite             802083.00       801627.31       820251.69*
 Pread             3443495.00*     3308659.25      3302089.00
 Fwrite            1880446.88      1838607.50      1909490.00*
 Fread             3479614.75      3091634.75      6442964.50*
= real  1m4.170s    1m4.513s    1m4.123s
= user  0m0.559s    0m0.518s    0m0.511s
= sys   0m18.766s   0m19.264s   0m18.641s

Test #2 iozone -t 2 -R -r 16K -s 1228M -I +Z
 Initial write     2102532.12      2051809.19      2419072.50*
 Rewrite           2217024.25      2250930.00      3681559.00*
 Read              7716933.25      7898759.00      8345507.75*
 Re-read           7748487.75      7765282.25      8342367.50*
 Reverse Read      7415254.25      7552637.25      7822691.75*
 Stride read       7041909.50      7091049.25      7401273.00*
 Random read       6205044.25      673.50          7232104.25*
 Mixed workload    4582990.00      5271651.50      5361002.88*
 Random write      2591893.62      2513729.88      3660774.38*
 Pwrite            1873876.75      1909758.69      2087238.81*
 Pread             4669850.00      4651121.56      4919588.44*
 Fwrite            1937947.25      1940628.06      2034251.25*
 Fread             9930319.00      9970078.00*     9831422.50
= real  0m53.844s   0m53.607s   0m52.528s
= user  0m0.273s    0m0.289s    0m0.280s
= sys   0m16.595s   0m16.478s   0m14.072s

Test #3 iozone -t 3 -R -r 24K -s 716M -I +Z
 Initial write     3036567.50      2998918.25      3683853.00*
 Rewrite           3402447.88      3415685.88      5054705.38*
 Read             11767413.00*    11133789.50     11246497.25
 Re-read          11797680.50*    11092592.00     11277382.00
 Reverse Read     10828320.00*    10157665.50     10749055.00
 Stride rea