Re: zram: per-cpu compression streams

2016-04-27 Thread Sergey Senozhatsky
On (04/27/16 17:54), Sergey Senozhatsky wrote:
> #jobs4
> READ:   19948MB/s  20013MB/s
> READ:   17732MB/s  17479MB/s
> WRITE:  630690KB/s 495078KB/s
> WRITE:  1843.2MB/s 2226.9MB/s
> READ:   1603.4MB/s 1846.8MB/s
> WRITE:  1599.4MB/s 1842.2MB/s
> READ:   1547.7MB/s 1740.7MB/s
> WRITE:  1549.2MB/s 1742.4MB/s

> jobs4
> stalled-cycles-frontend 265,519,049,536 (  64.46%)    221,049,841,649 (  61.81%)
> stalled-cycles-backend  146,538,881,296 (  35.57%)    113,774,053,039 (  31.82%)
> instructions            298,241,854,695 (    0.72)    278,000,866,874 (    0.78)
> branches                 59,531,800,053 ( 400.919)     55,096,944,109 ( 427.816)
> branch-misses               285,108,083 (   0.48%)        260,972,185 (   0.47%)
> 
> seconds elapsed          47.816933840                  52.966896478

per-cpu in general looks better in this test (jobs4): fewer stalls, fewer
branches, fewer misses, better fio speeds (except for WRITE: 630690KB/s vs
495078KB/s).
the system was under pressure, so it's quite possible that it took more time
to kill the process, which is why the elapsed time favors the 8-streams test.

-ss


Re: zram: per-cpu compression streams

2016-04-27 Thread Sergey Senozhatsky

Hello,

more tests. this time I compared only 8 streams vs per-cpu. the changes
to the test are:
-- the mem-hogger now pre-faults pages in parallel with fio (see the sketch below)
-- the mem-hogger alloc size was increased from 3GB to 4GB.

the system couldn't survive a 4GB/4GB zram(buffer_compress_percentage=11)/mem-hogger
split (OOM), so I ran the 3GB/4GB test instead (close to the system's OOM edge).
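
a rough sketch of that parallel arrangement (single-alloc is the mem-hogger
whose log lines appear below; its command-line interface here is an
assumption, not the exact harness):

    ./single-alloc 4G &          # allocate and pre-fault 4GB of anonymous memory
    HOG_PID=$!
    fio ./fio-test >> $LOG       # the usual fio run against /dev/zram0, as in the runner script
    wait $HOG_PID                # single-alloc prints its pre-fault time on exit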

-- 4 GB x86_64
-- 3 GB zram lzo

first, the mm_stat.

8 streams (base kernel):

3221225472 3221225472 32212254720 322122956800 < 2752460/   0>
3221225472 3221225472 32212254720 322123366400 < 5504124/   0>
3221225472 2912157607 29528023040 29528268800   81 < 8253369/   0>
3221225472 2893479936 28991201280 28991365120  147 <11003056/   0>
3221217280 2886040814 28991037440 28991283200   26 <13748450/   0>
3221225472 2880045056 28856934400 28857180160  180 <16503120/   0>
3221213184 2877431364 28837560320 28838092800  132 <19259891/   0>
3221225472 2873229312 28760965120 28761333760   16 <22016512/   0>
3221213184 2870728008 28716933120 28717260800   24 <24768909/   0>
2899095552 2899095552 28990955520 2899132416786430 <27523600/   0>

per-cpu:

3221225472 3221225472 32212254720 322122956800 < 2752460/8180>
3221225472 3221225472 32212254720 322123366400 < 5504124/   10523>
3221225472 2912157607 29528023040 29528145920  117 < 8253369/9451>
3221225472 2893479936 28991201280 28991365120  129 <11003056/9395>
3221217280 2886040814 28991037440 28991283200   51 <13748450/   10879>
3221225472 2880045056 28856934400 28857180160  126 <16503120/   10300>
3221213184 2877431364 28837724160 28838010880  252 <19259891/   10509>
3221225472 2873229312 28761006080 28761333760   14 <22016512/   11081>
3221213184 2870728008 28716933120 28717301760   54 <24768909/   10770>
2899095552 2899095552 28990955520 2899136512786430 <27523600/   10231>


mem-hogger pre-fault times

8 streams (base kernel):

[431] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f3f5d38a010 <+   6.031550428>
[470] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7fa29d414010 <+   5.242295692>
[514] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f4a7eac8010 <+   5.485469454>
[563] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f07da76b010 <+   5.563647658>
[619] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7ff5efc26010 <+   5.516866208>
[681] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f8fb896d010 <+   5.535275748>
[751] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7fb2ac6fa010 <+   4.594626366>
[825] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f355f9a0010 <+   5.075849029>
[905] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7feb16715010 <+   4.696363680>
[991] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f3a1b9f4010 <+   5.292365453>


per-cpu:

[413] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7fe8058f5010 <+   5.513944292>
[451] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f65fe753010 <+   4.742384977>
[494] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7fb99a05c010 <+   5.394711696>
[542] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f0d61c81010 <+   5.021011664>
[598] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f9abdeb6010 <+   5.094722019>
[660] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7fb192ae9010 <+   4.943961060>
[728] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f7313aeb010 <+   5.437872456>
[802] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f25ffdeb010 <+   5.422829590>
[881] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f60daa8e010 <+   4.806425351>
[970] single-alloc:  INFO: Allocated 0x1 bytes at address 
0x7f384cf04010 <+   4.982513395>


so the pre-fault time range is fairly wide: for example, from 4.696363680 to
6.031550428 seconds.



fio
            8 streams    per-cpu
===
#jobs1  
READ:   2507.8MB/s   2526.4MB/s
READ:   2043.1MB/s   1970.6MB/s
WRITE:  127100KB/s   139160KB/s
WRITE:  724488KB/s   733440KB/s
READ:   534624KB/s   540967KB/s
WRITE:  534569KB/s   540912KB/s
READ:   471165KB/s   477459KB/s
WRITE:  471233KB/s   477527KB/s
#jobs2  

Re: zram: per-cpu compression streams

2016-04-27 Thread Sergey Senozhatsky
On (04/27/16 16:55), Minchan Kim wrote:
[..]
> > > Could you test a concurrent mem-hogger running alongside fio, rather than
> > > a pre-fault before the fio test, in the next submission?
> > 
> > this test will not prove anything, unfortunately. I performed it,
> > and it's impossible to guarantee even remotely stable results.
> > the mem-hogger process can spend anywhere from 41 to 81 seconds on
> > the pre-fault, so I'm quite sceptical about the actual value of this test.
> > 
> > > > considering buffer_compress_percentage=11, the box was under somewhat
> > > > heavy pressure.
> > > > 
> > > > now, the results
> > > 
> > > Yep. Even the re-compression case is faster than the old code, but I want
> > > to see a heavier memory pressure case and the ratio I mentioned above.
> > 
> > I did quite heavy testing over the last 7 days, with numerous OOM kills
> > and OOM panics.
> 
> Okay, I think it's worth merging now and seeing the result.
> Please send a formal patch which has the recompression stat. ;-)


correction: those 41-81s spikes in the mem-hogger were observed under a
different scenario: 10GB zram with a 6GB mem-hogger on a 4GB system.

I'll do another round of tests (with a parallel mem-hogger pre-fault
and a 4GB/4GB zram/mem-hogger split) and collect the numbers that you
asked for.

thanks!

-ss


Re: zram: per-cpu compression streams

2016-04-27 Thread Minchan Kim
On Wed, Apr 27, 2016 at 04:43:35PM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (04/27/16 16:29), Minchan Kim wrote:
> [..]
> > > the test:
> > > 
> > > -- 4 GB x86_64 box
> > > -- zram 3GB, lzo
> > > -- mem-hogger pre-faults 3GB of pages before the fio test
> > > -- fio test has been modified to have an 11% compression ratio (to increase
> > >    the chances of re-compressions)
> > 
> > Could you test a concurrent mem-hogger running alongside fio, rather than
> > a pre-fault before the fio test, in the next submission?
> 
> this test will not prove anything, unfortunately. I performed it,
> and it's impossible to guarantee even remotely stable results.
> the mem-hogger process can spend anywhere from 41 to 81 seconds on
> the pre-fault, so I'm quite sceptical about the actual value of this test.
> 
> > > considering buffer_compress_percentage=11, the box was under somewhat
> > > heavy pressure.
> > > 
> > > now, the results
> > 
> > Yep. Even the re-compression case is faster than the old code, but I want
> > to see a heavier memory pressure case and the ratio I mentioned above.
> 
> I did quite heavy testing over the last 7 days, with numerous OOM kills
> and OOM panics.

Okay, I think it's worth merging now and seeing the result.
Please send a formal patch which has the recompression stat. ;-)

Thanks.


Re: zram: per-cpu compression streams

2016-04-27 Thread Sergey Senozhatsky
Hello,

On (04/27/16 16:29), Minchan Kim wrote:
[..]
> > the test:
> > 
> > -- 4 GB x86_64 box
> > -- zram 3GB, lzo
> > -- mem-hogger pre-faults 3GB of pages before the fio test
> > -- fio test has been modified to have an 11% compression ratio (to increase the
> >    chances of re-compressions)
> 
> Could you test a concurrent mem-hogger running alongside fio, rather than
> a pre-fault before the fio test, in the next submission?

this test will not prove anything, unfortunately. I performed it,
and it's impossible to guarantee even remotely stable results.
the mem-hogger process can spend anywhere from 41 to 81 seconds on
the pre-fault, so I'm quite sceptical about the actual value of this test.

> > considering buffer_compress_percentage=11, the box was under somewhat
> > heavy pressure.
> > 
> > now, the results
> 
> Yep. Even the re-compression case is faster than the old code, but I want to
> see a heavier memory pressure case and the ratio I mentioned above.

I did quite heavy testing over the last 7 days, with numerous OOM kills
and OOM panics.

-ss


Re: zram: per-cpu compression streams

2016-04-27 Thread Minchan Kim
Hello Sergey,

On Tue, Apr 26, 2016 at 08:23:05PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (04/19/16 17:00), Minchan Kim wrote:
> [..]
> > I'm convinced now with your data. Super thanks!
> > However, as you know, we need data on how bad it is under heavy memory pressure.
> > Maybe you can test it with fio and a background memory hogger,
> 
> it's really hard to produce stable test results when the system
> is under mem pressure.
> 
> first, I modified zram to export the re-compression number
> (put cpu stream and re-try handler allocation)
> 
> mm_stat for numjobs{1..10}. the number of re-compressions is in "< NUM>" 
> format
> 
> 3221225472 3221225472 32212254720 322122956800 <
> 6421>
> 3221225472 3221225472 32212254720 322123366400 <
> 6998>
> 3221225472 2912157607 29528023040 29528145920   84 <
> 7271>
> 3221225472 2893479936 28991201280 28991365120  156 <
> 8260>
> 3221217280 2886040814 28990996480 28991283200   78 <
> 8297>
> 3221225472 2880045056 28856934400 28857180160   54 <
> 7794>
> 3221213184 2877431364 28837560320 28838010880  144 <
> 7336>
> 3221225472 2873229312 28760965120 28761333760   28 <
> 8699>
> 3221213184 2870728008 28716933120 28717301760   30 <
> 8189>
> 2899095552 2899095552 28990955520 2899136512786430 <
> 7485>

It would be great to see the ratio below for each test:

1-compression : 2(re)-compression

> 
> as we can see, the number of re-compressions can vary from 6421 to 8699.
> 
> 
> the test:
> 
> -- 4 GB x86_64 box
> -- zram 3GB, lzo
> -- mem-hogger pre-faults 3GB of pages before the fio test
> -- fio test has been modified to have 11% compression ratio (to increase the
>   chances of re-compressions)

Could you test a concurrent mem-hogger running alongside fio, rather than a
pre-fault before the fio test, in the next submission?

>-- buffer_compress_percentage=11
>-- scramble_buffers=0
> 
> 
> considering buffer_compress_percentage=11, the box was under somewhat
> heavy pressure.
> 
> now, the results

Yep. Even the re-compression case is faster than the old code, but I want to
see a heavier memory pressure case and the ratio I mentioned above.

If the result is still good, please send a public patch with the numbers.
Thanks for looking into this, Sergey!

> 
> 
> fio stats
> 
>     4 streams    8 streams   per cpu
> ===
> #jobs1
> READ:   2411.4MB/s 2430.4MB/s  2440.4MB/s
> READ:   2094.8MB/s 2002.7MB/s  2034.5MB/s
> WRITE:  141571KB/s 140334KB/s  143542KB/s
> WRITE:  712025KB/s 706111KB/s  745256KB/s
> READ:   531014KB/s 525250KB/s  537547KB/s
> WRITE:  530960KB/s 525197KB/s  537492KB/s
> READ:   473577KB/s 470320KB/s  476880KB/s
> WRITE:  473645KB/s 470387KB/s  476948KB/s
> #jobs2
> READ:   7897.2MB/s 8031.4MB/s  7968.9MB/s
> READ:   6864.9MB/s 6803.2MB/s  6903.4MB/s
> WRITE:  321386KB/s 314227KB/s  313101KB/s
> WRITE:  1275.3MB/s 1245.6MB/s  1383.5MB/s
> READ:   1035.5MB/s 1021.9MB/s  1098.4MB/s
> WRITE:  1035.6MB/s 1021.1MB/s  1098.6MB/s
> READ:   972014KB/s 952321KB/s  987.66MB/s
> WRITE:  969792KB/s 950144KB/s  985.40MB/s
> #jobs3
> READ:   13260MB/s  13260MB/s   13222MB/s
> READ:   11636MB/s  11636MB/s   11755MB/s
> WRITE:  511500KB/s 507730KB/s  504959KB/s
> WRITE:  1646.1MB/s 1673.9MB/s  1755.5MB/s
> READ:   1389.5MB/s 1387.2MB/s  1479.6MB/s
> WRITE:  1387.6MB/s 1385.3MB/s  1477.4MB/s
> READ:   1286.8MB/s 1289.1MB/s  1377.3MB/s
> WRITE:  1284.8MB/s 1287.1MB/s  1374.9MB/s
> #jobs4
> READ:   19851MB/s  20244MB/s   20344MB/s
> READ:   17732MB/s  17835MB/s   18097MB/s
> WRITE:  667776KB/s 655599KB/s  693464KB/s
> WRITE:  2041.2MB/s 2072.6MB/s  2474.1MB/s
> READ:   1770.1MB/s 1781.7MB/s  2035.5MB/s
> WRITE:  1765.8MB/s 1777.3MB/s  2030.5MB/s
> READ:   1641.6MB/s 1672.4MB/s  1892.5MB/s
> WRITE:  1643.2MB/s 1674.2MB/s  1894.4MB/s
> #jobs5
> READ:   19468MB/s  1848

Re: zram: per-cpu compression streams

2016-04-26 Thread Sergey Senozhatsky
Hello Minchan,

On (04/19/16 17:00), Minchan Kim wrote:
[..]
> I'm convinced now with your data. Super thanks!
> However, as you know, we need data on how bad it is under heavy memory pressure.
> Maybe you can test it with fio and a background memory hogger,

it's really hard to produce stable test results when the system
is under mem pressure.

first, I modified zram to export the re-compression count (taken where we put
the cpu stream and re-try the handle allocation).

mm_stat for numjobs{1..10}; the number of re-compressions is in the "< NUM>" field:

3221225472 3221225472 32212254720 322122956800 < 6421>
3221225472 3221225472 32212254720 322123366400 < 6998>
3221225472 2912157607 29528023040 29528145920   84 < 7271>
3221225472 2893479936 28991201280 28991365120  156 < 8260>
3221217280 2886040814 28990996480 28991283200   78 < 8297>
3221225472 2880045056 28856934400 28857180160   54 < 7794>
3221213184 2877431364 28837560320 28838010880  144 < 7336>
3221225472 2873229312 28760965120 28761333760   28 < 8699>
3221213184 2870728008 28716933120 28717301760   30 < 8189>
2899095552 2899095552 28990955520 2899136512786430 < 7485>

as we can see, the number of re-compressions can vary from 6421 to 8699.
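
for reference, a one-liner to pull that spread out of such a log (assuming the
per-run mm_stat lines were appended to a file, here called mm_stat.log):

    grep -o '<[^>]*>' mm_stat.log | tr -d '<> ' | sort -n | sed -n '1p;$p'

it prints the smallest and the largest re-compression count.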


the test:

-- 4 GB x86_64 box
-- zram 3GB, lzo
-- mem-hogger pre-faults 3GB of pages before the fio test
-- fio test has been modified to have 11% compression ratio (to increase the
  chances of re-compressions)
   -- buffer_compress_percentage=11
   -- scramble_buffers=0
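
the two fio options above can be applied to the template posted later in the
thread roughly like this (the output file name is an assumption):

    sed -e 's/^buffer_pattern=.*/buffer_compress_percentage=11/' \
        -e 's/^scramble_buffers=.*/scramble_buffers=0/' \
        fio-test-template > fio-test-11pct

this keeps both options in the [global] section of the job file.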


considering buffer_compress_percentage=11, the box was under somewhat
heavy pressure.

now, the results


fio stats

    4 streams    8 streams   per cpu
===
#jobs1  
READ:   2411.4MB/s   2430.4MB/s  2440.4MB/s
READ:   2094.8MB/s   2002.7MB/s  2034.5MB/s
WRITE:  141571KB/s   140334KB/s  143542KB/s
WRITE:  712025KB/s   706111KB/s  745256KB/s
READ:   531014KB/s   525250KB/s  537547KB/s
WRITE:  530960KB/s   525197KB/s  537492KB/s
READ:   473577KB/s   470320KB/s  476880KB/s
WRITE:  473645KB/s   470387KB/s  476948KB/s
#jobs2  
READ:   7897.2MB/s   8031.4MB/s  7968.9MB/s
READ:   6864.9MB/s   6803.2MB/s  6903.4MB/s
WRITE:  321386KB/s   314227KB/s  313101KB/s
WRITE:  1275.3MB/s   1245.6MB/s  1383.5MB/s
READ:   1035.5MB/s   1021.9MB/s  1098.4MB/s
WRITE:  1035.6MB/s   1021.1MB/s  1098.6MB/s
READ:   972014KB/s   952321KB/s  987.66MB/s
WRITE:  969792KB/s   950144KB/s  985.40MB/s
#jobs3  
READ:   13260MB/s13260MB/s   13222MB/s
READ:   11636MB/s11636MB/s   11755MB/s
WRITE:  511500KB/s   507730KB/s  504959KB/s
WRITE:  1646.1MB/s   1673.9MB/s  1755.5MB/s
READ:   1389.5MB/s   1387.2MB/s  1479.6MB/s
WRITE:  1387.6MB/s   1385.3MB/s  1477.4MB/s
READ:   1286.8MB/s   1289.1MB/s  1377.3MB/s
WRITE:  1284.8MB/s   1287.1MB/s  1374.9MB/s
#jobs4  
READ:   19851MB/s20244MB/s   20344MB/s
READ:   17732MB/s17835MB/s   18097MB/s
WRITE:  667776KB/s   655599KB/s  693464KB/s
WRITE:  2041.2MB/s   2072.6MB/s  2474.1MB/s
READ:   1770.1MB/s   1781.7MB/s  2035.5MB/s
WRITE:  1765.8MB/s   1777.3MB/s  2030.5MB/s
READ:   1641.6MB/s   1672.4MB/s  1892.5MB/s
WRITE:  1643.2MB/s   1674.2MB/s  1894.4MB/s
#jobs5  
READ:   19468MB/s18484MB/s   18439MB/s
READ:   17594MB/s17757MB/s   17716MB/s
WRITE:  843266KB/s   859627KB/s  867928KB/s
WRITE:  1927.1MB/s   2041.8MB/s  2168.9MB/s
READ:   1718.6MB/s   1771.7MB/s  1963.5MB/s
WRITE:  1712.7MB/s   1765.6MB/s  1956.8MB/s
READ:   1705.3MB/s   1663.6MB/s  1767.3MB/s
WRITE:  1704.3MB/s   1662.6MB/s  1766.2MB/s
#jobs6  
READ:   21583MB/s21685MB/s   21483MB/s
READ:   19160MB/s18432MB/s   18618MB/s
WRITE:  986276KB/s   1004.2MB/s  981.11MB/

Re: zram: per-cpu compression streams

2016-04-19 Thread Sergey Senozhatsky
Hello Minchan,

On (04/19/16 17:00), Minchan Kim wrote:
> Great!
> 
> So, based on your experiment, the reason I couldn't see such a huge win
> on my machine is the cache size difference (i.e., yours is twice mine,
> IIRC), and my perf stat didn't show such a big difference.
> If I have time, I will test it on a bigger machine.

quite possible it's due to the cache size.

[..]
> > NOTE:
> > -- fio does not seem to write more than the disk size to the device, so
> >    the test doesn't include the 're-compression path'.
> 
> I'm convinced now with your data. Super thanks!
> However, as you know, we need data on how bad it is under heavy memory pressure.
> Maybe you can test it with fio and a background memory hogger,

yeah, sure, will work on it.

> Thanks for the test, Sergey!

thanks!

-ss


Re: zram: per-cpu compression streams

2016-04-19 Thread Minchan Kim
On Mon, Apr 18, 2016 at 04:57:58PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> sorry, it took me so long to return back to testing.
> 
> I collected extended stats (perf), just like you requested.
> - 3G zram, lzo; 4 CPU x86_64 box.
> - fio with perf stat
> 
>     4 streams    8 streams   per-cpu
> ===
> #jobs1
> READ:   2520.1MB/s 2566.5MB/s  2491.5MB/s
> READ:   2102.7MB/s 2104.2MB/s  2091.3MB/s
> WRITE:  1355.1MB/s 1320.2MB/s  1378.9MB/s
> WRITE:  1103.5MB/s 1097.2MB/s  1122.5MB/s
> READ:   434013KB/s 435153KB/s  439961KB/s
> WRITE:  433969KB/s 435109KB/s  439917KB/s
> READ:   403166KB/s 405139KB/s  403373KB/s
> WRITE:  403223KB/s 405197KB/s  403430KB/s
> #jobs2
> READ:   7958.6MB/s 8105.6MB/s  8073.7MB/s
> READ:   6864.9MB/s 6989.8MB/s  7021.8MB/s
> WRITE:  2438.1MB/s 2346.9MB/s  3400.2MB/s
> WRITE:  1994.2MB/s 1990.3MB/s  2941.2MB/s
> READ:   981504KB/s 973906KB/s  1018.8MB/s
> WRITE:  981659KB/s 974060KB/s  1018.1MB/s
> READ:   937021KB/s 938976KB/s  987250KB/s
> WRITE:  934878KB/s 936830KB/s  984993KB/s
> #jobs3
> READ:   13280MB/s  13553MB/s   13553MB/s
> READ:   11534MB/s  11785MB/s   11755MB/s
> WRITE:  3456.9MB/s 3469.9MB/s  4810.3MB/s
> WRITE:  3029.6MB/s 3031.6MB/s  4264.8MB/s
> READ:   1363.8MB/s 1362.6MB/s  1448.9MB/s
> WRITE:  1361.9MB/s 1360.7MB/s  1446.9MB/s
> READ:   1309.4MB/s 1310.6MB/s  1397.5MB/s
> WRITE:  1307.4MB/s 1308.5MB/s  1395.3MB/s
> #jobs4
> READ:   20244MB/s  20177MB/s   20344MB/s
> READ:   17886MB/s  17913MB/s   17835MB/s
> WRITE:  4071.6MB/s 4046.1MB/s  6370.2MB/s
> WRITE:  3608.9MB/s 3576.3MB/s  5785.4MB/s
> READ:   1824.3MB/s 1821.6MB/s  1997.5MB/s
> WRITE:  1819.8MB/s 1817.4MB/s  1992.5MB/s
> READ:   1765.7MB/s 1768.3MB/s  1937.3MB/s
> WRITE:  1767.5MB/s 1769.1MB/s  1939.2MB/s
> #jobs5
> READ:   18663MB/s  18986MB/s   18823MB/s
> READ:   16659MB/s  16605MB/s   16954MB/s
> WRITE:  3912.4MB/s 3888.7MB/s  6126.9MB/s
> WRITE:  3506.4MB/s 3442.5MB/s  5519.3MB/s
> READ:   1798.2MB/s 1746.5MB/s  1935.8MB/s
> WRITE:  1792.7MB/s 1740.7MB/s  1929.1MB/s
> READ:   1727.6MB/s 1658.2MB/s  1917.3MB/s
> WRITE:  1726.5MB/s 1657.2MB/s  1916.6MB/s
> #jobs6
> READ:   21017MB/s  20922MB/s   21162MB/s
> READ:   19022MB/s  19140MB/s   18770MB/s
> WRITE:  3968.2MB/s 4037.7MB/s  6620.8MB/s
> WRITE:  3643.5MB/s 3590.2MB/s  6027.5MB/s
> READ:   1871.8MB/s 1880.5MB/s  2049.9MB/s
> WRITE:  1867.8MB/s 1877.2MB/s  2046.2MB/s
> READ:   1755.8MB/s 1710.3MB/s  1964.7MB/s
> WRITE:  1750.5MB/s 1705.9MB/s  1958.8MB/s
> #jobs7
> READ:   21103MB/s  20677MB/s   21482MB/s
> READ:   18522MB/s  18379MB/s   19443MB/s
> WRITE:  4022.5MB/s 4067.4MB/s  6755.9MB/s
> WRITE:  3691.7MB/s 3695.5MB/s  5925.6MB/s
> READ:   1841.5MB/s 1933.9MB/s  2090.5MB/s
> WRITE:  1842.7MB/s 1935.3MB/s  2091.9MB/s
> READ:   1832.4MB/s 1856.4MB/s  1971.5MB/s
> WRITE:  1822.3MB/s 1846.2MB/s  1960.6MB/s
> #jobs8
> READ:   20463MB/s  20194MB/s   20862MB/s
> READ:   18178MB/s  17978MB/s   18299MB/s
> WRITE:  4085.9MB/s 4060.2MB/s  7023.8MB/s
> WRITE:  3776.3MB/s 3737.9MB/s  6278.2MB/s
> READ:   1957.6MB/s 1944.4MB/s  2109.5MB/s
> WRITE:  1959.2MB/s 1946.2MB/s  2111.4MB/s
> READ:   1900.6MB/s 1885.7MB/s  2082.1MB/s
> WRITE:  1896.2MB/s 1881.4MB/s  2078.3MB/s
> #jobs9
> READ:   19692MB/s  19734MB/s   19334MB

Re: zram: per-cpu compression streams

2016-04-18 Thread Sergey Senozhatsky
Hello Minchan,
sorry it took me so long to get back to testing.

I collected extended stats (perf), just like you requested.
- 3G zram, lzo; 4 CPU x86_64 box.
- fio with perf stat
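
for reference, the collection would look roughly like this (the exact wrapper
isn't shown in the thread; the event list is assumed from the counters quoted
elsewhere in the thread):

    perf stat -e stalled-cycles-frontend,stalled-cycles-backend,instructions,branches,branch-misses \
        fio ./fio-test >> $LOG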

    4 streams    8 streams   per-cpu
===
#jobs1  
READ:   2520.1MB/s   2566.5MB/s  2491.5MB/s
READ:   2102.7MB/s   2104.2MB/s  2091.3MB/s
WRITE:  1355.1MB/s   1320.2MB/s  1378.9MB/s
WRITE:  1103.5MB/s   1097.2MB/s  1122.5MB/s
READ:   434013KB/s   435153KB/s  439961KB/s
WRITE:  433969KB/s   435109KB/s  439917KB/s
READ:   403166KB/s   405139KB/s  403373KB/s
WRITE:  403223KB/s   405197KB/s  403430KB/s
#jobs2  
READ:   7958.6MB/s   8105.6MB/s  8073.7MB/s
READ:   6864.9MB/s   6989.8MB/s  7021.8MB/s
WRITE:  2438.1MB/s   2346.9MB/s  3400.2MB/s
WRITE:  1994.2MB/s   1990.3MB/s  2941.2MB/s
READ:   981504KB/s   973906KB/s  1018.8MB/s
WRITE:  981659KB/s   974060KB/s  1018.1MB/s
READ:   937021KB/s   938976KB/s  987250KB/s
WRITE:  934878KB/s   936830KB/s  984993KB/s
#jobs3  
READ:   13280MB/s13553MB/s   13553MB/s
READ:   11534MB/s11785MB/s   11755MB/s
WRITE:  3456.9MB/s   3469.9MB/s  4810.3MB/s
WRITE:  3029.6MB/s   3031.6MB/s  4264.8MB/s
READ:   1363.8MB/s   1362.6MB/s  1448.9MB/s
WRITE:  1361.9MB/s   1360.7MB/s  1446.9MB/s
READ:   1309.4MB/s   1310.6MB/s  1397.5MB/s
WRITE:  1307.4MB/s   1308.5MB/s  1395.3MB/s
#jobs4  
READ:   20244MB/s20177MB/s   20344MB/s
READ:   17886MB/s17913MB/s   17835MB/s
WRITE:  4071.6MB/s   4046.1MB/s  6370.2MB/s
WRITE:  3608.9MB/s   3576.3MB/s  5785.4MB/s
READ:   1824.3MB/s   1821.6MB/s  1997.5MB/s
WRITE:  1819.8MB/s   1817.4MB/s  1992.5MB/s
READ:   1765.7MB/s   1768.3MB/s  1937.3MB/s
WRITE:  1767.5MB/s   1769.1MB/s  1939.2MB/s
#jobs5  
READ:   18663MB/s18986MB/s   18823MB/s
READ:   16659MB/s16605MB/s   16954MB/s
WRITE:  3912.4MB/s   3888.7MB/s  6126.9MB/s
WRITE:  3506.4MB/s   3442.5MB/s  5519.3MB/s
READ:   1798.2MB/s   1746.5MB/s  1935.8MB/s
WRITE:  1792.7MB/s   1740.7MB/s  1929.1MB/s
READ:   1727.6MB/s   1658.2MB/s  1917.3MB/s
WRITE:  1726.5MB/s   1657.2MB/s  1916.6MB/s
#jobs6  
READ:   21017MB/s20922MB/s   21162MB/s
READ:   19022MB/s19140MB/s   18770MB/s
WRITE:  3968.2MB/s   4037.7MB/s  6620.8MB/s
WRITE:  3643.5MB/s   3590.2MB/s  6027.5MB/s
READ:   1871.8MB/s   1880.5MB/s  2049.9MB/s
WRITE:  1867.8MB/s   1877.2MB/s  2046.2MB/s
READ:   1755.8MB/s   1710.3MB/s  1964.7MB/s
WRITE:  1750.5MB/s   1705.9MB/s  1958.8MB/s
#jobs7  
READ:   21103MB/s20677MB/s   21482MB/s
READ:   18522MB/s18379MB/s   19443MB/s
WRITE:  4022.5MB/s   4067.4MB/s  6755.9MB/s
WRITE:  3691.7MB/s   3695.5MB/s  5925.6MB/s
READ:   1841.5MB/s   1933.9MB/s  2090.5MB/s
WRITE:  1842.7MB/s   1935.3MB/s  2091.9MB/s
READ:   1832.4MB/s   1856.4MB/s  1971.5MB/s
WRITE:  1822.3MB/s   1846.2MB/s  1960.6MB/s
#jobs8  
READ:   20463MB/s20194MB/s   20862MB/s
READ:   18178MB/s17978MB/s   18299MB/s
WRITE:  4085.9MB/s   4060.2MB/s  7023.8MB/s
WRITE:  3776.3MB/s   3737.9MB/s  6278.2MB/s
READ:   1957.6MB/s   1944.4MB/s  2109.5MB/s
WRITE:  1959.2MB/s   1946.2MB/s  2111.4MB/s
READ:   1900.6MB/s   1885.7MB/s  2082.1MB/s
WRITE:  1896.2MB/s   1881.4MB/s  2078.3MB/s
#jobs9  
READ:   19692MB/s19734MB/s   19334MB/s
READ:   17678MB/s18249MB/s   17666MB/s
WRITE:  4004.7MB/s   4064.8MB/s  6990.7MB/s
WRITE:  3724.7MB/s   3

Re: zram: per-cpu compression streams

2016-04-03 Thread Sergey Senozhatsky
Hello Minchan,

On (04/04/16 09:27), Minchan Kim wrote:
> Hello Sergey,
> 
> On Sat, Apr 02, 2016 at 12:38:29AM +0900, Sergey Senozhatsky wrote:
> > Hello Minchan,
> > 
> > On (03/31/16 15:34), Sergey Senozhatsky wrote:
> > > > I tested with you suggested parameter.
> > > > In my side, win is better compared to my previous test but it seems
> > > > your test is so fast. IOW, filesize is small and loops is just 1.
> > > > Please test filesize=500m loops=10 or 20.
> > 
> > fio
> > - loops=10
> > - buffer_pattern=0xbadc0ffee
> > 
> > zram 6G. no intel p-state, deadline IO scheduler, no lockdep (no lock 
> > debugging).
> 
> We are using rw_page so I/O scheduler is not related.

yes, agree. added it just in case.

> Anyway, I configured my machine as you said but still see 10~20% enhance. :(
> Hmm, could you post your .config?

oh, sorry, completely forgot about it. attached.

> I want to investigate why such difference happens between our machines.
> 
> The reason I want to see such *big enhance* in my machine is that
> as you know, with per-cpu, zram's write path will lose blockable section
> so it would make upcoming features's implementation hard.

I see. well, depending on what new features are about to come in, we
can utilize the same per-cpu mechanism if we are talking about some
sort of buffers, streams, etc.

> We also should test it in very low memory situation so every write path
> retry it(i.e., dobule compression). With it, I want to see how many
> performance can drop.

one of the boxen I use has only 4G of memory, so "re-compressions" do
happen there. I can add a simple counter (just for testing purposes)
to see how often.

> If both test(normal: huge win low memory: small regression) are fine,
> we can go per-cpu approach at the cost of giving up blockable section.
> :)

yep.

-ss
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.6.0-rc1 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-dbg"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="swordfish"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accountin

Re: zram: per-cpu compression streams

2016-04-03 Thread Minchan Kim
Hello Sergey,

On Sat, Apr 02, 2016 at 12:38:29AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (03/31/16 15:34), Sergey Senozhatsky wrote:
> > > I tested with you suggested parameter.
> > > In my side, win is better compared to my previous test but it seems
> > > your test is so fast. IOW, filesize is small and loops is just 1.
> > > Please test filesize=500m loops=10 or 20.
> 
> fio
> - loops=10
> - buffer_pattern=0xbadc0ffee
> 
> zram 6G. no intel p-state, deadline IO scheduler, no lockdep (no lock 
> debugging).

We are using rw_page, so the I/O scheduler is not relevant.
Anyway, I configured my machine as you said but still see only a 10~20% improvement. :(
Hmm, could you post your .config?
I want to investigate why such difference happens between our machines.

The reason I want to see such a *big enhancement* on my machine is that,
as you know, with per-cpu streams zram's write path will lose its blockable
section, which would make implementing upcoming features harder.

We should also test it in a very-low-memory situation, so that every write
path retries (i.e., double compression). With that, I want to see how much
performance can drop.

If both tests (normal: huge win; low memory: small regression) are fine,
we can go with the per-cpu approach at the cost of giving up the blockable
section. :)

Thanks.


> 
> 
> test          8 streams    per-cpu
> 
> #jobs1
> READ:   4118.2MB/s 4105.3MB/s
> READ:   3487.7MB/s 3624.9MB/s
> WRITE:  2197.8MB/s 2305.1MB/s
> WRITE:  1776.2MB/s 1887.5MB/s
> READ:   736589KB/s 745648KB/s

> WRITE:  736353KB/s 745409KB/s
> READ:   679279KB/s 686559KB/s
> WRITE:  679093KB/s 686371KB/s
> #jobs2
> READ:   6924.6MB/s 7160.2MB/s
> READ:   6213.2MB/s 6247.1MB/s
> WRITE:  2510.3MB/s 3680.1MB/s
> WRITE:  2286.2MB/s 3153.9MB/s
> READ:   1163.1MB/s 1333.7MB/s
> WRITE:  1163.4MB/s 1332.2MB/s
> READ:   1122.9MB/s 1240.3MB/s
> WRITE:  1121.9MB/s 1239.2MB/s
> #jobs3
> READ:   10304MB/s  10424MB/s
> READ:   9014.5MB/s 9014.5MB/s
> WRITE:  3883.9MB/s 5373.8MB/s
> WRITE:  3549.1MB/s 4576.4MB/s
> READ:   1704.4MB/s 1916.8MB/s
> WRITE:  1704.9MB/s 1915.9MB/s
> READ:   1603.5MB/s 1806.8MB/s
> WRITE:  1598.8MB/s 1800.8MB/s
> #jobs4
> READ:   13509MB/s  12792MB/s
> READ:   10899MB/s  11434MB/s
> WRITE:  4027.2MB/s 6272.8MB/s
> WRITE:  3902.1MB/s 5389.2MB/s
> READ:   2090.9MB/s 2344.4MB/s
> WRITE:  2085.2MB/s 2337.1MB/s
> READ:   1968.1MB/s 2185.9MB/s
> WRITE:  1969.5MB/s 2186.4MB/s
> #jobs5
> READ:   12634MB/s  11607MB/s
> READ:   9932.7MB/s 9980.6MB/s
> WRITE:  4275.8MB/s 5844.3MB/s
> WRITE:  4210.1MB/s 5262.3MB/s
> READ:   1995.6MB/s 2211.4MB/s
> WRITE:  1988.4MB/s 2203.4MB/s
> READ:   1930.1MB/s 2191.8MB/s
> WRITE:  1929.8MB/s 2190.3MB/s
> #jobs6
> READ:   12270MB/s  13012MB/s
> READ:   11221MB/s  10815MB/s
> WRITE:  4643.4MB/s 6090.9MB/s
> WRITE:  4373.6MB/s 5772.8MB/s
> READ:   2232.6MB/s 2358.4MB/s
> WRITE:  2233.4MB/s 2359.2MB/s
> READ:   2082.6MB/s 2285.8MB/s
> WRITE:  2075.9MB/s 2278.1MB/s
> #jobs7
> READ:   13617MB/s  14172MB/s
> READ:   12290MB/s  11734MB/s
> WRITE:  5077.3MB/s 6315.7MB/s
> WRITE:  4719.4MB/s 5825.1MB/s
> READ:   2379.8MB/s 2523.7MB/s
> WRITE:  2373.7MB/s 2516.7MB/s
> READ:   2287.9MB/s 2362.4MB/s
> WRITE:  2283.9MB/s 2358.2MB/s
> #jobs8
> READ:   15130MB/s  15533MB/s
> READ:   12952MB/s  13077MB/s
> WRITE:  5586.6MB/s 7108.2MB/s
> WRITE:  5233.5MB/s 6591.3MB/s
> READ:   2541.2MB/s 2709.2MB/s
> WRITE:  2544.6MB/s 2713.2MB/s
> READ:   2450.6MB/s 2590.7MB/s
> WRITE:  2449.4MB/s 2589.3MB/s
> #jobs9
> READ:   13480MB/s  13909MB/s
> READ:   12389MB/s  12000MB/s
> WRITE:  5266.8MB/s 6594.9MB/s
> WRITE:  4971.6MB/s 6442.2MB/s
> READ:   2464.9MB/s 2470.9MB/s
> WRITE:  2482.7MB/s 2488.8MB/s
> READ:   2171.9MB/s 2402.2MB/s
> WRITE:  2174.9MB/s

Re: zram: per-cpu compression streams

2016-04-01 Thread Sergey Senozhatsky
Hello Minchan,

On (03/31/16 15:34), Sergey Senozhatsky wrote:
> > I tested with you suggested parameter.
> > In my side, win is better compared to my previous test but it seems
> > your test is so fast. IOW, filesize is small and loops is just 1.
> > Please test filesize=500m loops=10 or 20.

fio
- loops=10
- buffer_pattern=0xbadc0ffee

zram 6G. no intel p-state, deadline IO scheduler, no lockdep (no lock 
debugging).


test          8 streams    per-cpu

#jobs1  
READ:   4118.2MB/s   4105.3MB/s
READ:   3487.7MB/s   3624.9MB/s
WRITE:  2197.8MB/s   2305.1MB/s
WRITE:  1776.2MB/s   1887.5MB/s
READ:   736589KB/s   745648KB/s
WRITE:  736353KB/s   745409KB/s
READ:   679279KB/s   686559KB/s
WRITE:  679093KB/s   686371KB/s
#jobs2  
READ:   6924.6MB/s   7160.2MB/s
READ:   6213.2MB/s   6247.1MB/s
WRITE:  2510.3MB/s   3680.1MB/s
WRITE:  2286.2MB/s   3153.9MB/s
READ:   1163.1MB/s   1333.7MB/s
WRITE:  1163.4MB/s   1332.2MB/s
READ:   1122.9MB/s   1240.3MB/s
WRITE:  1121.9MB/s   1239.2MB/s
#jobs3  
READ:   10304MB/s10424MB/s
READ:   9014.5MB/s   9014.5MB/s
WRITE:  3883.9MB/s   5373.8MB/s
WRITE:  3549.1MB/s   4576.4MB/s
READ:   1704.4MB/s   1916.8MB/s
WRITE:  1704.9MB/s   1915.9MB/s
READ:   1603.5MB/s   1806.8MB/s
WRITE:  1598.8MB/s   1800.8MB/s
#jobs4  
READ:   13509MB/s12792MB/s
READ:   10899MB/s11434MB/s
WRITE:  4027.2MB/s   6272.8MB/s
WRITE:  3902.1MB/s   5389.2MB/s
READ:   2090.9MB/s   2344.4MB/s
WRITE:  2085.2MB/s   2337.1MB/s
READ:   1968.1MB/s   2185.9MB/s
WRITE:  1969.5MB/s   2186.4MB/s
#jobs5  
READ:   12634MB/s11607MB/s
READ:   9932.7MB/s   9980.6MB/s
WRITE:  4275.8MB/s   5844.3MB/s
WRITE:  4210.1MB/s   5262.3MB/s
READ:   1995.6MB/s   2211.4MB/s
WRITE:  1988.4MB/s   2203.4MB/s
READ:   1930.1MB/s   2191.8MB/s
WRITE:  1929.8MB/s   2190.3MB/s
#jobs6  
READ:   12270MB/s13012MB/s
READ:   11221MB/s10815MB/s
WRITE:  4643.4MB/s   6090.9MB/s
WRITE:  4373.6MB/s   5772.8MB/s
READ:   2232.6MB/s   2358.4MB/s
WRITE:  2233.4MB/s   2359.2MB/s
READ:   2082.6MB/s   2285.8MB/s
WRITE:  2075.9MB/s   2278.1MB/s
#jobs7  
READ:   13617MB/s14172MB/s
READ:   12290MB/s11734MB/s
WRITE:  5077.3MB/s   6315.7MB/s
WRITE:  4719.4MB/s   5825.1MB/s
READ:   2379.8MB/s   2523.7MB/s
WRITE:  2373.7MB/s   2516.7MB/s
READ:   2287.9MB/s   2362.4MB/s
WRITE:  2283.9MB/s   2358.2MB/s
#jobs8  
READ:   15130MB/s15533MB/s
READ:   12952MB/s13077MB/s
WRITE:  5586.6MB/s   7108.2MB/s
WRITE:  5233.5MB/s   6591.3MB/s
READ:   2541.2MB/s   2709.2MB/s
WRITE:  2544.6MB/s   2713.2MB/s
READ:   2450.6MB/s   2590.7MB/s
WRITE:  2449.4MB/s   2589.3MB/s
#jobs9  
READ:   13480MB/s13909MB/s
READ:   12389MB/s12000MB/s
WRITE:  5266.8MB/s   6594.9MB/s
WRITE:  4971.6MB/s   6442.2MB/s
READ:   2464.9MB/s   2470.9MB/s
WRITE:  2482.7MB/s   2488.8MB/s
READ:   2171.9MB/s   2402.2MB/s
WRITE:  2174.9MB/s   2405.5MB/s
#jobs10 
READ:   14647MB/s14667MB/s
READ:   11765MB/s12032MB/s
WRITE:  5248.7MB/s   6740.4MB/s
WRITE:  4779.8MB/s   5822.8MB/s
READ:   2448.8MB/s   2585.3MB/s
WRITE:  2449.4MB/s   2585.9MB/s
READ:   2290.5MB/s   2409.1MB/s
WRITE:  2290.2MB/s   2409.7MB/s

-ss


Re: zram: per-cpu compression streams

2016-03-30 Thread Sergey Senozhatsky
Hello Minchan,

On (03/31/16 14:53), Minchan Kim wrote:
> Hello Sergey,
>
> > that's a good question. I quickly looked into the fio source code,
> > we need to use "buffer_pattern=str" option, I think. so the buffers
> > will be filled with the same data.
> > 
> > I don't mind to have buffer_compress_percentage as a separate test (set
> > as a local test option), but I think that using common buffer pattern
> > adds more confidence when we compare test results.
> 
> If we both uses same "buffer_compress_percentage=something", it's
> good to compare. The benefit of buffer_compress_percentage is we can
> change compression ratio easily in zram testing and see various
> test to see what compression ratio or speed affects the system.

let's start with "common data" (buffer_pattern=str), not a common
compression ratio. buffer_compress_percentage=something is calculated
for which compression algorithm? deflate (zlib)? or is it something else?
we use lzo/lz4; common data is more predictable.

[..]
> > sure.
> 
> I tested with you suggested parameter.
> In my side, win is better compared to my previous test but it seems
> your test is so fast. IOW, filesize is small and loops is just 1.
> Please test filesize=500m loops=10 or 20.

that will require a 5G zram device; I don't have that much RAM on the box,
so I'll test later today on another box.

I split the device size between the jobs: if I have 10 jobs, then the file
size of each job is DEVICE_SZ/10, but in total the jobs write/read DEVICE_SZ
bytes. the runs start with one large DEVICE_SZ/1 file and go down to
10 files of DEVICE_SZ/10 each.
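
roughly, the per-job sizes work out like this (illustration only, for a 3G
device as in the runner script elsewhere in the thread; the actual script
also subtracts a small FREE_SPACE margin):

    DEVICE_SZ=$((3 * 1024 * 1024 * 1024))
    for i in 1 5 10; do
            echo "jobs=$i -> per-job size: $(($DEVICE_SZ / $i / 1024 / 1024))M"
    done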

> It can make your test more stable and enhance is 10~20% in my side.
> Let's discuss further once test result between us is consistent.

-ss


Re: zram: per-cpu compression streams

2016-03-30 Thread Minchan Kim
Hello Sergey,

On Thu, Mar 31, 2016 at 10:26:26AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/31/16 07:12), Minchan Kim wrote:
> [..]
> > > I used a bit different script. no `buffer_compress_percentage' option,
> > > because it provide "a mix of random data and zeroes"
> > 
> > Normally, zram's compression ratio is 3 or 2 so I used it.
> > Hmm, isn't it more real practice usecase?
> 
> this option guarantees that the supplied to zram data will have
> a requested compression ratio? hm, but we never do that in real
> life, zram sees random data.

I agree it's hard to create such random data with a benchmark.
One option is to share swap dump data from a real product, for example
Android or webOS, and feed it to the benchmark. But as you know, that
cannot cover every workload either. So, for an easy test, I wanted
to make data with a representative compression ratio, and fio provides an
option for that via buffer_compress_percentage.
That would be better than feeding random data, which could introduce
a lot of noise in each test cycle.

> 
> > If we don't use buffer_compress_percentage, what's the content in the 
> > buffer?
> 
> that's a good question. I quickly looked into the fio source code,
> we need to use "buffer_pattern=str" option, I think. so the buffers
> will be filled with the same data.
> 
> I don't mind to have buffer_compress_percentage as a separate test (set
> as a local test option), but I think that using common buffer pattern
> adds more confidence when we compare test results.

If we both use the same "buffer_compress_percentage=something", the results
are easy to compare. The benefit of buffer_compress_percentage is that we can
easily change the compression ratio in zram testing and run various tests
to see how the compression ratio or speed affects the system.

>  
> [..]
> > > hm, but I guess it's not enough; fio probably will have different
> > > data (well, only if we didn't ask it to zero-fill the buffers) for
> > > different tests, causing different zram->zsmalloc behaviour. need
> > > to check it.
> [..]
> > > #jobs4   
> > > READ:  8720.4MB/s  7301.7MB/s  7896.2MB/s
> > > READ:  7510.3MB/s  6690.1MB/s  6456.2MB/s
> > > WRITE: 2211.6MB/s  1930.8MB/s  2713.9MB/s
> > > WRITE: 2002.2MB/s  1629.8MB/s  2227.7MB/s
> > 
> > Your case is 40% win. It's huge, Nice!
> > I tested with your guide line(i.e., no buffer_compress_percentage,
> > scramble_buffers=0) but still 10% enhance in my machine.
> > Hmm,,,
> > 
> > How about if you test my fio job.file in your machine?
> > Still, it's 40% win?
> 
> I'll retest with new config.
> 
> > Also, I want to test again in your exactly same configuration.
> > Could you tell me zram environment(ie, disksize, compression
> > algorithm) and share me your job.file of fio?
> 
> sure.

I tested with your suggested parameters.
On my side, the win is better than in my previous test, but it seems
your test is very fast. IOW, the filesize is small and loops is just 1.
Please test with filesize=500m and loops=10 or 20.
That should make your test more stable; the enhancement is 10~20% on my side.
Let's discuss further once the test results between us are consistent.
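
concretely, the suggested change amounts to something like this against the
template quoted below (the output file name is an assumption):

    sed -e 's/^size=.*/size=500m/' -e 's/^loops=.*/loops=10/' \
        fio-test-template > fio-test-500m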

Thanks.

> 
> 3G, lzo
> 
> 
> --- my fio-template is
> 
> [global]
> bs=4k
> ioengine=sync
> direct=1
> size=__SIZE__
> numjobs=__JOBS__
> group_reporting
> filename=/dev/zram0
> loops=1
> buffer_pattern=0xbadc0ffee
> scramble_buffers=0
> 
> [seq-read]
> rw=read
> stonewall
> 
> [rand-read]
> rw=randread
> stonewall
> 
> [seq-write]
> rw=write
> stonewall
> 
> [rand-write]
> rw=randwrite
> stonewall
> 
> [mixed-seq]
> rw=rw
> stonewall
> 
> [mixed-rand]
> rw=randrw
> stonewall
> 
> 
> #separate test with
> #buffer_compress_percentage=50
> 
> 
> 
> --- my create-zram script is as follows.
> 
> 
> #!/bin/sh
> 
> rmmod zram
> modprobe zram
> 
> if [ -e /sys/block/zram0/initstate ]; then
> initdone=`cat /sys/block/zram0/initstate`
> if [ $initdone = 1 ]; then
> echo "init done"
> exit 1
> fi
> fi
> 
> echo 8 > /sys/block/zram0/max_comp_streams
> 
> echo lzo > /sys/block/zram0/comp_algorithm
> cat /sys/block/zram0/comp_algorithm
> 
> cat /sys/block/zram0/max_comp_streams
> echo $1 > /sys/block/zram0/disksize
> 
> 
> 
> 
> 
> --- and I use it as
> 
> 
> #!/bin/sh
> 
> DEVICE_SZ=$((3 * 1024 * 1024 * 1024))
> FREE_SPACE=$(($DEVICE_SZ / 10))
> LOG=/tmp/fio-zram-test
> LOG_SUFFIX=$1
> 
> function reset_zram
> {
> rmmod zram
> }
> 
> function create_zram
> {
> ./create-zram $DEVICE_SZ
> }
> 
> function main
> {
> local j
> local i
> 
> if [ "z$LOG_SUFFIX" = "z" ]; then
> LOG_SUFFIX="UNSET"
> fi
> 
> LOG=$LOG-$LOG_SUFFIX
> 
> for i in {1..10}; do
> reset_zram
> create_zram
> 
> cat fio-test-template | sed s/__JOBS__/$i/ | sed s/__SIZE__/$((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M/ > fio-test
> 

Re: zram: per-cpu compression streams

2016-03-30 Thread Sergey Senozhatsky
Hello,

On (03/31/16 07:12), Minchan Kim wrote:
[..]
> > I used a bit different script. no `buffer_compress_percentage' option,
> > because it provide "a mix of random data and zeroes"
> 
> Normally, zram's compression ratio is 3 or 2 so I used it.
> Hmm, isn't it more real practice usecase?

does this option guarantee that the data supplied to zram will have
the requested compression ratio? hm, but we never do that in real
life; zram sees random data.

> If we don't use buffer_compress_percentage, what's the content in the buffer?

that's a good question. I quickly looked into the fio source code;
we need to use the "buffer_pattern=str" option, I think, so the buffers
will be filled with the same data.

I don't mind having buffer_compress_percentage as a separate test (set
as a local test option), but I think that using a common buffer pattern
adds more confidence when we compare test results.

[..]
> > hm, but I guess it's not enough; fio probably will have different
> > data (well, only if we didn't ask it to zero-fill the buffers) for
> > different tests, causing different zram->zsmalloc behaviour. need
> > to check it.
[..]
> > #jobs4 
> > READ:  8720.4MB/s7301.7MB/s  7896.2MB/s
> > READ:  7510.3MB/s6690.1MB/s  6456.2MB/s
> > WRITE: 2211.6MB/s1930.8MB/s  2713.9MB/s
> > WRITE: 2002.2MB/s1629.8MB/s  2227.7MB/s
> 
> Your case is 40% win. It's huge, Nice!
> I tested with your guide line(i.e., no buffer_compress_percentage,
> scramble_buffers=0) but still 10% enhance in my machine.
> Hmm,,,
> 
> How about if you test my fio job.file in your machine?
> Still, it's 40% win?

I'll retest with new config.

> Also, I want to test again in your exactly same configuration.
> Could you tell me zram environment(ie, disksize, compression
> algorithm) and share me your job.file of fio?

sure.

3G, lzo


--- my fio-template is

[global]
bs=4k
ioengine=sync
direct=1
size=__SIZE__
numjobs=__JOBS__
group_reporting
filename=/dev/zram0
loops=1
buffer_pattern=0xbadc0ffee
scramble_buffers=0

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

[mixed-seq]
rw=rw
stonewall

[mixed-rand]
rw=randrw
stonewall


#separate test with
#buffer_compress_percentage=50



--- my create-zram script is as follows.


#!/bin/sh

# tear down any existing zram module and load a fresh one
rmmod zram
modprobe zram

# bail out if the device somehow ended up initialized already
if [ -e /sys/block/zram0/initstate ]; then
        initdone=`cat /sys/block/zram0/initstate`
        if [ $initdone = 1 ]; then
                echo "init done"
                exit 1
        fi
fi

# number of compression streams for this run
echo 8 > /sys/block/zram0/max_comp_streams

echo lzo > /sys/block/zram0/comp_algorithm
cat /sys/block/zram0/comp_algorithm

cat /sys/block/zram0/max_comp_streams
# disksize is passed as the first argument
echo $1 > /sys/block/zram0/disksize





--- and I use it as


#!/bin/sh

DEVICE_SZ=$((3 * 1024 * 1024 * 1024))
FREE_SPACE=$(($DEVICE_SZ / 10))
LOG=/tmp/fio-zram-test
LOG_SUFFIX=$1

function reset_zram
{
rmmod zram
}

function create_zram
{
./create-zram $DEVICE_SZ
}

function main
{
local j
local i

if [ "z$LOG_SUFFIX" = "z" ]; then
LOG_SUFFIX="UNSET"
fi

LOG=$LOG-$LOG_SUFFIX

for i in {1..10}; do
reset_zram
create_zram

cat fio-test-template | sed s/__JOBS__/$i/ | sed s/__SIZE__/$((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M/ > fio-test
echo "#jobs$i" >> $LOG
time fio ./fio-test >> $LOG
done

reset_zram
}

main




-- then I use this simple script

#!/bin/sh

if [ "z$2" = "z" ]; then
cat $1 | egrep "#jobs|READ|WRITE" | awk '{printf "%-15s %15s\n", $1, 
$3}' | sed s/aggrb=// | sed s/,//
else
cat $1 | egrep "#jobs|READ|WRITE" | awk '{printf " %-15s\n", $3}' | sed 
s/aggrb=// | sed s/\#jobs[0-9]*// | sed s/,//
fi




as 

./squeeze.sh fio-zram-test-4-stream > 4s
./squeeze.sh fio-zram-test-8-stream A > 8s
./squeeze.sh fio-zram-test-per-cpu A > pc

and

paste 4s 8s pc > result


-ss


Re: zram: per-cpu compression streams

2016-03-30 Thread Minchan Kim
On Wed, Mar 30, 2016 at 05:34:19PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> sorry for long reply.
> 
> On (03/28/16 12:21), Minchan Kim wrote:
> [..]
> > group_reporting
> > buffer_compress_percentage=50
> > filename=/dev/zram0
> > loops=10
> 
> I used a bit different script. no `buffer_compress_percentage' option,
> because it provide "a mix of random data and zeroes"

Normally, zram's compression ratio is 3 or 2, so I used that.
Hmm, isn't that closer to a real-world use case?
If we don't use buffer_compress_percentage, what is the content of the buffer?

> 
> buffer_compress_percentage=int
> If this is set, then fio will attempt to provide IO buffer content
> (on WRITEs) that compress to the specified level. Fio does this by
> providing a mix of random data and zeroes
> 
> and I also used scramble_buffers=0. but default scramble_buffers is
> true, so
> 
> scramble_buffers=bool
> If refill_buffers is too costly and the target is using data
> deduplication, then setting this option will slightly modify the IO
> buffer contents to defeat normal de-dupe attempts. This is not
> enough to defeat more clever block compression attempts, but it will
> stop naive dedupe of blocks. Default: true.
> 
> hm, but I guess it's not enough; fio probably will have different
> data (well, only if we didn't ask it to zero-fill the buffers) for
> different tests, causing different zram->zsmalloc behaviour. need
> to check it.
> 
> 
> Hmm, could you retest to show how big the benefit is?
> 
> sure. the results are:
> 
> - seq-read
> - rand-read
> - seq-write
> - rand-write  (READ + WRITE)
> - mixed-seq
> - mixed-rand  (READ + WRITE)
> 
TEST        4 streams    8 streams   per-cpu
> 
> #jobs1   
> READ:  2665.4MB/s  2515.2MB/s  2632.4MB/s
> READ:  2258.2MB/s  2055.2MB/s  2166.2MB/s
> WRITE: 933180KB/s  894260KB/s  898234KB/s
> WRITE: 765576KB/s  728154KB/s  746396KB/s
> READ:  563169KB/s  541004KB/s  551541KB/s
> WRITE: 562660KB/s  540515KB/s  551043KB/s
> READ:  493656KB/s  477990KB/s  488041KB/s
> WRITE: 493210KB/s  477558KB/s  487600KB/s
> #jobs2   
> READ:  5116.7MB/s  4607.1MB/s  4401.5MB/s
> READ:  4401.5MB/s  3993.6MB/s  3831.6MB/s
> WRITE: 1539.9MB/s  1425.5MB/s  1600.0MB/s
> WRITE: 1311.1MB/s  1228.7MB/s  1380.6MB/s
> READ:  1001.8MB/s  960799KB/s  989.63MB/s
> WRITE: 998.31MB/s  957540KB/s  986.26MB/s
> READ:  921439KB/s  860387KB/s  899720KB/s
> WRITE: 918314KB/s  857469KB/s  896668KB/s
> #jobs3   
> READ:  6670.9MB/s  6469.9MB/s  6548.8MB/s
> READ:  5743.4MB/s  5507.8MB/s  5608.4MB/s
> WRITE: 1923.8MB/s  1885.9MB/s  2191.9MB/s
> WRITE: 1622.4MB/s  1605.4MB/s  1842.2MB/s
> READ:  1277.3MB/s  1295.8MB/s  1395.2MB/s
> WRITE: 1276.9MB/s  1295.4MB/s  1394.7MB/s
> READ:  1152.6MB/s  1137.1MB/s  1216.6MB/s
> WRITE: 1152.2MB/s  1137.6MB/s  1216.2MB/s
> #jobs4   
> READ:  8720.4MB/s  7301.7MB/s  7896.2MB/s
> READ:  7510.3MB/s  6690.1MB/s  6456.2MB/s
> WRITE: 2211.6MB/s  1930.8MB/s  2713.9MB/s
> WRITE: 2002.2MB/s  1629.8MB/s  2227.7MB/s

Your case is a 40% win. It's huge, nice!
I tested with your guidelines (i.e., no buffer_compress_percentage,
scramble_buffers=0) but still see only a 10% improvement on my machine.
Hmm...

How about testing my fio job file on your machine?
Is it still a 40% win?

Also, I want to test again in exactly the same configuration as yours.
Could you tell me your zram environment (i.e., disksize, compression
algorithm) and share your fio job file?


Thanks.


Re: zram: per-cpu compression streams

2016-03-30 Thread Sergey Senozhatsky
Hello Minchan,
sorry for long reply.

On (03/28/16 12:21), Minchan Kim wrote:
[..]
> group_reporting
> buffer_compress_percentage=50
> filename=/dev/zram0
> loops=10

I used a slightly different script: no `buffer_compress_percentage' option,
because it provides "a mix of random data and zeroes"

buffer_compress_percentage=int
If this is set, then fio will attempt to provide IO buffer content
(on WRITEs) that compress to the specified level. Fio does this by
providing a mix of random data and zeroes

and I also used scramble_buffers=0. but default scramble_buffers is
true, so

scramble_buffers=bool
If refill_buffers is too costly and the target is using data
deduplication, then setting this option will slightly modify the IO
buffer contents to defeat normal de-dupe attempts. This is not
enough to defeat more clever block compression attempts, but it will
stop naive dedupe of blocks. Default: true.

hm, but I guess it's not enough; fio probably will have different
data (well, only if we didn't ask it to zero-fill the buffers) for
different tests, causing different zram->zsmalloc behaviour. need
to check it.


> Hmm, could you retest to show how big the benefit is?

sure. the results are:

- seq-read
- rand-read
- seq-write
- rand-write  (READ + WRITE)
- mixed-seq
- mixed-rand  (READ + WRITE)

TEST        4 streams    8 streams   per-cpu

#jobs1 
READ:  2665.4MB/s2515.2MB/s  2632.4MB/s
READ:  2258.2MB/s2055.2MB/s  2166.2MB/s
WRITE: 933180KB/s894260KB/s  898234KB/s
WRITE: 765576KB/s728154KB/s  746396KB/s
READ:  563169KB/s541004KB/s  551541KB/s
WRITE: 562660KB/s540515KB/s  551043KB/s
READ:  493656KB/s477990KB/s  488041KB/s
WRITE: 493210KB/s477558KB/s  487600KB/s
#jobs2 
READ:  5116.7MB/s4607.1MB/s  4401.5MB/s
READ:  4401.5MB/s3993.6MB/s  3831.6MB/s
WRITE: 1539.9MB/s1425.5MB/s  1600.0MB/s
WRITE: 1311.1MB/s1228.7MB/s  1380.6MB/s
READ:  1001.8MB/s960799KB/s  989.63MB/s
WRITE: 998.31MB/s957540KB/s  986.26MB/s
READ:  921439KB/s860387KB/s  899720KB/s
WRITE: 918314KB/s857469KB/s  896668KB/s
#jobs3 
READ:  6670.9MB/s6469.9MB/s  6548.8MB/s
READ:  5743.4MB/s5507.8MB/s  5608.4MB/s
WRITE: 1923.8MB/s1885.9MB/s  2191.9MB/s
WRITE: 1622.4MB/s1605.4MB/s  1842.2MB/s
READ:  1277.3MB/s1295.8MB/s  1395.2MB/s
WRITE: 1276.9MB/s1295.4MB/s  1394.7MB/s
READ:  1152.6MB/s1137.1MB/s  1216.6MB/s
WRITE: 1152.2MB/s1137.6MB/s  1216.2MB/s
#jobs4 
READ:  8720.4MB/s7301.7MB/s  7896.2MB/s
READ:  7510.3MB/s6690.1MB/s  6456.2MB/s
WRITE: 2211.6MB/s1930.8MB/s  2713.9MB/s
WRITE: 2002.2MB/s1629.8MB/s  2227.7MB/s
READ:  1657.8MB/s1437.1MB/s  1765.8MB/s
WRITE: 1651.7MB/s1432.7MB/s  1759.3MB/s
READ:  1467.7MB/s1201.7MB/s  1523.5MB/s
WRITE: 1462.3MB/s1197.3MB/s  1517.9MB/s
#jobs5 
READ:  7791.9MB/s6852.7MB/s  7487.9MB/s
READ:  6214.6MB/s6449.6MB/s  7106.5MB/s
WRITE: 2017.9MB/s1978.1MB/s  2221.5MB/s
WRITE: 1913.1MB/s1664.9MB/s  1985.8MB/s
READ:  1417.6MB/s1447.7MB/s  1558.8MB/s
WRITE: 1419.8MB/s1449.3MB/s  1561.2MB/s
READ:  1336.9MB/s1234.1MB/s  1404.7MB/s
WRITE: 1338.2MB/s1236.9MB/s  1406.8MB/s
#jobs6 
READ:  8680.9MB/s8500.0MB/s  7116.3MB/s
READ:  7329.4MB/s6580.7MB/s  6476.2MB/s
WRITE: 2121.4MB/s1918.6MB/s  2472.8MB/s
WRITE: 1936.8MB/s1826.9MB/s  2106.8MB/s
READ:  1559.9MB/s1506.3MB/s  1643.6MB/s
WRITE: 1554.7MB/s1501.2MB/s  1637.2MB/s
READ:  1459.7MB/s1258.9MB/s  1502.6MB/s
WRITE: 1454.8MB/s1254.6MB/s  1497.5MB/s
#jobs7 
READ:  9170.0MB/s7905.2MB/s  8043.9MB/s
READ:  6412.7MB/s6792.7MB/s  6457.8MB/s
WRITE: 2042.4MB/s1972.5MB/s  2400.6MB/s
WRITE: 1938.8MB/s1808.7MB/s  2152.6MB/s
READ:  1634.9MB/s1505.8MB/s  1746.4MB/s
WRITE: 1640.1MB/s1511.4MB/s  1753.7MB/s
READ:  1407.9MB/s1239.1MB/s  1480.8MB/s
WRITE: 1413.8MB/s1245.2MB/s  1486.1MB/s
#jobs8 
READ:  8563.4MB/s8106.7MB/s  7696.3MB/s
READ:  6909.1MB/s5790.5MB/s  6537.7MB/s
WRITE: 2040.3MB/s2061.2MB/s  2481.7MB/s
WRITE: 1993.5MB/s1859.4MB/s  2171.5MB/s
READ:  1691.6MB/s1585.8MB/s  1749.9MB/s
WRITE: 1686.3MB/s 

Re: zram: per-cpu compression streams

2016-03-27 Thread Minchan Kim
Hi Sergey,

On Fri, Mar 25, 2016 at 10:47:06AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (03/25/16 08:41), Minchan Kim wrote:
> [..]
> > >  Test #10 iozone -t 10 -R -r 80K -s 0M -I +Z
> > >       Initial write    3213973.56   2731512.62   4416466.25*
> > >             Rewrite    3066956.44*  2693819.50    332671.94
> > >                Read    7769523.25*  2681473.75    462840.44
> > >             Re-read    5244861.75   5473037.00*   382183.03
> > >        Reverse Read    7479397.25*  4869597.75    374714.06
> > >         Stride read    5403282.50*  5385083.75    382473.44
> > >         Random read    5131997.25   5176799.75*   380593.56
> > >      Mixed workload    3998043.25   4219049.00*  1645850.45
> > >        Random write    3452832.88   3290861.69   3588531.75*
> > >              Pwrite    3757435.81   2711756.47   4561807.88*
> > >               Pread    2743595.25*  2635835.00    412947.98
> > >              Fwrite   16076549.00  16741977.25* 14797209.38
> > >               Fread   23581812.62* 21664184.25   5064296.97
> > >  =  real   0m44.490s   0m44.444s   0m44.609s
> > >  =  user    0m0.054s    0m0.049s    0m0.055s
> > >  =   sys    0m0.037s    0m0.046s    0m0.148s
> > >  
> > >  
> > >  so when the number of active tasks becomes larger than the number
> > >  of online CPUs, iozone reports data that is a bit hard to understand. I
> > >  can assume that since we now keep preemption disabled longer in the
> > >  write path, a concurrent operation (READ or WRITE) cannot preempt
> > >  current anymore... slightly suspicious.
> > >  
> > >  the other hard-to-understand thing is why the READ-only tests show
> > >  such huge jitter. READ-only tests don't depend on streams, they
> > >  don't even use them; we supply compressed data directly to the
> > >  decompression api.
> > >  
> > >  maybe it's better to retire iozone and never use it again.
> > >  
> > >  
> > >  "118 insertions(+), 238 deletions(-)" the patches remove a big
> > >  pile of code.
> > 
> > First of all, I appreciate you very much!
> 
> thanks!
> 
> > At a glance, on write workload, huge win but worth to investigate
> > how such fluctuation/regression happens on read-related test
> > (read and mixed workload).
> 
> yes, was going to investigate in more details but got interrupted,
> will return back to it today/tomorrow.
> 
> > Could you send your patchset? I will test it.
> 
> oh, sorry, sure! attached (because it's not a real patch submission
> yet, but they look more or less ready I guess).
> 
> patches are against next-20160324.

Thanks, I tested your patch with fio.
My laptop is 8G ram, 4 CPU.
job file is here.

= 
[global]
bs=4k
ioengine=sync
direct=1
size=100m
numjobs=${NUMJOBS}
group_reporting
buffer_compress_percentage=50
filename=/dev/zram0
loops=10

[seq-read]
rw=read
stonewall

[rand-read]
rw=randread
stonewall

[seq-write]
rw=write
stonewall

[rand-write]
rw=randwrite
stonewall

[mixed-seq]
rw=rw
stonewall

[mixed-rand]
rw=randrw
stonewall
=

= old(ie, spinlock) version =

1) NR_PROCESS:8 NR_STREAM: 1

seq-read: (groupid=0, jobs=8): err= 0: pid=23148: Mon Mar 28 12:07:15 2016
  read : io=8000.0MB, bw=5925.1MB/s, iops=1517.4K, runt=  1350msec
rand-read: (groupid=1, jobs=8): err= 0: pid=23156: Mon Mar 28 12:07:15 2016
  read : io=8000.0MB, bw=4889.1MB/s, iops=1251.9K, runt=  1636msec
seq-write: (groupid=2, jobs=8): err= 0: pid=23164: Mon Mar 28 12:07:15 2016
  write: io=8000.0MB, bw=914898KB/s, iops=228724, runt=  8954msec
rand-write: (groupid=3, jobs=8): err= 0: pid=23172: Mon Mar 28 12:07:15 2016
  write: io=8000.0MB, bw=913368KB/s, iops=228342, runt=  8969msec
mixed-seq: (groupid=4, jobs=8): err= 0: pid=23180: Mon Mar 28 12:07:15 2016
  read : io=4003.1MB, bw=881152KB/s, iops=220287, runt=  4653msec
mixed-rand: (groupid=5, jobs=8): err= 0: pid=23189: Mon Mar 28 12:07:15 2016
  read : io=4003.5MB, bw=837491KB/s, iops=209372, runt=  4895msec


2) NR_PROCESS:8 NR_STREAM: 8

seq-read: (groupid=0, jobs=8): err= 0: pid=23248: Mon Mar 28 12:07:57 2016
  read : io=8000.0MB, bw=5847.1MB/s, iops=1497.8K, runt=  1368msec
rand-read: (groupid=1, jobs=8): err= 0: pid=23256: Mon Mar 28 12:07:57 2016
  read : io=8000.0MB, bw=4778.1MB/s, iops=1223.5K, runt=  1674msec
seq-write: (groupid=2, jobs=8): err= 0: pid=23264: Mon Mar 28 12:07:57 2016
  write: io=8000.0MB, bw=1644.7MB/s, iops=420879, runt=  4866msec
rand-write: (groupid=3, jobs=8): err= 0: pid=23272: Mon Mar 28 12:07:57 2016
  write: io=8000.0MB, bw=1507.5MB/s, iops=385905, runt=  5307msec
mixed-seq: (groupid=4, jobs=8): err= 0: pid=23280: Mon Mar 28 12:07:57 2016
  read : io=4003.1MB, bw=1225.1MB/s, iops=313839, runt=  3266msec
mixed-rand: (groupid=5, jobs=8): err= 0: pid=23288: Mon Mar 28 12:07:57 2016
  read : io=4003.5MB, bw=1098.4MB/s, iops=281097, runt=  3646msec


3) NR_PROCESS:8 NR_STREAM: 16

seq-read: (groupid=0, job

Re: zram: per-cpu compression streams

2016-03-24 Thread Sergey Senozhatsky
Hello Minchan,

On (03/25/16 08:41), Minchan Kim wrote:
[..]
> >  Test #10 iozone -t 10 -R -r 80K -s 0M -I +Z
> >Initial write3213973.56  2731512.62  4416466.25*
> >  Rewrite3066956.44* 2693819.50   332671.94
> > Read7769523.25* 2681473.75   462840.44
> >  Re-read5244861.75  5473037.00*  382183.03
> > Reverse Read7479397.25* 4869597.75   374714.06
> >  Stride read5403282.50* 5385083.75   382473.44
> >  Random read5131997.25  5176799.75*  380593.56
> >   Mixed workload3998043.25  4219049.00* 1645850.45
> > Random write3452832.88  3290861.69  3588531.75*
> >   Pwrite3757435.81  2711756.47  4561807.88*
> >Pread2743595.25* 2635835.00   412947.98
> >   Fwrite   16076549.00 16741977.25*14797209.38
> >Fread   23581812.62*21664184.25  5064296.97
> >  =  real 0m44.490s   0m44.444s   0m44.609s
> >  =  user  0m0.054s0m0.049s0m0.055s
> >  =   sys  0m0.037s0m0.046s0m0.148s
> >  
> >  
> >  so when the number of active tasks becomes larger than the number
> >  of online CPUs, iozone reports data that is a bit hard to understand. I
> >  can assume that since we now keep preemption disabled longer
> >  in the write path, a concurrent operation (READ or WRITE) cannot preempt
> >  current anymore... slightly suspicious.
> >  
> >  the other hard-to-understand thing is why READ-only tests have
> >  such huge jitter. READ-only tests don't depend on streams, they
> >  don't even use them; we supply compressed data directly to the
> >  decompression api.
> >  
> >  maybe it's better to retire iozone and never use it again.
> >  
> >  
> >  "118 insertions(+), 238 deletions(-)" the patches remove a big
> >  pile of code.
> 
> First of all, I appreciate you very much!

thanks!

> At a glance, on the write workload it's a huge win, but it's worth
> investigating how such fluctuation/regression happens on the read-related
> tests (read and mixed workload).

yes, was going to investigate in more detail but got interrupted,
will get back to it today/tomorrow.

> Could you send your patchset? I will test it.

oh, sorry, sure! attached (because it's not a real patch submission
yet, but they look more or less ready I guess).

patches are against next-20160324.

-ss
>From 6bf369f2180dad1c8013a4847ec09d3b9056e910 Mon Sep 17 00:00:00 2001
From: Sergey Senozhatsky 
Subject: [PATCH 1/2] zsmalloc: require GFP in zs_malloc()

Pass GFP flags to zs_malloc() instead of using fixed ones (set
during pool creation), so we can be more flexible. Apart from
that, this also aligns the zs_malloc() interface with zspool/zbud.

Signed-off-by: Sergey Senozhatsky 
---
 drivers/block/zram/zram_drv.c |  2 +-
 include/linux/zsmalloc.h      |  2 +-
 mm/zsmalloc.c                 | 15 ++++++---------
 3 files changed, 8 insertions(+), 11 deletions(-)
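
(not part of the diff, just a reference sketch: with this change the gfp mask
travels with each allocation instead of being fixed at pool creation time; the
call-site pattern below simply mirrors the zram_drv.c hunk, with meta->mem_pool
and clen being the usual zram_bvec_write() locals)

	/* caller now picks the flags per allocation */
	handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM);
	if (!handle)
		return -ENOMEM;	/* error handling abbreviated */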

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 370c2f7..9030992 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -717,7 +717,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 			src = uncmem;
 	}
 
-	handle = zs_malloc(meta->mem_pool, clen);
+	handle = zs_malloc(meta->mem_pool, clen, GFP_NOIO | __GFP_HIGHMEM);
 	if (!handle) {
 		pr_err("Error allocating memory for compressed page: %u, size=%zu\n",
 			index, clen);
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
index 34eb160..6d89f8b 100644
--- a/include/linux/zsmalloc.h
+++ b/include/linux/zsmalloc.h
@@ -44,7 +44,7 @@ struct zs_pool;
 struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
 void zs_destroy_pool(struct zs_pool *pool);
 
-unsigned long zs_malloc(struct zs_pool *pool, size_t size);
+unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t flags);
 void zs_free(struct zs_pool *pool, unsigned long obj);
 
 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index e72efb1..19027a1 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -247,7 +247,6 @@ struct zs_pool {
 	struct size_class **size_class;
 	struct kmem_cache *handle_cachep;
 
-	gfp_t flags;	/* allocation flags used when growing pool */
 	atomic_long_t pages_allocated;
 
 	struct zs_pool_stats stats;
@@ -295,10 +294,10 @@ static void destroy_handle_cache(struct zs_pool *pool)
 	kmem_cache_destroy(pool->handle_cachep);
 }
 
-static unsigned long alloc_handle(struct zs_pool *pool)
+static unsigned long alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-		pool->flags & ~__GFP_HIGHMEM);
+		gfp & ~__GFP_HIGHMEM);
 }
 
 static void free_handle(struct zs_pool *pool, unsigned long handle)
@@ -335,7 +334,7 @@ static v

Re: zram: per-cpu compression streams

2016-03-24 Thread Minchan Kim
Hi Sergey,

On Wed, Mar 23, 2016 at 05:18:27PM +0900, Sergey Senozhatsky wrote:
>  ( was "[PATCH] zram: export the number of available comp streams"
>forked from http://marc.info/?l=linux-kernel&m=145860707516861 )
> 
> d'oh sorry, now actually forked.
> 
> 
>  Hello Minchan,
> 
>  forked into a separate thread.
> 
> > On (03/22/16 09:39), Minchan Kim wrote:
> > >   zram_bvec_write()
> > >   {
> > >   *get_cpu_ptr(comp->stream);
> > >zcomp_compress();
> > >zs_malloc()
> > >   put_cpu_ptr(comp->stream);
> > >   }
> > >   
> > >   this, however, makes zsmalloc unhappy. the pool has GFP_NOIO | __GFP_HIGHMEM
> > >   gfp, and GFP_NOIO is ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM. this
> > >   __GFP_DIRECT_RECLAIM is in conflict with per-cpu streams, because
> > >   per-cpu streams require disabled preemption (up until we copy the stream
> > >   buffer to the zspage). so what options do we have here... off the top of
> > >   my head (w/o a lot of thinking)...
> >  
> >  Indeed.
> ...
> >  How about this?
> >  
> >  zram_bvec_write()
> >  {
> >  retry:
> >  *get_cpu_ptr(comp->stream);
> >  zcomp_compress();
> >  handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
> >  if (!handle) {
> >  put_cpu_ptr(comp->stream);
> >  handle = zs_malloc(gfp);
> >  goto retry;
> >  }
> >  put_cpu_ptr(comp->stream);
> >  }
> 
>  interesting. the retry jump should go higher: we have "user_mem = kmap_atomic(page)",
>  which we unmap right after compression, because a) we don't need the
>  uncompressed memory anymore and b) zs_malloc() can sleep and we can't have an
>  atomic mapping around. the nasty thing here is is_partial_io(). we need to re-do
>  
>   if (is_partial_io(bvec))
>           memcpy(uncmem + offset, user_mem + bvec->bv_offset,
>                  bvec->bv_len);
>  
>  once again in the worst case.
>  
>  so zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) so far can cause
>  a double memcpy() and double compression. just to outline this.
>  
>  
>  the test.
>  
>  I executed a number of iozone tests, on each iteration re-creating zram
>  device (3GB, LZO, EXT4. the box has 4 x86_64 CPUs).
>  
>  $DEVICE_SZ=3G
>  $FREE_SPACE is 10% of $DEVICE_SZ
>  time ./iozone -t $i -R -r $((8*$i))K -s $((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M -I +Z
>  
>  
>  columns:
>  
> TEST   MAX_STREAMS 4   MAX_STREAMS 8  PER_CPU STREAMS
>  
>  
>  Test #1 iozone -t 1 -R -r 8K -s 2764M -I +Z
>Initial write 853492.31*  835868.50   839789.56
>  Rewrite1642073.88  1657255.75  1693011.50*
> Read3384044.00* 3218727.25  3269109.50
>  Re-read3389794.50* 3243187.00  3267422.25
> Reverse Read3209805.75* 3082040.00  3107957.25
>  Stride read3100144.50* 2972280.25  2923155.25
>  Random read2992249.75* 2874605.00  2854824.25
>   Mixed workload2992274.75* 2878212.25  2883840.00
> Random write1471800.00  1452346.50  1515678.75*
>   Pwrite 802083.00   801627.31   820251.69*
>Pread3443495.00* 3308659.25  3302089.00
>   Fwrite1880446.88  1838607.50  1909490.00*
>Fread3479614.75  3091634.75  6442964.50*
>  =  real  1m4.170s1m4.513s1m4.123s
>  =  user  0m0.559s0m0.518s0m0.511s
>  =   sys 0m18.766s   0m19.264s   0m18.641s
>  
>  
>  Test #2 iozone -t 2 -R -r 16K -s 1228M -I +Z
>Initial write2102532.12  2051809.19  2419072.50*
>  Rewrite2217024.25  2250930.00  3681559.00*
> Read7716933.25  7898759.00  8345507.75*
>  Re-read7748487.75  7765282.25  8342367.50*
> Reverse Read7415254.25  7552637.25  7822691.75*
>  Stride read7041909.50  7091049.25  7401273.00*
>  Random read6205044.25  673.50  7232104.25*
>   Mixed workload4582990.00  5271651.50  5361002.88*
> Random write2591893.62  2513729.88  3660774.38*
>   Pwrite1873876.75  1909758.69  2087238.81*
>Pread4669850.00  4651121.56  4919588.44*
>   Fwrite1937947.25  1940628.06  2034251.25*
>Fread9930319.00  9970078.00* 9831422.50
>  =  real 0m53.844s   0m53.607s   0m52.528s
>  =  user  0m0.273s0m0.289s0m0.280s
>  =   sys 0m16.595s   0m16.478s   0m14.072s
>  
>  
>  Test #3 iozone -t 3 -R -r 24K -s 716M -I +Z
>Initial write  

Re: zram: per-cpu compression streams

2016-03-23 Thread Sergey Senozhatsky
 ( was "[PATCH] zram: export the number of available comp streams"
   forked from http://marc.info/?l=linux-kernel&m=145860707516861 )

d'oh sorry, now actually forked.


 Hello Minchan,

 forked into a separate thread.

> On (03/22/16 09:39), Minchan Kim wrote:
> >   zram_bvec_write()
> >   {
> > *get_cpu_ptr(comp->stream);
> >  zcomp_compress();
> >  zs_malloc()
> > put_cpu_ptr(comp->stream);
> >   }
> >   
> >   this, however, makes zsmalloc unhappy. the pool has GFP_NOIO | __GFP_HIGHMEM
> >   gfp, and GFP_NOIO is ___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM. this
> >   __GFP_DIRECT_RECLAIM is in conflict with per-cpu streams, because
> >   per-cpu streams require disabled preemption (up until we copy the stream
> >   buffer to the zspage). so what options do we have here... off the top of
> >   my head (w/o a lot of thinking)...
>  
>  Indeed.
...
>  How about this?
>  
>  zram_bvec_write()
>  {
>  retry:
>  *get_cpu_ptr(comp->stream);
>  zcomp_compress();
>  handle = zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
>  if (!handle) {
>  put_cpu_ptr(comp->stream);
>  handle = zs_malloc(gfp);
>  goto retry;
>  }
>  put_cpu_ptr(comp->stream);
>  }

 interesting. the retry jump should go higher: we have "user_mem = kmap_atomic(page)",
 which we unmap right after compression, because a) we don't need the
 uncompressed memory anymore and b) zs_malloc() can sleep and we can't have an
 atomic mapping around. the nasty thing here is is_partial_io(). we need to re-do
 
	if (is_partial_io(bvec))
		memcpy(uncmem + offset, user_mem + bvec->bv_offset,
		       bvec->bv_len);
 
 once again in the worst case.
 
 so zs_malloc((gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN) so far can cause
 a double memcpy() and double compression. just to outline this.
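
 to make the "retry should go higher" point concrete, a rough sketch of the
 resulting flow (not the actual patch: zstrm, uncmem, clen and meta are the
 usual zram_bvec_write() locals, and compression/error handling is trimmed):
 
	gfp_t gfp = GFP_NOIO | __GFP_HIGHMEM;
	unsigned long handle = 0;

retry:
	user_mem = kmap_atomic(page);
	if (is_partial_io(bvec))
		memcpy(uncmem + offset, user_mem + bvec->bv_offset,
		       bvec->bv_len);

	zstrm = *get_cpu_ptr(comp->stream);	/* preemption off from here */
	zcomp_compress();			/* may run twice in the worst case */
	kunmap_atomic(user_mem);

	if (!handle)				/* opportunistic, must not sleep */
		handle = zs_malloc(meta->mem_pool, clen,
				(gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_NOWARN);
	if (!handle) {
		put_cpu_ptr(comp->stream);	/* preemption back on */
		handle = zs_malloc(meta->mem_pool, clen, gfp);	/* may sleep */
		goto retry;			/* double memcpy()/compression */
	}

	/* copy zstrm->buffer into the zspage, then */
	put_cpu_ptr(comp->stream);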
 
 
 the test.
 
 I executed a number of iozone tests, on each iteration re-creating zram
 device (3GB, LZO, EXT4. the box has 4 x86_64 CPUs).
 
 $DEVICE_SZ=3G
 $FREE_SPACE is 10% of $DEVICE_SZ
 time ./iozone -t $i -R -r $((8*$i))K -s $((($DEVICE_SZ/$i - $FREE_SPACE)/(1024*1024)))M -I +Z
 
 
 columns:
 
TEST   MAX_STREAMS 4   MAX_STREAMS 8  PER_CPU STREAMS
 
 
 Test #1 iozone -t 1 -R -r 8K -s 2764M -I +Z
   Initial write 853492.31*  835868.50   839789.56
 Rewrite1642073.88  1657255.75  1693011.50*
Read3384044.00* 3218727.25  3269109.50
 Re-read3389794.50* 3243187.00  3267422.25
Reverse Read3209805.75* 3082040.00  3107957.25
 Stride read3100144.50* 2972280.25  2923155.25
 Random read2992249.75* 2874605.00  2854824.25
  Mixed workload2992274.75* 2878212.25  2883840.00
Random write1471800.00  1452346.50  1515678.75*
  Pwrite 802083.00   801627.31   820251.69*
   Pread3443495.00* 3308659.25  3302089.00
  Fwrite1880446.88  1838607.50  1909490.00*
   Fread3479614.75  3091634.75  6442964.50*
 =  real  1m4.170s1m4.513s1m4.123s
 =  user  0m0.559s0m0.518s0m0.511s
 =   sys 0m18.766s   0m19.264s   0m18.641s
 
 
 Test #2 iozone -t 2 -R -r 16K -s 1228M -I +Z
   Initial write2102532.12  2051809.19  2419072.50*
 Rewrite2217024.25  2250930.00  3681559.00*
Read7716933.25  7898759.00  8345507.75*
 Re-read7748487.75  7765282.25  8342367.50*
Reverse Read7415254.25  7552637.25  7822691.75*
 Stride read7041909.50  7091049.25  7401273.00*
 Random read6205044.25  673.50  7232104.25*
  Mixed workload4582990.00  5271651.50  5361002.88*
Random write2591893.62  2513729.88  3660774.38*
  Pwrite1873876.75  1909758.69  2087238.81*
   Pread4669850.00  4651121.56  4919588.44*
  Fwrite1937947.25  1940628.06  2034251.25*
   Fread9930319.00  9970078.00* 9831422.50
 =  real 0m53.844s   0m53.607s   0m52.528s
 =  user  0m0.273s0m0.289s0m0.280s
 =   sys 0m16.595s   0m16.478s   0m14.072s
 
 
 Test #3 iozone -t 3 -R -r 24K -s 716M -I +Z
   Initial write3036567.50  2998918.25  3683853.00*
 Rewrite3402447.88  3415685.88  5054705.38*
Read   11767413.00*11133789.50 11246497.25
 Re-read   11797680.50*11092592.00 11277382.00
Reverse Read   10828320.00*10157665.50 10749055.00
 Stride rea