Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Karan Singh
Hi

What type of clients do you have?

- Are they physical Linux hosts or VMs mounting Ceph RBD or CephFS?
- Or are they simply OpenStack / cloud instances using Ceph as Cinder volumes 
or something like that?


- Karan -

 On 28 Jul 2015, at 11:53, Shneur Zalman Mattern shz...@eimsys.co.il wrote:
 
 We've built a Ceph cluster:
 3 mon nodes (one of them combined with the MDS)
 3 OSD nodes (each with 10 OSDs + 2 SSDs for journaling)
 switch, 24 ports x 10G
 10 gigabit - for the public network
 20 gigabit bonding - between OSDs
 Ubuntu 12.04.05
 Ceph 0.87.2
 -
 Clients have:
 10 gigabit for the Ceph connection
 CentOS 6.6 with kernel 3.19.8 and the CephFS kernel module
 
 
 
 [... full fio results and cluster status snipped - see the original message below ...]

[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
We've built a Ceph cluster:
3 mon nodes (one of them combined with the MDS)
3 OSD nodes (each with 10 OSDs + 2 SSDs for journaling)
switch, 24 ports x 10G
10 gigabit - for the public network
20 gigabit bonding - between OSDs
Ubuntu 12.04.05
Ceph 0.87.2
-
Clients have:
10 gigabit for the Ceph connection
CentOS 6.6 with kernel 3.19.8 and the CephFS kernel module



== fio-2.0.13 seqwrite, bs=1M, filesize=10G, parallel-jobs=16 ===
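For reference, the workload corresponds roughly to this fio job file (a sketch - the directory and job name are assumptions):

[global]
; sequential write, 1 MiB blocks, one 10 GiB file per job, 16 jobs in parallel
rw=write
bs=1M
size=10G
numjobs=16
; assumed CephFS mount point on the client
directory=/mnt/cephfs

[trivial-readwrite]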
Single client:


Starting 16 processes

...below is the info for just 1 job
trivial-readwrite-grid01: (groupid=0, jobs=1): err= 0: pid=10484: Tue Jul 28 
13:26:24 2015
  write: io=10240MB, bw=78656KB/s, iops=76 , runt=133312msec
slat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
clat (usec): min=1 , max=68 , avg= 3.61, stdev= 1.99
 lat (msec): min=1 , max=117 , avg=13.01, stdev=12.57
clat percentiles (usec):
 |  1.00th=[1],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[9], 99.50th=[   10], 99.90th=[   23], 99.95th=[   28],
 | 99.99th=[   62]
bw (KB/s)  : min=35790, max=318215, per=6.31%, avg=78816.91, stdev=26397.76
lat (usec) : 2=1.33%, 4=54.43%, 10=43.54%, 20=0.56%, 50=0.11%
lat (usec) : 100=0.03%
  cpu  : usr=0.89%, sys=12.85%, ctx=58248, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=1219.8MB/s, minb=78060KB/s, maxb=78655KB/s, 
mint=133312msec, maxt=134329msec

+
Two clients:
+
...below is the info for just 1 job
trivial-readwrite-gridsrv: (groupid=0, jobs=1): err= 0: pid=10605: Tue Jul 28 
14:05:59 2015
  write: io=10240MB, bw=43154KB/s, iops=42 , runt=242984msec
slat (usec): min=991 , max=285653 , avg=23716.12, stdev=23960.60
clat (usec): min=1 , max=65 , avg= 3.67, stdev= 2.02
 lat (usec): min=994 , max=285664 , avg=23723.39, stdev=23962.22
clat percentiles (usec):
 |  1.00th=[2],  5.00th=[2], 10.00th=[2], 20.00th=[2],
 | 30.00th=[3], 40.00th=[3], 50.00th=[3], 60.00th=[4],
 | 70.00th=[4], 80.00th=[5], 90.00th=[5], 95.00th=[6],
 | 99.00th=[8], 99.50th=[   10], 99.90th=[   28], 99.95th=[   37],
 | 99.99th=[   56]
bw (KB/s)  : min=20630, max=276480, per=6.30%, avg=43328.34, stdev=21905.92
lat (usec) : 2=0.84%, 4=49.45%, 10=49.13%, 20=0.37%, 50=0.18%
lat (usec) : 100=0.03%
  cpu  : usr=0.49%, sys=5.68%, ctx=31428, majf=0, minf=9
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=10240/d=0, short=r=0/w=0/d=0

...what's above repeats 16 times...

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=687960KB/s, minb=42997KB/s, maxb=43270KB/s, 
mint=242331msec, maxt=243869msec

- And almost the same(?!) aggregated result from the second client: 
-

Run status group 0 (all jobs):
  WRITE: io=163840MB, aggrb=679401KB/s, minb=42462KB/s, maxb=42852KB/s, 
mint=244697msec, maxt=246941msec

- If I summarize: -
aggrb1 + aggrb2 = 687960KB/s + 679401KB/s = 1367MB/s

It looks like the same total bandwidth as from just one client (aggrb=1219.8MB/s), 
only divided between the two. Why?
Question: if I connect 12 client nodes, will each one be able to write at only 
~100MB/s?
Perhaps I need to scale our Ceph out to 15 (how many?) OSD nodes so that it can 
serve 2 clients at 1.3GB/s each (the bandwidth of a 10GigE NIC) - or not?



health HEALTH_OK
 monmap e1: 3 mons at 
{mon1=192.168.56.251:6789/0,mon2=192.168.56.252:6789/0,mon3=192.168.56.253:6789/0},
 election epoch 140, quorum 0,1,2 mon1,mon2,mon3
 mdsmap e12: 1/1/1 up {0=mon3=up:active}
 osdmap e832: 31 osds: 30 up, 30 in
  pgmap v106186: 6144 pgs, 3 pools, 2306 GB data, 1379 kobjects
4624 GB used, 104 TB / 109 TB avail
6144 active+clean


Perhaps I don't understand something in the Ceph architecture? I thought that:

Each spindle disk can write ~100MB/s, and we have 10 SAS disks in each node = 
an aggregated write speed of ~900MB/s (because of striping etc.).
And we have 3 OSD nodes, and objects are striped across all 30 OSDs - I thought 
that's also aggregatable and we'd get something around 2.5GB/s, but no...
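(For the arithmetic: 30 disks x ~100MB/s is ~2.7GB/s of raw spindle bandwidth, 
but with size=2 every client byte is written twice, so the replica-adjusted 
ceiling is about 2700/2 = 1350MB/s - close to the ~1367MB/s the two clients 
reached together.)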

[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi,

But my question is: why is the speed divided between the clients?
And how many OSD nodes, OSD daemons, and PGs do I have to add to the cluster
so that each CephFS client can write at its full network speed (10Gbit/s ~ 
1.2GB/s)?



From: Johannes Formann mlm...@formann.de
Sent: Tuesday, July 28, 2015 12:46 PM
To: Shneur Zalman Mattern
Subject: Re: [ceph-users] Did maximum performance reached?

Hi,

size=3 would decrease your performance. But with size=2 your results are not 
bad either:
Math:
size=2 means each write is written 4 times (2 copies, each written first to the 
journal and later to disk). Calculating with 1,300MB/s of client bandwidth, 
that means:

2 (size) * 1300 MB/s / 6 (SSD) = 433 MB/s per SSD
2 (size) * 1300 MB/s / 30 (HDD) = 87 MB/s per HDD


greetings

Johannes

 [... earlier messages quoted in full snipped; see the individual messages in this digest ...]

[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi!

And so, by your math,
I'd need size = number of OSDs, i.e. 30 replicas, for my 120TB cluster - to 
meet my demands.
And get 4TB of real storage capacity at a price of $3000 per TB? Is that a joke?

All the best,
Shneur

[... Johannes' reply quoted in full snipped - see it earlier in this digest ...]



[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi, Karan!

They're physical CentOS clients with CephFS mounted via the kernel module (kernel 4.1.3).
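For reference, the kernel mount takes roughly this form (monitor address from 
the cluster status above; the mount point and credentials are assumptions):

mount -t ceph 192.168.56.251:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret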

Thanks

[... Karan's question and the original message quoted in full snipped ...]


[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Oh, now I have to cry :-)
not because they're not SSDs... they're SAS2 HDDs.

Because I need to build something for 140 clients... 4200 OSDs
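(That figure follows from the same scaling: if ~30 OSDs are what it takes to 
saturate one 10GbE client, then 140 clients x 30 OSDs is ~4200 OSDs.)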

:-(

Looks like I can raise my performance with SSDs, but I need a huge capacity, 
~2PB.
Perhaps a cache tiering pool could save my money, but I've read here that it's 
slower than people think...

:-(

Why is Lustre more performant? It has the same HDDs, no?

 
 



[ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
Hi, Johannes (that's my grandpa's name),

The size is 2. Do you really think the number of replicas can increase 
performance?
At http://ceph.com/docs/master/architecture/ it is written: "Note: Striping is 
independent of object replicas. Since CRUSH replicates objects across OSDs, 
stripes get replicated automatically."

OK, I'll check it.
Regards, Shneur

[... Johannes' message and the original post with full fio results quoted in full snipped ...]

Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Johannes Formann
Hello,

What is the "size" parameter of your pool?
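For reference, it can be checked with (assuming the CephFS data pool has the 
default name "data"):

ceph osd pool get data size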

Some math to show the impact:
size=3 means each write is written 6 times (3 copies, each written first to the 
journal and later to disk). Calculating with 1,300MB/s of client bandwidth, 
that means:

3 (size) * 1300 MB/s / 6 (SSD) = 650 MB/s per SSD
3 (size) * 1300 MB/s / 30 (HDD) = 130 MB/s per HDD

If you use size=3, the results are as good as one can expect. (Even with 
size=2 the results won't be bad.)

greetings

Johannes

 [... original message with full fio results quoted in full snipped ...]

Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Johannes Formann
The speed is divided because it's fair :)
You've reached the limit your hardware (I guess the SSDs) can deliver.

For 2 clients each doing 1200 MB/s you'll basically have to double the number 
of OSDs.
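Extending the per-device math from my earlier mail to doubled hardware 
(6 OSD nodes, 12 journal SSDs, 60 HDDs - assuming the per-node layout stays 
the same):

2 (size) * 2400 MB/s / 12 (SSD) = 400 MB/s per SSD
2 (size) * 2400 MB/s / 60 (HDD) = 80 MB/s per HDD

i.e. about the same per-device load as today, for twice the aggregate client 
bandwidth.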

greetings

Johannes

 [... full quoted thread history snipped ...]

Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread John Spray



On 28/07/15 11:17, Shneur Zalman Mattern wrote:

[...]
Why is Lustre more performant? It has the same HDDs, no?


Lustre isn't (A) creating two copies of your data, and it's (B) not 
executing disk writes as atomic transactions (i.e. there is no data writeahead 
log).


The tradeoff for A is that while a Lustre system typically requires an 
expensive dual-ported RAID controller, Ceph doesn't.  You take the money 
you saved on RAID controllers and spend it on a larger number of 
cheaper hosts and drives.  If you've already bought the Lustre-oriented 
hardware, then my advice would be to run Lustre on it :-)


The efficient way of handling B is to use SSD journals for your OSDs.  
Typical Ceph servers have one SSD per approx. 4 OSDs.
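For readers following along: journals are pointed at the SSDs via ceph.conf, 
roughly like this (a sketch - the partition labels are assumptions):

[osd]
; one dedicated SSD journal partition per OSD
osd journal = /dev/disk/by-partlabel/journal-$id
; journal size in MB
osd journal size = 10240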


John


Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread John Spray



On 28/07/15 11:53, John Spray wrote:



[... previous message quoted in full snipped ...]


Oh, I've just re-read the original message in this thread, and you're 
already using SSD journals.


So I think the only point of confusion was that you weren't dividing 
your expected bandwidth number by the number of replicas, right?


 Each spindle disk can write ~100MB/s, and we have 10 SAS disks in 
each node = aggregated write speed ~900MB/s (because of striping etc.)
And we have 3 OSD nodes, and objects are striped across all 30 OSDs - I 
thought that's also aggregatable and we'd get something around 2.5GB/s, 
but no...


Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 = 
1350MB/s -- so I think you're actually doing pretty well with your 
1367MB/s number.


John







Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Shneur Zalman Mattern
As I understand it now, in this case (30 disks) the 10Gbit network is not the 
bottleneck!

With another HW config ( + 5 OSD nodes = + 50 disks ) I'd get ~3400MB/s,
and 3 clients could work at full bandwidth, yes?
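(Checking with John's formula: (900MB/s * 8 nodes)/2 replicas = 3600MB/s - so 
yes, roughly three saturated 10GbE clients.)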

OK, let's try ! ! ! ! ! ! !

Perhaps somebody has more suggestions for increasing performance:
1. NVMe journals
2. btrfs as the OSD filesystem
3. SSD-based OSDs
4. 15K HDDs
5. RAID 10 on each OSD node
...
everybody - brainstorm!!!
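One more datapoint worth collecting before buying hardware is the raw RADOS 
write bandwidth, with CephFS out of the path - for example (pool name assumed):

rados bench -p data 60 write -t 16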

John:
Your expected bandwidth (with size=2 replicas) will be (900MB/s * 3)/2 =
1350MB/s -- so I think you're actually doing pretty well with your
1367MB/s number.












Re: [ceph-users] Did maximum performance reached?

2015-07-28 Thread Udo Lembke
Hi,

On 28.07.2015 12:02, Shneur Zalman Mattern wrote:
 Hi!

 And so, by your math,
 I'd need size = number of OSDs, i.e. 30 replicas, for my 120TB cluster -
 to meet my demands
30 replicas is the wrong math! Fewer replicas = more speed (because of
less writing); more replicas = less speed.
For data safety, a replica count of 3 is recommended.
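The replica count can be changed per pool, e.g. (pool name assumed):

ceph osd pool set data size 3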


Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com