Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Udo Lembke
Hi,
I have done the test again in a cleaner way.

Same pool, same VM, different hosts (qemu 2.4 + qemu 2.2) but same hardware.
But only one run!

The biggest difference is due to the cache settings:

qemu2.4 cache=writethrough  iops=3823 bw=15294KB/s
qemu2.4 cache=writeback  iops=8837 bw=35348KB/s
qemu2.2 cache=writethrough  iops=2996 bw=11988KB/s
qemu2.2 cache=writeback  iops=7980 bw=31921KB/s

iothread doesn't change anything here, because only one disk is used.
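
For reference, the cache mode is just a per-disk property of the guest. Outside
Proxmox, a rough hand-written equivalent on a plain QEMU command line (pool and
image names are placeholders, all other options elided) would look something like:

qemu-system-x86_64 ... -drive file=rbd:ceph_test/vm-102-disk-1,format=raw,if=virtio,cache=writeback
qemu-system-x86_64 ... -drive file=rbd:ceph_test/vm-102-disk-1,format=raw,if=virtio,cache=writethrough

In Proxmox the same setting is the cache= option on the virtioN line of the VM
config, as shown in the configs quoted below.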

Test:
fio --time_based --name=benchmark --size=4G --filename=test.bin
--ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1
--verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
--group_reporting
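
Side note: the full logs below show "fio: time_based requires a runtime/timeout
setting", so with this exact command line fio falls back to a size-bound run.
A variant that really runs time-based would add an explicit runtime (300 s here
is just an illustrative value):

fio --time_based --runtime=300 --name=benchmark --size=4G --filename=test.bin \
    --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 \
    --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k \
    --group_reporting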


Udo

On 22.11.2015 23:59, Udo Lembke wrote:
> Hi Zoltan,
> you are right (but these were two running systems...).
>
> I also see a big mistake: "--filename=/mnt/test.bin" (I simply
> copied and pasted without thinking too much :-( )
> The root filesystem is not on Ceph (on both servers),
> so my measurements are not valid!
>
> I will redo the measurements cleanly tomorrow.
>
>
> Udo
>
>
> On 22.11.2015 14:29, Zoltan Arnold Nagy wrote:
>> It would have been more interesting if you had tweaked only one
>> option, as now we can’t be sure which change had what impact… :-)
>>



Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Udo Lembke
Hi Zoltan,
you are right (but these were two running systems...).

I also see a big mistake: "--filename=/mnt/test.bin" (I simply
copied and pasted without thinking too much :-( )
The root filesystem is not on Ceph (on both servers),
so my measurements are not valid!

I will redo the measurements cleanly tomorrow.


Udo


On 22.11.2015 14:29, Zoltan Arnold Nagy wrote:
> It would have been more interesting if you had tweaked only one option,
> as now we can’t be sure which change had what impact… :-)
>
>> On 22 Nov 2015, at 04:29, Udo Lembke wrote:
>>
>> Hi Sean,
>> Haomai is right that qemu can make a huge performance difference.
>>
>> I have done two tests against the same Ceph cluster (different pools,
>> but this should not make any difference).
>> One test with proxmox ve 4 (qemu 2.4, iothread for the device, and
>> cache=writeback) gives 14856 iops.
>> The same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough)
>> gives only 5070 iops.
>>
>> Here are the full results:
>> ### proxmox ve 3.x ###
>> kvm --version
>> QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard
>>
>> VM:
>> virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G
>>
>> root@fileserver:/daten/support/test# fio --time_based
>> --name=benchmark --size=4G --filename=/mnt/test.bin --ioengine=libaio
>> --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0
>> --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
>> --group_reporting
>> fio: time_based requires a runtime/timeout setting
>> benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K,
>> ioengine=libaio, iodepth=128
>> ...
>> fio-2.1.11
>> Starting 4 processes
>> benchmark: Laying out IO file(s) (1 file(s) / 4096MB)
>> Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s]
>> [0/10.6K/0 iops] [eta 00m:00s]
>> benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47
>> 2015
>>   write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec
>> slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26
>> clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17
>>  lat (msec): min=1, max=2755, avg=100.76, stdev=147.54
>> clat percentiles (msec):
>>  |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19], 20.00th=[   28],
>>  | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51], 60.00th=[   63],
>>  | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[  367],
>>  | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[ 1713],
>>  | 99.99th=[ 2573]
>> bw (KB  /s): min=4, max=30726, per=26.90%, avg=5456.84,
>> stdev=3014.45
>> lat (usec) : 750=0.01%, 1000=0.01%
>> lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%
>> lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55%
>> lat (msec) : 2000=0.29%, >=2000=0.03%
>>   cpu  : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30
>>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%,
>> >=64=100.0%
>>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >=64=0.0%
>>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>> >=64=0.1%
>>  issued: total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
>>  latency   : target=0, window=0, percentile=100.00%, depth=128
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s,
>> mint=827178msec, maxt=827178msec
>>
>> Disk stats (read/write):
>> dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824,
>> in_queue=105927128, util=100.00%, aggrios=1/4469640,
>> aggrmerge=0/14788, aggrticks=64/103711096, aggrin_queue=104165356,
>> aggrutil=100.00%
>>   vda: ios=1/4469640, merge=0/14788, ticks=64/103711096,
>> in_queue=104165356, util=100.00%
>>
>> ##
>>
>> ### proxmox ve 4.x ###
>> kvm --version
>> QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c)
>> 2003-2008 Fabrice Bellard
>>
>> grep ceph /etc/pve/qemu-server/102.conf
>> virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G
>>
>> root@fileserver-test:/daten/tv01/test# fio --time_based
>> --name=benchmark --size=4G --filename=/mnt/test.bin --ioengine=libaio
>> --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0
>> --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k
>> --group_reporting
>> fio: time_based requires a runtime/timeout setting
>>
>> benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio,
>> iodepth=128
>> ...
>>
>> fio-2.1.11
>> Starting 4 

Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Alexandre DERUMIER
>> One test with proxmox ve 4 (qemu 2.4, iothread for device, and
>> cache=writeback) gives 14856 iops

Please also note that qemu in proxmox ve 4 is compiled with jemalloc.
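
A quick way to check which allocator a given qemu build uses, assuming the kvm
binary is a real ELF executable rather than a wrapper script and that jemalloc
is linked in dynamically rather than preloaded, is simply:

ldd /usr/bin/kvm | grep -i jemalloc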


- Original Message -
From: "Udo Lembke" 
To: "Sean Redmond" 
Cc: "ceph-users" 
Sent: Sunday, 22 November 2015 04:29:29
Subject: Re: [ceph-users] All SSD Pool - Odd Performance

Hi Sean, 
Haomai is right that qemu can make a huge performance difference. 

I have done two tests against the same Ceph cluster (different pools, but this 
should not make any difference). 
One test with proxmox ve 4 (qemu 2.4, iothread for the device, and cache=writeback) 
gives 14856 iops. 
The same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives only 
5070 iops. 

Here are the full results: 
### proxmox ve 3.x ### 
kvm --version 
QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard 

VM: 
virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G 

root@fileserver:/daten/support/test# fio --time_based --name=benchmark 
--size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0 
--iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 
--rw=randwrite --blocksize=4k --group_reporting 
fio: time_based requires a runtime/timeout setting 
benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=128 
... 
fio-2.1.11 
Starting 4 processes 
benchmark: Laying out IO file(s) (1 file(s) / 4096MB) 
Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s] [0/10.6K/0 
iops] [eta 00m:00s] 
benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47 2015 
write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec 
slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26 
clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17 
lat (msec): min=1, max=2755, avg=100.76, stdev=147.54 
clat percentiles (msec): 
| 1.00th=[ 10], 5.00th=[ 14], 10.00th=[ 19], 20.00th=[ 28], 
| 30.00th=[ 36], 40.00th=[ 43], 50.00th=[ 51], 60.00th=[ 63], 
| 70.00th=[ 81], 80.00th=[ 128], 90.00th=[ 237], 95.00th=[ 367], 
| 99.00th=[ 717], 99.50th=[ 889], 99.90th=[ 1516], 99.95th=[ 1713], 
| 99.99th=[ 2573] 
bw (KB /s): min= 4, max=30726, per=26.90%, avg=5456.84, stdev=3014.45 
lat (usec) : 750=0.01%, 1000=0.01% 
lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74% 
lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55% 
lat (msec) : 2000=0.29%, >=2000=0.03% 
cpu : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1% 
issued : total=r=0/w=4194304/d=0, short=r=0/w=0/d=0 
latency : target=0, window=0, percentile=100.00%, depth=128 

Run status group 0 (all jobs): 
WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s, 
mint=827178msec, maxt=827178msec 

Disk stats (read/write): 
dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824, in_queue=105927128, 
util=100.00%, aggrios=1/4469640, aggrmerge=0/14788, aggrticks=64/103711096, 
aggrin_queue=104165356, aggrutil=100.00% 
vda: ios=1/4469640, merge=0/14788, ticks=64/103711096, in_queue=104165356, 
util=100.00% 

## 

### proxmox ve 4.x ### 
kvm --version 
QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c) 2003-2008 
Fabrice Bellard 

grep ceph /etc/pve/qemu-server/102.conf 
virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G 

root@fileserver-test:/daten/tv01/test# fio --time_based --name=benchmark 
--size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0 
--iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 
--rw=randwrite --blocksize=4k --group_reporting 
fio: time_based requires a runtime/timeout setting 
benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
iodepth=128 
... 
fio-2.1.11 
Starting 4 processes 
Jobs: 4 (f=4): [w(4)] [99.6% done] [0KB/56148KB/0KB /s] [0/14.4K/0 iops] [eta 
00m:01s] 
benchmark: (groupid=0, jobs=4): err= 0: pid=26131: Sun Nov 22 03:51:04 2015 
write: io=0B, bw=59425KB/s, iops=14856, runt=282327msec 
slat (usec): min=6, max=216925, avg=261.78, stdev=1802.78 
clat (msec): min=1, max=330, avg=34.04, stdev=27.78 
lat (msec): min=1, max=330, avg=34.30, stdev=27.87 
clat percentiles (msec): 
| 1.00th=[ 10], 5.00th=[ 13], 10.00th=[ 14], 20.00th=[ 16], 
| 30.00th=[ 18], 40.00th=[ 19], 50.00th=[ 21], 60.00th=[ 24], 
| 70.00th=[ 33], 80.00th=[ 62], 90.00th=[ 81], 95.00th=[ 87], 
| 99.00th=[ 95], 99.50th=[ 100], 99.90th=[ 269], 99.95th=[ 277], 
| 99.99th=[ 297] 
bw (KB /s): min= 3, max=42216, per=25.10%, avg=14917.03, stdev=2990.50 
lat (msec) : 2=0.01%, 4=0.01%, 10=1.13%, 20=45.52%, 50=28.23% 
lat (msec) : 100=24.61%, 

Re: [ceph-users] All SSD Pool - Odd Performance

2015-11-22 Thread Zoltan Arnold Nagy
It would have been more interesting if you had tweaked only one option, as now
we can’t be sure which change had what impact… :-)

> On 22 Nov 2015, at 04:29, Udo Lembke  wrote:
> 
> Hi Sean,
> Haomai is right that qemu can make a huge performance difference.
> 
> I have done two tests against the same Ceph cluster (different pools, but this
> should not make any difference).
> One test with proxmox ve 4 (qemu 2.4, iothread for the device, and
> cache=writeback) gives 14856 iops.
> The same test with proxmox ve 3.4 (qemu 2.2.1, cache=writethrough) gives only
> 5070 iops.
> 
> Here are the full results:
> ### proxmox ve 3.x ###
> kvm --version
> QEMU emulator version 2.2.1, Copyright (c) 2003-2008 Fabrice Bellard
> 
> VM:
> virtio2: ceph_file:vm-405-disk-1,cache=writethrough,backup=no,size=4096G 
> 
> 
> root@fileserver:/daten/support/test# fio --time_based --name=benchmark 
> --size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0 
> --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 
> --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
> fio: time_based requires a runtime/timeout setting
> benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
> iodepth=128
> ...
> fio-2.1.11
> Starting 4 processes
> benchmark: Laying out IO file(s) (1 file(s) / 4096MB)
> Jobs: 1 (f=1): [_(1),w(1),_(2)] [100.0% done] [0KB/40024KB/0KB /s] [0/10.6K/0 
> iops] [eta 00m:00s]
> benchmark: (groupid=0, jobs=4): err= 0: pid=7821: Sun Nov 22 04:07:47 2015
>   write: io=16384MB, bw=20282KB/s, iops=5070, runt=827178msec
> slat (usec): min=0, max=2531.7K, avg=778.68, stdev=12757.26
> clat (usec): min=508, max=2755.2K, avg=99980.14, stdev=146967.17
>  lat (msec): min=1, max=2755, avg=100.76, stdev=147.54
> clat percentiles (msec):
>  |  1.00th=[   10],  5.00th=[   14], 10.00th=[   19], 20.00th=[   28],
>  | 30.00th=[   36], 40.00th=[   43], 50.00th=[   51], 60.00th=[   63],
>  | 70.00th=[   81], 80.00th=[  128], 90.00th=[  237], 95.00th=[  367],
>  | 99.00th=[  717], 99.50th=[  889], 99.90th=[ 1516], 99.95th=[ 1713],
>  | 99.99th=[ 2573]
> bw (KB  /s): min=4, max=30726, per=26.90%, avg=5456.84, stdev=3014.45
> lat (usec) : 750=0.01%, 1000=0.01%
> lat (msec) : 2=0.01%, 4=0.01%, 10=1.11%, 20=10.18%, 50=37.74%
> lat (msec) : 100=26.45%, 250=15.22%, 500=6.66%, 750=1.74%, 1000=0.55%
> lat (msec) : 2000=0.29%, >=2000=0.03%
>   cpu  : usr=0.36%, sys=2.31%, ctx=1148702, majf=0, minf=30
>   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.1%
>  issued: total=r=0/w=4194304/d=0, short=r=0/w=0/d=0
>  latency   : target=0, window=0, percentile=100.00%, depth=128
> 
> Run status group 0 (all jobs):
>   WRITE: io=16384MB, aggrb=20282KB/s, minb=20282KB/s, maxb=20282KB/s, 
> mint=827178msec, maxt=827178msec
> 
> Disk stats (read/write):
> dm-0: ios=0/4483641, merge=0/0, ticks=0/104928824, in_queue=105927128, 
> util=100.00%, aggrios=1/4469640, aggrmerge=0/14788, aggrticks=64/103711096, 
> aggrin_queue=104165356, aggrutil=100.00%
>   vda: ios=1/4469640, merge=0/14788, ticks=64/103711096, in_queue=104165356, 
> util=100.00%
> 
> ##
> 
> ### proxmox ve 4.x ###
> kvm --version
> QEMU emulator version 2.4.0.1 pve-qemu-kvm_2.4-12, Copyright (c) 2003-2008 
> Fabrice Bellard
> 
> grep ceph /etc/pve/qemu-server/102.conf 
> virtio1: ceph_test:vm-102-disk-1,cache=writeback,iothread=on,size=100G
> 
> root@fileserver-test:/daten/tv01/test# fio --time_based --name=benchmark 
> --size=4G --filename=/mnt/test.bin --ioengine=libaio --randrepeat=0 
> --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 
> --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting   
> fio: time_based requires a runtime/timeout setting
>
> benchmark: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, 
> iodepth=128  
> ...   
>   
> fio-2.1.11
> Starting 4 processes
> Jobs: 4 (f=4): [w(4)] [99.6% done] [0KB/56148KB/0KB /s] [0/14.4K/0 iops] [eta 
> 00m:01s]
> benchmark: (groupid=0, jobs=4): err= 0: pid=26131: Sun Nov 22 03:51:04 2015
>   write: io=0B, bw=59425KB/s, iops=14856, runt=282327msec
> slat (usec): min=6, max=216925, avg=261.78, stdev=1802.78
> clat (msec): min=1, max=330, avg=34.04, stdev=27.78
>  lat (msec): min=1, max=330, avg=34.30, stdev=27.87
> clat percentiles (msec):
>  |  1.00th=[   10],  5.00th=[   13], 

[ceph-users] SSD journals killed by VMs generating 500 IOPs (4kB) non-stop for a month, seemingly because of a syslog-ng bug

2015-11-22 Thread Alex Moore
I just had 2 of the 3 SSD journals in my small 3-node cluster fail 
within 24 hours of each other (not fun, although thanks to a replication 
factor of 3x, at least I didn't lose any data). The journals were 128 GB 
Samsung 850 Pros. However I have determined that it wasn't really their 
fault...


This is a small Ceph cluster running just a handful of relatively idle 
Qemu VMs using librbd for storage, and I had originally estimated that 
based on my low expected volume of write IO the Samsung 850 Pro journals 
would last at least 5 years (which would have been plenty). I still 
think that estimate was correct, but the reason they died prematurely 
(in reality they lasted 15 months) seems to have been that a number of 
my VMs had been hammering their disks continuously for almost a month, 
and I only noticed retrospectively after the journals had died. I 
tracked it back to some sort of bug in syslog-ng: the affected VMs took 
an update to syslog-ng on October 24th, and from the daily logrotate 
early on the 25th onwards the syslog daemons were together generating 
about 500 IOPS of 4 kB writes continuously for the next 4 weeks, until 
the journals failed.


As a result, I reckon that, taking write amplification into account, the 
SSDs must each have written just over 1 PB over that period (way more 
than they are rated to handle), so I can't blame the SSDs.
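
As a rough cross-check of that kind of estimate (the number below only covers the
client-side volume; replication, journal writes and the drive's internal write
amplification all multiply it further), the raw write stream works out to:

# ~500 IOPS x 4 kB, sustained for ~28 days, expressed in TB of client writes
awk 'BEGIN { printf "%.1f TB\n", 500 * 4096 * 86400 * 28 / 1e12 }'

Actual NAND wear can then be compared against the drive's own counters; for
example, Samsung SSDs typically expose SMART attributes along these lines
(/dev/sdX is a placeholder):

smartctl -A /dev/sdX | egrep 'Wear_Leveling_Count|Total_LBAs_Written'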


I do have graphs tracking various metrics for the Ceph cluster, 
including IOPs, latency, and read/write throughput - which is how I 
worked out what happened afterwards - but unfortunately I didn't have 
any alerting set up to warn me when there were anomalies in the graphs, 
and I wasn't proactively looking at the graphs on a regular basis.


So I think there is a lesson to be learned here... even if you have 
correctly spec'd your SSD journals in terms of endurance for the 
anticipated level of write activity in a cluster, it's still important 
to keep an eye on ensuring that the write activity matches expectations, 
as it's quite easy for a misbehaving VM to severely drain the life 
expectancy of SSDs by generating 4k write IOs as quickly as it can for a 
long period of time!
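
Even without full graph-based alerting, a couple of quick checks make this kind
of runaway writer easy to spot; for example (intervals and thresholds are
whatever fits your own baseline):

# per-pool client IO rates as seen by the cluster (also summarised in "ceph -s")
ceph osd pool stats

# per-device write IOPS on the OSD/journal hosts, refreshed every 5 seconds
iostat -xm 5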


I have now replaced all 3 journals with 240 GB Samsung SM863 SSDs, which 
were only about twice the cost of the smaller 850 Pros. And I'm already 
noticing a massive performance improvement (reduction in write latency, 
and higher IOPs). So I'm not too upset about having unnecessarily killed 
the 850 Pros. But I thought it was worth sharing the experience...


FWIW the OSDs themselves are on 1TB Samsung 840 Evos, which I have been 
happy with so far (they've been going for about 18 months at this stage).


Alex



Re: [ceph-users] SSD journals killed by VMs generating 500 IOPs (4kB) non-stop for a month, seemingly because of a syslog-ng bug

2015-11-22 Thread Robert LeBlanc

There have been numerous reports on this mailing list of the Samsung EVOs
and Pros failing far before their expected wear. This is most likely due
to the 'uncommon' workload of Ceph: the controllers of those drives
are not really designed to handle the continuous direct sync writes
that Ceph does. Because of this, they can fail without warning
(controller failure rather than MLC failure).
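
For what it's worth, the usual quick way to see how a drive copes with that
access pattern before trusting it with a journal is a single-job O_SYNC 4k
write test straight against the device (destructive, so only on a blank disk;
/dev/sdX is a placeholder), something along the lines of:

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=1 --iodepth=1 --runtime=60 --time_based --name=journal-test

Consumer drives often collapse to a few hundred IOPS on this test, while the
DC-class drives mentioned above typically stay in the thousands or tens of
thousands.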

We have tested the performance of the Micron M600 drives and as long
as you don't fill them up, they perform like the Intel line. I just
don't know if they will die prematurely like a lot of the Samsungs
have. We have a load of Intel S3500s that we can put in if they start
failing so I'm not too worried at the moment.

The only drives that I've heard really good things about are the Intel
S3700 (and I suspect the S3600 and S3500s could be used as well if you
take some additional precautions) and the Samsung DC PROs (has to have
the DC and PRO in the name). The Micron M600s are a good value and
have decent performance and I plan on keeping the list informed about
them as time goes on.

With a cluster as idle as yours it may not make that much of a
difference. Where we are pushing 1,000s of IOPS all the time, we have
a challenge if the SSDs can't take the load.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Sun, Nov 22, 2015 at 10:40 AM, Alex Moore  wrote:
> I just had 2 of the 3 SSD journals in my small 3-node cluster fail within 24
> hours of each other (not fun, although thanks to a replication factor of 3x,
> at least I didn't lose any data). The journals were 128 GB Samsung 850 Pros.
> However I have determined that it wasn't really their fault...
>
> This is a small Ceph cluster running just a handful of relatively idle Qemu
> VMs using librbd for storage, and I had originally estimated that based on
> my low expected volume of write IO the Samsung 850 Pro journals would last
> at least 5 years (which would have been plenty). I still think that estimate
> was correct, but the reason they died prematurely (in reality they lasted 15
> months) seems to have been that a number of my VMs had been hammering their
> disks continuously for almost a month, and I only noticed retrospectively
> after the journals had died. I tracked it back to some sort of bug in
> syslog-ng: the affected VMs took an update to syslog-ng on October 24th, and
> from the daily logrotate early on the 25th onwards the syslog daemons were
> together generating about 500 IOPS of 4 kB writes continuously for the next
> 4 weeks, until the journals failed.
>
> As a result, I reckon that, taking write amplification into account, the SSDs
> must each have written just over 1 PB over that period (way more than they
> are rated to handle), so I can't blame the SSDs.
>
> I do have graphs tracking various metrics for the Ceph cluster, including
> IOPs, latency, and read/write throughput - which is how I worked out what
> happened afterwards - but unfortunately I didn't have any alerting set up to
> warn me when there were anomalies in the graphs, and I wasn't proactively
> looking at the graphs on a regular basis.
>
> So I think there is a lesson to be learned here... even if you have
> correctly spec'd your SSD journals in terms of endurance for the anticipated
> level of write activity in a cluster, it's still important to keep an eye on
> ensuring that the write activity matches expectations, as it's quite easy
> for a misbehaving VM to severely drain the life expectancy of SSDs by
> generating 4k write IOs as quickly as it can for a long period of time!
>
> I have now replaced all 3 journals with 240 GB Samsung SM863 SSDs, which
> were only about twice the cost of the smaller 850 Pros. And I'm already
> noticing a massive performance improvement (reduction in write latency, and
> higher IOPs). So I'm not too upset about having unnecessarily killed the 850
> Pros. But I thought it was worth sharing the experience...
>
> FWIW the OSDs themselves are on 1TB