Quick correction/clarification about ZFS and large blocks: ZFS can and will 
write in 1MB or larger blocks, but only with the latest versions that have large 
block support enabled (I am not sure whether ZoL has it). By default, block 
aggregation is limited to 128KB. The rest of my post (about multiple vdevs, 
SLOG, etc.) stands.



Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Sent: April-24-15 5:03 PM
To: J David; Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

The ZFS recordsize does NOT equal the size of the write to disk; ZFS will write 
to disk in whatever size it feels is optimal. During a sequential write, ZFS 
will easily write in 1MB blocks or greater.

In a spinning-rust Ceph setup like yours, getting the most out of it will 
require higher IO depths. In this case, increasing the number of vdevs ZFS sees 
might help. Instead of a single vdev on top of a single monolithic 32TB rbd 
volume, how about a striped ZFS setup with 8 vdevs on top of 8 smaller 4TB rbd 
volumes?

Also, what sort of SSD are you using for your ZIL/SLOG? Just as there are 
many bad SSDs for a Ceph journal, many of the same performance guidelines apply 
to the ZIL/SLOG as well.


Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: April-24-15 1:41 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk <n...@fisk.me.uk> wrote:
> 7.2k drives tend to do about 80 iops at 4kb IO sizes, as the IO size 
> increases the number of iops will start to fall. You will probably get 
> around 70 iops for 128kb. But please benchmark your raw disks to get 
> some accurate numbers if needed.
> Next when you use on-disk journals you write 1st to the journal and 
> then write the actual data. There is also a small levelDB write which 
> stores ceph metadata so depending on IO size you will get slightly 
> less than half the native disk performance.
> You then have 2 copies, as Ceph won't ACK until both copies have been 
> written the average latency will tend to stray upwards.

What is the purpose of the journal if Ceph waits for the actual write to 
complete anyway?

I.e. with a hardware raid card with a BBU, the raid card tells the host that 
the data is guaranteed safe as soon as it has been written to the BBU.

Does this also mean that all the writing internal to ceph happens synchronously?

I.e. all these operations are serialized:

copy1-journal-write -> copy1-data-write -> copy2-journal-write -> 
copy2-data-write -> OK, client, you're done.

Since copy1 and copy2 are on completely different physical hardware, shouldn't 
those operations be able to proceed more or less independently?  And shouldn't 
the client be done as soon as the journal is written?  I.e.:

copy1-journal-write -v- copy1-data-write
copy2-journal-write -|- copy2-data-write
                     +-> OK, client, you're done

If so, shouldn't the effective latency be that of one operation, not four?  
Plus all the non-trivial overhead for scheduling, LevelDB, network latency, etc.

For the "getting jackhammered by zillions of clients" case, your estimate 
probably holds more true, because even if writes aren't in the critical path 
they still happen and sooner or later the drive runs out of IOPs and things 
start getting in each others' way.  But for a single client, single thread case 
where the cluster is *not* 100% utilized, shouldn't the effective latency be 
much less?
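
To make the question concrete, here is a rough Python sketch of the two 
orderings described above. It is only a toy model: the function names and the 
2.6 ms per-write figure are assumptions for illustration, and it says nothing 
about how Ceph actually schedules its journal and data writes.

```python
# Toy latency model for the two orderings sketched above.
# Assumption: each journal or data write has a fixed, known cost.


def serialized_latency_ms(journal_ms, data_ms, copies=2):
    """Every journal and data write on every copy runs back to back."""
    return copies * (journal_ms + data_ms)


def journal_ack_latency_ms(journal_ms, copies=2):
    """Copies proceed in parallel; the client is released once the
    slowest journal write has finished."""
    return max([journal_ms] * copies)


if __name__ == "__main__":
    write_ms = 2.6  # hypothetical per-write latency, not a measured Ceph value
    print("serialized :", serialized_latency_ms(write_ms, write_ms), "ms")
    print("journal ack:", journal_ack_latency_ms(write_ms), "ms")
```

With equal write costs, the first model is four times slower than the second, 
which is the gap being asked about.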

The other thing about this that I don't quite understand, and the thing that 
initially had me questioning whether there was something wrong on the Ceph 
side, is that your estimate is based primarily on the mechanical capabilities 
of the drives.  Yet, in practice, when the Ceph cluster is tapped out for I/O 
in this situation, iostat says none of the physical drives are more than 10-20% 
busy and doing 10-20 IOPs to write a couple of MB/sec.  And those are the 
"loaded" ones at any given time.  Many are <10%.  In fact, *none* of the 
hardware on the Ceph side is anywhere close to fully utilized.  If the 
performance of this cluster is limited by its hardware, shouldn't there be some 
evidence of that?

To illustrate, I marked a physical drive out and waited for things to settle 
down, then ran fio on the physical drive (128KB randwrite, numjobs=1, 
iodepth=1).  It yields a very different picture of the drive's physical limits.

The drive during "maxxed out" client writes:

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.20    4.80   13.40   23.60   2505.65    277.94      0.26  14.07    16.08    13.34   6.68  12.16

The same drive under fio:

Device:  rrqm/s  wrqm/s     r/s     w/s   rkB/s     wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl        0.00    0.00    0.00  377.50    0.00  48320.00    256.00      0.99   2.62     0.00     2.62   2.62  98.72
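
As a sanity check on the two iostat samples above, the following Python sketch 
recomputes the average request size and an approximate %util from the other 
columns. It assumes the usual sysstat conventions (avgrq-sz in 512-byte 
sectors, svctm in milliseconds); the function names are made up for this 
example.

```python
SECTOR_BYTES = 512  # assumption: iostat reports avgrq-sz in 512-byte sectors


def avg_request_kb(avgrq_sz_sectors):
    """Convert iostat's avgrq-sz (sectors) to kilobytes."""
    return avgrq_sz_sectors * SECTOR_BYTES / 1024


def estimated_util_pct(r_s, w_s, svctm_ms):
    """Approximate %util as requests per second times service time."""
    return (r_s + w_s) * svctm_ms / 10.0


# Values copied from the two samples above.
samples = {
    "client load": dict(r_s=4.80, w_s=13.40, avgrq_sz=277.94, svctm=6.68),
    "fio":         dict(r_s=0.00, w_s=377.50, avgrq_sz=256.00, svctm=2.62),
}

for name, s in samples.items():
    size_kb = avg_request_kb(s["avgrq_sz"])
    util = estimated_util_pct(s["r_s"], s["w_s"], s["svctm"])
    print(f"{name}: avg request {size_kb:.1f} KB, estimated util {util:.1f}%")
```

Both estimates land close to the %util column that iostat reported, so the 
numbers in each sample are at least internally consistent.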

You could make the argument that we are seeing half the throughput on the 
same test because ceph is write-doubling (journal+data) and the reason no drive 
is highly utilized is because the load is being spread out.  So each of 28 
drives actually is being maxed out, but only 3.5% of the time, leading to low 
apparent utilization because the measurement interval is too long.  And maybe 
that is exactly what is happening.  For that to be true, the two OSD writes 
would indeed have to happen in parallel, not sequentially.  (Which is what it's 
supposed to do, I believe?)
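
A rough check of that "maxed out, but only 3.5% of the time" figure, as a 
Python sketch. It assumes a single-threaded client keeps a fixed number of 
drives saturated at any instant and that the load rotates evenly over all 28 
drives; the function name is hypothetical.

```python
def avg_busy_pct(n_drives, busy_at_once=1):
    """Average %util per drive if `busy_at_once` drives are saturated at any
    instant and the load rotates evenly over `n_drives` drives."""
    return 100.0 * busy_at_once / n_drives


if __name__ == "__main__":
    n = 28
    for k in (1, 2):
        print(f"{k} of {n} drives busy at once: {avg_busy_pct(n, k):.1f}% average util")
```

One busy drive at a time gives roughly the 3.5% quoted above; two drives busy 
in parallel (one per copy) would give about 7%.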

But why does a client have to wait for both writes?  Isn't the journal enough?  
If it isn't, shouldn't it be?  And if it isn't, wouldn't moving to even an 
infinitely fast SSD journal only double the performance, since the second write 
still has to happen?

In case they are of interest, the native drive fio results are below.

testfile: (groupid=0, jobs=1): err= 0: pid=20562
  write: io=30720MB, bw=47568KB/s, iops=371 , runt=661312msec
    slat (usec): min=13 , max=4087 , avg=34.08, stdev=25.36
    clat (usec): min=2 , max=736605 , avg=2650.22, stdev=6368.02
     lat (usec): min=379 , max=736640 , avg=2684.80, stdev=6368.00
    clat percentiles (usec):
     |  1.00th=[  466],  5.00th=[ 1576], 10.00th=[ 1800], 20.00th=[ 1992],
     | 30.00th=[ 2128], 40.00th=[ 2224], 50.00th=[ 2320], 60.00th=[ 2416],
     | 70.00th=[ 2512], 80.00th=[ 2640], 90.00th=[ 2864], 95.00th=[ 3152],
     | 99.00th=[10688], 99.50th=[20352], 99.90th=[29056], 99.95th=[29568],
     | 99.99th=[452608]
    bw (KB/s)  : min= 1022, max=88910, per=100.00%, avg=47982.41, stdev=7115.74
    lat (usec) : 4=0.01%, 500=1.52%, 750=1.23%, 1000=0.14%
    lat (msec) : 2=17.32%, 4=76.47%, 10=1.41%, 20=1.40%, 50=0.49%
    lat (msec) : 100=0.01%, 250=0.01%, 500=0.01%, 750=0.01%
  cpu          : usr=0.56%, sys=1.21%, ctx=252044, majf=0, minf=21
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued    : total=r=0/w=245760/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=30720MB, aggrb=47567KB/s, minb=47567KB/s, maxb=47567KB/s, mint=661312msec, maxt=661312msec

Disk stats (read/write):
  sdl: ios=0/245789, merge=0/0, ticks=0/666944, in_queue=666556, util=98.28%
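
For anyone cross-checking the fio summary above, this short Python sketch 
rederives the IOPS and bandwidth from the totals, and compares them with what 
a queue depth of 1 implies from the average latency. The helper names are 
invented for this example; the inputs are copied from the fio output.

```python
def iops(total_ios, runtime_ms):
    """I/Os completed per second over the whole run."""
    return total_ios / (runtime_ms / 1000.0)


def bandwidth_kb_s(io_mb, runtime_ms):
    """Throughput in KB/s from total MB written and runtime."""
    return io_mb * 1024 / (runtime_ms / 1000.0)


def iops_from_latency(avg_lat_usec, iodepth=1):
    """With a fixed queue depth, IOPS is roughly iodepth / average latency."""
    return iodepth * 1_000_000 / avg_lat_usec


if __name__ == "__main__":
    print(f"iops from totals   : {iops(245760, 661312):.1f}")
    print(f"bandwidth (KB/s)   : {bandwidth_kb_s(30720, 661312):.0f}")
    print(f"iops from avg lat  : {iops_from_latency(2684.80):.1f}")
```

The three figures agree with each other and with the reported iops=371 and 
bw=47568KB/s, which is what you would expect for a drive serving one 128KB 
write at a time.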

ceph-users mailing list