Re: [ceph-users] EC Pool and Cache Tier Tuning

2015-03-10 Thread Steffen W Sørensen
On 09/03/2015, at 22.44, Nick Fisk n...@fisk.me.uk wrote:

 Either option #1 or #2, depending on whether your data has hot spots or you
 need to use EC pools. I'm finding that the cache tier can actually slow things
 down, depending on how much data is in the cache tier vs. on the slower tier.
 
 Writes will be about the same speed for both solutions; reads will be a lot
 faster using a cache tier if the data resides in it.
Of course, a high cache tier miss rate would be a 'hit' on performance :)

Presumably RBD client-side and OS page caching help read OPs to some degree,
though memory can't cache as much data as a larger SSD.
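
For reference, the librbd client cache is what's in play here; it is enabled
via ceph.conf, roughly like this (values are only illustrative, not a
recommendation):

  [client]
      rbd cache = true
      rbd cache size = 67108864        # 64MB per client, example value only
      rbd cache max dirty = 50331648   # example value only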

/Steffen 

 





Re: [ceph-users] EC Pool and Cache Tier Tuning

2015-03-09 Thread Nick Fisk
Either option #1 or #2, depending on whether your data has hot spots or you
need to use EC pools. I'm finding that the cache tier can actually slow things
down, depending on how much data is in the cache tier vs. on the slower tier.

Writes will be about the same speed for both solutions; reads will be a lot
faster using a cache tier if the data resides in it.
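
How aggressively the tier flushes and evicts is governed by the usual cache
pool settings, something along these lines (pool name and values are only
examples, not a recommendation):

  ceph osd pool set cachepool target_max_bytes 400000000000
  ceph osd pool set cachepool cache_target_dirty_ratio 0.4
  ceph osd pool set cachepool cache_target_full_ratio 0.8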

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Steffen Winther
 Sent: 09 March 2015 20:47
 To: ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] EC Pool and Cache Tier Tuning
 
 Nick Fisk nick@... writes:
 
  My Ceph cluster consists of 4 nodes, each with the following:-
  10x 3TB WD Red Pro disks (7200rpm) - EC pool k=3 m=3
  2x S3700 100GB SSDs (20k write IOPs) for HDD journals
  1x S3700 400GB SSD (35k write IOPs) for cache tier - 3x replica
 If I have the following 4x node config:
 
   2x S3700 200GB SSD's
   4x 4TB HDDs
 
 Which config should I aim for to optimize RBD write/read OPs:
 
   1x S3700 200GB SSD for 4x journals
   1x S3700 200GB cache tier
   4x 4TB HDD OSD disk
 
 or:
 
   2x S3700 200GB SSD for 2x journals
   4x 4TB HDD OSD disk
 
 or:
 
   2x S3700 200GB cache tier
   4x 4TB HDD OSD disk
 
 /Steffen
 






Re: [ceph-users] EC Pool and Cache Tier Tuning

2015-03-09 Thread Steffen Winther
Nick Fisk nick@... writes:

 My Ceph cluster consists of 4 nodes, each with the following:-
 10x 3TB WD Red Pro disks (7200rpm) - EC pool k=3 m=3
 2x S3700 100GB SSDs (20k write IOPs) for HDD journals
 1x S3700 400GB SSD (35k write IOPs) for cache tier - 3x replica
If I have the following 4x node config:

  2x S3700 200GB SSD's
  4x 4TB HDDs

Which config should I aim for to optimize RBD write/read OPs:

  1x S3700 200GB SSD for 4x journals
  1x S3700 200GB cache tier
  4x 4TB HDD OSD disk

or:

  2x S3700 200GB SSD for 2x journals
  4x 4TB HDD OSD disk

or:

  2x S3700 200GB cache tier
  4x 4TB HDD OSD disk

/Steffen



[ceph-users] EC Pool and Cache Tier Tuning

2015-03-07 Thread Nick Fisk
Hi All,

I have been experimenting with EC pools and cache tiers to make them more
useful for more active data sets on RBD volumes, and I thought I would share
my findings so far, as they have made quite a significant difference.

My Ceph cluster consists of 4 nodes, each with the following:-
12x 2.1GHz Xeon cores
32GB RAM
2x 10Gb networking, ALB bonded
10x 3TB WD Red Pro disks (7200rpm) - EC pool k=3 m=3
2x S3700 100GB SSDs (20k write IOPs) for HDD journals
1x S3700 400GB SSD (35k write IOPs) for cache tier - 3x replica

One thing I noticed with the default settings is that on a cache miss,
performance dropped significantly, to a level far below that of the 7200rpm
disks on their own. I also noticed that despite having around 500GB of cache
tier available (4x 400GB SSDs at 3x replica works out to roughly 533GB), I was
getting a lot more misses than I would expect for the amount of data I had on
the pool.

I had a theory that the default RBD object size of 4MB was the cause of both
of these anomalies. To put this to the test, I created a test EC pool and
cache pool and did some testing with different RBD order (object size) values.
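
Roughly speaking, the test pools were set up along these lines (a sketch only
- pool names, PG counts and the profile name here are made up, not my exact
commands):

  ceph osd erasure-code-profile set ec33 k=3 m=3
  ceph osd pool create ecpool 256 256 erasure ec33
  ceph osd pool create cachepool 256 256
  ceph osd tier add ecpool cachepool
  ceph osd tier cache-mode cachepool writeback
  ceph osd tier set-overlay ecpool cachepool
  ceph osd pool set cachepool hit_set_type bloom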

First though, I performed some basic benchmarking on one of the 7200RPM
disks and came up with the following (all values at IO depth of 1):-

IO Size   IOPs
4MB       25
1MB       52
256KB     73
64KB      81
4KB       83
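
Something like the fio job below reproduces this kind of per-disk test (a
sketch only - /dev/sdX is a placeholder, and bs is what you vary between runs):

  fio --name=disktest --filename=/dev/sdX --rw=randread --direct=1 \
      --bs=256k --iodepth=1 --runtime=60 --time_based --group_reporting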

As you can see, random IO performance really starts to drop off once you go
above about a 256KB IO size, while below that there are diminishing returns,
as bandwidth drops off dramatically with smaller IO sizes.
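
For context, multiplying the IOPs by the IO size gives the approximate
throughput at each size:

  4MB   x 25 = ~100 MB/s
  1MB   x 52 =  ~52 MB/s
  256KB x 73 =  ~18 MB/s
  64KB  x 81 =   ~5 MB/s
  4KB   x 83 = ~0.3 MB/s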

When using an EC pool, each object is split into shards which are spread
across the disks. I'm using a k=3 EC pool, so a 4MB object will be split into
data shards of about 1.3MB each, which, as can be seen above, is really not
the best IO size for random performance.

I then created 4 RBDs with the following object sizes: 4MB, 2MB, 1MB and
512KB, filled them with data, evicted the cache pool, and then used fio to
perform random 64KB reads. Results as follows:-

RBD Obj Size   Shard Size   64KB IOPs
4MB            1.33MB       24
2MB            0.66MB       38
1MB            0.33MB       51
512KB          0.17MB       58

As can be seen, there is a doubling of random IO performance between 4MB and
1MB object sizes, and looking at the shard sizes, this correlates quite nicely
with the disk benchmarks I did earlier. Going to a 512KB object size does
improve performance further, but the gains are starting to tail off.
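
Something along these lines would reproduce the test (a sketch only - the
image name is made up, and orders 22/21/20/19 correspond to 4MB/2MB/1MB/512KB
objects):

  rbd create test4m --pool ecpool --size 51200 --order 22
  # ... fill the image with data ...
  rados -p cachepool cache-flush-evict-all
  fio --name=randread64k --ioengine=rbd --clientname=admin --pool=ecpool \
      --rbdname=test4m --rw=randread --bs=64k --iodepth=1 \
      --runtime=60 --time_based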

The other benefit of using a smaller object size seems to be that the cache
tier is much more effective at caching hot blocks, as a single IO
promotes/evicts less data. I don't have any numbers on this yet, but I will
try to get fio to generate a hot-spot access pattern so I can produce some
reliable figures. From just using the RBDs, though, it certainly feels like
the cache is doing a much better job with 1MB object sizes.
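
fio's random_distribution option looks like the way to do that, e.g. a zipf
distribution so that a small fraction of the blocks receives most of the IO
(parameters here are hypothetical and untested):

  fio --name=hotspot --ioengine=rbd --clientname=admin --pool=ecpool \
      --rbdname=test1m --rw=randread --bs=64k --iodepth=1 \
      --random_distribution=zipf:1.2 --runtime=300 --time_based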

Incidentally, I looked at some RAID 6 write benchmarks with varying chunk
sizes, since a partial write means reading the whole stripe back. Most of
those benchmarks also show performance dropping off past 256/512KB chunk
sizes.

The next thing I tried was changing the read_expire parameter of the deadline
scheduler to 30ms, to make sure that reads are prioritised even more heavily
than by default. Again, I don't have numbers for this yet, but watching iostat
suggests that reads are completing with much more predictable latency.
Delaying the writes should not be much of a problem, as the journals buffer
them.
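
For reference, the deadline tunables live in sysfs, per OSD disk (the device
name below is a placeholder):

  cat /sys/block/sdX/queue/scheduler            # confirm deadline is active
  echo 30 > /sys/block/sdX/queue/iosched/read_expire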

To summarize: trying to get your EC shards down to around 256KB seems to
improve random IO at the cost of some bandwidth. If your EC pool holds data
that is rarely accessed, or only sees large-block IO, then the default object
size probably won't have much of an impact. There is also the fact that you
will now have 4 times more objects; I'm not sure what the impact of that is,
maybe someone can comment?
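
As a rough rule of thumb (my own arithmetic, so treat it as such): shard size
is roughly object size divided by k, so with k=3 the nearest RBD object sizes
are:

  1MB objects (rbd order 20)   -> ~341KB shards
  512KB objects (rbd order 19) -> ~171KB shards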

If anyone notices that I have made any mistakes in my tests or assumptions,
please let me know.

Nick



