Re: [ceph-users] multiple journals on SSD

2016-07-13 Thread George Shuklin

Hello.

On 07/13/2016 03:31 AM, Christian Balzer wrote:

Hello,

did you actually read my full reply last week, the in-line parts,
not just the top bit?

http://www.spinics.net/lists/ceph-users/msg29266.html

On Tue, 12 Jul 2016 16:16:09 +0300 George Shuklin wrote:


Yes, linear IO speed was a concern during the benchmark. I cannot predict how
much linear IO will be generated by clients (compared to IOPS), so we are
going to balance HDD OSDs per SSD according to real usage. If users generate
too much random IO, we will raise the HDD/SSD ratio; if they generate more
linear-write load, we will reduce that number. I plan to do it by reserving
space for 'more HDD' or 'more SSD' in the planned servers - they will go to
production with ~50% slot utilization.


Journal writes are always "linear", in a fashion.
And Ceph journals only see writes, never reads.

So what your SSD sees is n sequential (with varying lengths, mind ya)
write streams and that's all.
Where n is the number of journals.


Yes, I knew this. I mean that under real production load there is going
to be so much random IO (directed at the OSDs) that the HDD inside each
OSD will not be able to accept much linear writing (it has to serve random
writes & reads in parallel, and this significantly reduces the linear IO
speed of the same device). I expect the HDDs to be busy enough with random
writes/reads that they never saturate the SSD's linear write performance.

My main concern is that random IO for an OSD includes not only writes but
reads too, and cold random reads will slow HDD performance significantly.
In my previous experience, any weekly cronjob on a server with backups (or
just 'find /') causes bad spikes of cold reads, and that drastically
diminishes HDD performance.


As I wrote last week, reads have nothing to do with journals.


Reads have nothing to do with journals except for one thing: if the underlying
HDD is very busy with reads, it will accept write operations slowly. And
because it accepts them slowly, the SSD holding the journal will not get
much IO of any kind.




Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread Christian Balzer

Hello,

did you actually read my full reply last week, the in-line parts,
not just the top bit?

http://www.spinics.net/lists/ceph-users/msg29266.html

On Tue, 12 Jul 2016 16:16:09 +0300 George Shuklin wrote:

> Yes, linear IO speed was a concern during the benchmark. I cannot predict how
> much linear IO will be generated by clients (compared to IOPS), so we are
> going to balance HDD OSDs per SSD according to real usage. If users generate
> too much random IO, we will raise the HDD/SSD ratio; if they generate more
> linear-write load, we will reduce that number. I plan to do it by reserving
> space for 'more HDD' or 'more SSD' in the planned servers - they will go to
> production with ~50% slot utilization.
> 
Journal writes are always "linear", in a fashion.
And Ceph journals only see writes, never reads.

So what your SSD sees is n sequential (with varying lengths, mind ya)
write streams and that's all.
Where n is the number of journals.

> My main concern is that random IO for an OSD includes not only writes but
> reads too, and cold random reads will slow HDD performance significantly.
> In my previous experience, any weekly cronjob on a server with backups (or
> just 'find /') causes bad spikes of cold reads, and that drastically
> diminishes HDD performance.
>
As I wrote last week, reads have nothing to do with journals.

To improve random, cold reads your options are:

1. Enough RAM on the OSD storage nodes to hold all dentries and other SLAB
bits in memory; this will dramatically reduce seeks.
2. Cache tiering, correctly configured and sized of course.
3. Read-ahead settings for RBD or inside your client VMs.

Lastly, anything that keeps writes from competing with reads for the few HDD
IOPS there are helps, so journal SSDs, controller HW caches and again cache
pools.
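As a rough illustration of points 1 and 3, the knobs involved look something
like this (device names and values are only examples, not recommendations -
test against your own workload):

# On the OSD node: make the kernel keep dentries/inodes cached longer (point 1)
echo 10 > /proc/sys/vm/vfs_cache_pressure

# Inside a client VM: raise the read-ahead on the RBD-backed disk (point 3)
echo 4096 > /sys/block/vda/queue/read_ahead_kb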
 
Christian

> (TL;DR: I don't believe we will ever have conditions where the HDDs can
> sustain more than a few dozen MB/s of writes.)
> 
> Thank you for the advice.
> 
> On 07/12/2016 04:03 PM, Vincent Godin wrote:
> > Hello.
> >
> > I've been testing an Intel 3500 as a journal store for a few HDD-based
> > OSDs. I stumbled on issues with multiple partitions (>4) and udev (sda5,
> > sda6, etc. sometimes do not appear after partition creation). And I'm
> > thinking that partitions are not that useful for OSD management, because
> > Linux does not allow rereading the partition table while it contains
> > volumes that are in use.
> >
> > So my question: how do you store many journals on an SSD? My initial thoughts:
> >
> > 1) a filesystem with file-based journals
> > 2) LVM with volumes
> >
> > Anything else? Best practice?
> >
> > P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
> >
> > Hello,
> >
> > I would advise you not to use 1 SSD for 16 HDDs. The Ceph journal is not
> > only a journal but a write cache during operation. I had that kind of
> > configuration with 1 SSD for 20 SATA HDDs. With a Ceph bench, I noticed
> > that my rate was limited to between 350 and 400 MB/s. In fact, iostat
> > showed me that my SSD was 100% utilised at a rate of 350-400 MB/s.
> >
> > If you consider that a SATA HDD can have a maximum average rate of
> > 100 MB/s, you need to configure one SSD (which can write at up to
> > 400 MB/s) for every 4 SATA HDDs.
> >
> > Vincent
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread Udo Lembke
Hi Vincent,

On 12.07.2016 15:03, Vincent Godin wrote:
> Hello.
>
> I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
> I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
> sometimes do not appear after partition creation). And I'm thinking that
> partitions are not that useful for OSD management, because Linux does not
> allow rereading the partition table while it contains volumes that are in use.
>
> So my question: how do you store many journals on an SSD? My initial thoughts:
>
> 1) a filesystem with file-based journals
> 2) LVM with volumes
Both 1 and 2 have a performance impact.
I use a trick and rely on partition labels for the journal:
[osd]
osd_journal = /dev/disk/by-partlabel/journal-$id

Due to this I'm independent of Linux device naming.
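For example, labelled journal partitions could be created roughly like this
(device name and sizes are only examples):

# create two 20G journal partitions on the SSD with labels matching the
# osd_journal setting above
sgdisk --new=1:0:+20G --change-name=1:journal-0 /dev/sdb
sgdisk --new=2:0:+20G --change-name=2:journal-1 /dev/sdb
partprobe /dev/sdb && udevadm settle
ls -l /dev/disk/by-partlabel/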


Udo


Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread George Shuklin
Yes, linear IO speed was a concern during the benchmark. I cannot predict how
much linear IO will be generated by clients (compared to IOPS), so we are
going to balance HDD OSDs per SSD according to real usage. If users generate
too much random IO, we will raise the HDD/SSD ratio; if they generate more
linear-write load, we will reduce that number. I plan to do it by reserving
space for 'more HDD' or 'more SSD' in the planned servers - they will go to
production with ~50% slot utilization.


My main concern is that random IO for an OSD includes not only writes but
reads too, and cold random reads will slow HDD performance significantly.
In my previous experience, any weekly cronjob on a server with backups (or
just 'find /') causes bad spikes of cold reads, and that drastically
diminishes HDD performance.


(TL;DR: I don't believe we will ever have conditions where the HDDs can
sustain more than a few dozen MB/s of writes.)


Thank you for the advice.

On 07/12/2016 04:03 PM, Vincent Godin wrote:

Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
sometimes do not appear after partition creation). And I'm thinking that
partitions are not that useful for OSD management, because Linux does not
allow rereading the partition table while it contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.

Hello,

I would advise you not to use 1 SSD for 16 HDDs. The Ceph journal is not
only a journal but a write cache during operation. I had that kind of
configuration with 1 SSD for 20 SATA HDDs. With a Ceph bench, I noticed that
my rate was limited to between 350 and 400 MB/s. In fact, iostat showed me
that my SSD was 100% utilised at a rate of 350-400 MB/s.

If you consider that a SATA HDD can have a maximum average rate of 100 MB/s,
you need to configure one SSD (which can write at up to 400 MB/s) for every
4 SATA HDDs.


Vincent




Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread Guillaume Comte
2016-07-12 15:03 GMT+02:00 Vincent Godin :

> Hello.
>
> I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
> I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
> sometimes do not appear after partition creation). And I'm thinking that
> partitions are not that useful for OSD management, because Linux does not
> allow rereading the partition table while it contains volumes that are in use.
>
On my side, if I launch "partprobe" after creating the partition with fdisk
on a disk which has mounted partitions, then it works.
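Roughly, the sequence is (sketch only, /dev/sdb is just an example):

fdisk /dev/sdb       # create the new journal partition interactively
partprobe /dev/sdb   # ask the kernel to re-read the partition table
udevadm settle       # wait for udev to create the new /dev/sdb5, sdb6, ... nodes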



>
> So my question: how do you store many journals on an SSD? My initial thoughts:
>
> 1) a filesystem with file-based journals
> 2) LVM with volumes
>
> Anything else? Best practice?
>
> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
>
> Hello,
>
> I would advise you not to use 1 SSD for 16 HDDs. The Ceph journal is not only
> a journal but a write cache during operation. I had that kind of configuration
> with 1 SSD for 20 SATA HDDs. With a Ceph bench, I noticed that my rate was
> limited to between 350 and 400 MB/s. In fact, iostat showed me that my SSD was
> 100% utilised at a rate of 350-400 MB/s.
>
> If you consider that a SATA HDD can have a maximum average rate of 100 MB/s,
> you need to configure one SSD (which can write at up to 400 MB/s) for every
> 4 SATA HDDs.
>
> Vincent
>
>
>


-- 
*Guillaume Comte*
06 25 85 02 02  | guillaume.co...@blade-group.com

90 avenue des Ternes, 75 017 Paris


Re: [ceph-users] multiple journals on SSD

2016-07-12 Thread Vincent Godin
Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
sometimes do not appear after partition creation). And I'm thinking that
partitions are not that useful for OSD management, because Linux does not
allow rereading the partition table while it contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.

Hello,

I would advise you not to use 1 SSD for 16 HDDs. The Ceph journal is not
only a journal but a write cache during operation. I had that kind of
configuration with 1 SSD for 20 SATA HDDs. With a Ceph bench, I noticed that
my rate was limited to between 350 and 400 MB/s. In fact, iostat showed me
that my SSD was 100% utilised at a rate of 350-400 MB/s.

If you consider that a SATA HDD can have a maximum average rate of 100 MB/s,
you need to configure one SSD (which can write at up to 400 MB/s) for every
4 SATA HDDs.

Vincent


Re: [ceph-users] multiple journals on SSD

2016-07-08 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Zoltan Arnold Nagy
> Sent: 08 July 2016 08:51
> To: Christian Balzer <ch...@gol.com>
> Cc: ceph-users <ceph-users@lists.ceph.com>; n...@fisk.me.uk
> Subject: Re: [ceph-users] multiple journals on SSD
> 
> Hi Christian,
> 
> 
> On 08 Jul 2016, at 02:22, Christian Balzer <mailto:ch...@gol.com> wrote:
> 
> 
> Hello,
> 
> On Thu, 7 Jul 2016 23:19:35 +0200 Zoltan Arnold Nagy wrote:
> 
> 
> Hi Nick,
> 
> How large NVMe drives are you running per 12 disks?
> 
> In my current setup I have 4xP3700 per 36 disks but I feel like I could
> get by with 2… Just looking for community experience :-)
> This is funny, because you ask Nick about the size and don't mention it
> yourself. ^o^
> 
> You are absolutely right, my bad. We are using the 400GB models.
> 
> 
> As I speculated in my reply, it's the 400GB model and Nick didn't dispute
> that.
> And I shall assume the same for you.
> 
> You could get by with 2 of the 400GB ones, but that depends on a number of
> things.
> 
> 1. What's your use case, typical usage pattern?
> Are you doing a lot of large sequential writes or is it mostly smallish
> I/Os?
> HDD OSDs will clock in at about 100MB/s with OSD bench, but realistically
> not see more than 50-60MB/s, so with 18 of them per one 400GB P3700 you're
> about on par.
> 
> Our usage varies so much that it’s hard to put a finger on it.
> Some days it’s this, some days it’s that. Internal cloud with a bunch of
> researchers.

What I have seen is that where something like a SAS/SATA SSD has an almost
linear response of latency against load, NVMes start off with a shallower
curve. You probably want to look at how hard your current journals are getting
hit. If they are much above 25-50% utilization I would hesitate to put much
more load on them for latency reasons, unless you are just going for big
buffered write performance. You could probably drop down to maybe using 3 for
every 12 disks though?
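For example, something like this shows how hard the journal devices are being
hit (device names are placeholders for your journal SSD/NVMe):

iostat -xm 5 /dev/nvme0n1 /dev/sdb
# watch the %util and w_await columns for the journal devices under your
# real workload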

This set of slides was very interesting when I was planning my latest nodes.

https://indico.cern.ch/event/320819/contributions/742938/attachments/618990/851639/SSD_Benchmarking_at_CERN__HEPiX_Fall_2014.pdf


> 
> 
> 
> 2. What's your network setup? If you have more than 20Gb/s to that node,
> your journals will likely become the (write) bottleneck.
> But that's only the case with backfills or again largish sequential writes
> of course.
> Currently it’s bonded (LACP) 2x10Gbit for both the front and backend, but 
> soon going to
> upgrade to 4x10Gbit front and 2x100Gbit back. (Already have a test cluster 
> with this setup).
> 
> 
> 3. A repeat of sorts of the previous 2 points, this time with the focus on
> endurance. How much data are you writing per day to an average OSD?
> With 18 OSDs per 400GB P3700 NVMe you will want that to be less than
> 223GB/day/OSD.
> 
> We’re growing at around 100TB/month spread over ~130 OSDs at the moment which 
> gives me ~25GB/OSD
> (I wish it would be that uniformly distributed :))
> 
> 
> 4. As usual, failure domains. In the case of an NVMe failure you'll lose
> twice the amount of OSDs.
> Right, but having a lot of nodes (20+) mitigates this somewhat.
> 
> 
> That all being said, at 36 OSDs I'd venture you'll run out of CPU steam
> (with small write IOPS) before your journals become the bottleneck.
> I agree, but that has not been the case so far.
> 
> 
> Christian
> 
> 
> Cheers,
> Zoltan
> [snip]
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> mailto:ch...@gol.com  Global OnLine Japan/Rakuten Communications
> http://www.gol.com/




Re: [ceph-users] multiple journals on SSD

2016-07-08 Thread Zoltan Arnold Nagy
Hi Christian,

> On 08 Jul 2016, at 02:22, Christian Balzer wrote:
>
> Hello,
>
> On Thu, 7 Jul 2016 23:19:35 +0200 Zoltan Arnold Nagy wrote:
>
>> Hi Nick,
>>
>> How large NVMe drives are you running per 12 disks?
>>
>> In my current setup I have 4xP3700 per 36 disks but I feel like I could
>> get by with 2… Just looking for community experience :-)
>
> This is funny, because you ask Nick about the size and don't mention it
> yourself. ^o^

You are absolutely right, my bad. We are using the 400GB models.

> As I speculated in my reply, it's the 400GB model and Nick didn't dispute
> that.
> And I shall assume the same for you.
>
> You could get by with 2 of the 400GB ones, but that depends on a number of
> things.
>
> 1. What's your use case, typical usage pattern?
> Are you doing a lot of large sequential writes or is it mostly smallish
> I/Os? HDD OSDs will clock in at about 100MB/s with OSD bench, but
> realistically not see more than 50-60MB/s, so with 18 of them per one
> 400GB P3700 you're about on par.

Our usage varies so much that it's hard to put a finger on it.
Some days it's this, some days it's that. Internal cloud with a bunch of
researchers.

> 2. What's your network setup? If you have more than 20Gb/s to that node,
> your journals will likely become the (write) bottleneck.
> But that's only the case with backfills or again largish sequential writes
> of course.

Currently it's bonded (LACP) 2x10Gbit for both the front and backend, but soon
going to upgrade to 4x10Gbit front and 2x100Gbit back. (Already have a test
cluster with this setup.)

> 3. A repeat of sorts of the previous 2 points, this time with the focus on
> endurance. How much data are you writing per day to an average OSD?
> With 18 OSDs per 400GB P3700 NVMe you will want that to be less than
> 223GB/day/OSD.

We're growing at around 100TB/month spread over ~130 OSDs at the moment, which
gives me ~25GB/OSD (I wish it would be that uniformly distributed :))

> 4. As usual, failure domains. In the case of an NVMe failure you'll lose
> twice the amount of OSDs.

Right, but having a lot of nodes (20+) mitigates this somewhat.

> That all being said, at 36 OSDs I'd venture you'll run out of CPU steam
> (with small write IOPS) before your journals become the bottleneck.

I agree, but that has not been the case so far.

> Christian
>
>> Cheers,
>> Zoltan
> [snip]
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/


Re: [ceph-users] multiple journals on SSD

2016-07-07 Thread Christian Balzer

Hello,

On Thu, 7 Jul 2016 23:19:35 +0200 Zoltan Arnold Nagy wrote:

> Hi Nick,
> 
> How large NVMe drives are you running per 12 disks?
> 
> In my current setup I have 4xP3700 per 36 disks but I feel like I could
> get by with 2… Just looking for community experience :-)
>
This is funny, because you ask Nick about the size and don't mention it
yourself. ^o^

As I speculated in my reply, it's the 400GB model and Nick didn't dispute
that.
And I shall assume the same for you.

You could get by with 2 of the 400GB ones, but that depends on a number of
things.

1. What's your use case, typical usage pattern?
Are you doing a lot of large sequential writes or is it mostly smallish
I/Os? 
HDD OSDs will clock in at about 100MB/s with OSD bench, but realistically
not see more than 50-60MB/s, so with 18 of them per one 400GB P3700 you're
about on par.
  

2. What's your network setup? If you have more than 20Gb/s to that node,
your journals will likely become the (write) bottleneck. 
But that's only the case with backfills or again largish sequential writes
of course.

3. A repeat of sorts of the previous 2 points, this time with the focus on
endurance. How much data are you writing per day to an average OSD?
With 18 OSDs per 400GB P3700 NVMe you will want that to be less than
223GB/day/OSD.

4. As usual, failure domains. In the case of an NVMe failure you'll lose
twice the amount of OSDs.
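
For reference, the 223GB/day/OSD figure in point 3 works out roughly as
follows, assuming the commonly quoted ~10 drive-writes-per-day rating for the
400GB P3700 (about 7.3PB written over 5 years):

   400GB x 10 DWPD        = ~4TB of journal writes per day per NVMe
   4000GB/day / 18 OSDs   = ~222GB/day/OSD

(Journal write amplification is ignored here, so treat it as an upper bound.)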


That all being said, at 36 OSDs I'd venture you'll run out of CPU steam
(with small write IOPS) before your journals become the bottleneck.
 
Christian

> Cheers,
> Zoltan
> 
[snip]


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] multiple journals on SSD

2016-07-07 Thread Zoltan Arnold Nagy
Hi Nick,

How large NVMe drives are you running per 12 disks?

In my current setup I have 4xP3700 per 36 disks but I feel like I could get by 
with 2… Just looking for community experience :-)

Cheers,
Zoltan

> On 07 Jul 2016, at 10:45, Nick Fisk <n...@fisk.me.uk> wrote:
> 
> Just to add if you really want to go with lots of HDD's to Journals then go
> NVME. They are not a lot more expensive than the equivalent SATA based
> 3700's, but the latency is low low low. Here is an example of a node I have
> just commissioned with 12 HDD's to one P3700
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> sdb   0.00 0.00   68.000.00  8210.00 0.00   241.47
> 0.263.853.850.00   2.09  14.20
> sdd   2.50 0.00  198.50   22.00 24938.00  9422.00   311.66
> 4.34   27.806.21  222.64   2.45  54.00
> sdc   0.00 0.00   63.000.00  7760.00 0.00   246.35
> 0.152.162.160.00   1.56   9.80
> sda   0.00 0.00   61.50   47.00  7600.00 22424.00   553.44
> 2.77   25.572.63   55.57   3.82  41.40
> nvme0n1   0.0022.502.00 2605.00 8.00 139638.00   107.13
> 0.140.050.000.05   0.03   6.60
> sdg   0.00 0.00   61.00   28.00  6230.00 12696.00   425.30
> 3.66   74.795.84  225.00   3.87  34.40
> sdf   0.00 0.00   34.50   47.00  4108.00 21702.00   633.37
> 3.56   43.751.51   74.77   2.85  23.20
> sdh   0.00 0.00   75.00   15.50  9180.00  4984.00   313.02
> 0.45   12.553.28   57.42   3.51  31.80
> sdi   1.50 0.50  142.00   48.50 18102.00 21924.00   420.22
> 3.60   18.924.99   59.71   2.70  51.40
> sdj   0.50 0.00   74.505.00  9362.00  1832.00   281.61
> 0.334.103.33   15.60   2.44  19.40
> sdk   0.00 0.00   54.000.00  6420.00 0.00   237.78
> 0.122.302.300.00   1.70   9.20
> sdl   0.00 0.00   21.001.50  2286.0016.00   204.62
> 0.32   18.13   13.81   78.67   6.67  15.00
> sde   0.00 0.00   98.000.00 12304.00 0.00   251.10
> 0.303.103.100.00   2.08  20.40
> 
> 50us latency at 2605 iops!!!
> 
> Compared to one of the other nodes with 2 100GB S3700's, 6 disks each
> 
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> sda   0.0030.500.00  894.50 0.00 50082.00   111.98
> 0.360.410.000.41   0.20  17.80
> sdb   0.00 9.000.00  551.00 0.00 32044.00   116.31
> 0.230.420.000.42   0.19  10.40
> sdc   0.00 2.006.50   17.50   278.00  8422.00   725.00
> 1.08   44.92   18.46   54.74   8.08  19.40
> sdd   0.00 0.000.000.00 0.00 0.00 0.00
> 0.000.000.000.00   0.00   0.00
> sde   0.00 2.50   27.50   21.50  2112.00  9866.00   488.90
> 0.59   12.046.91   18.60   6.53  32.00
> sdf   0.50 0.00   50.500.00  6170.00 0.00   244.36
> 0.184.634.630.00   2.10  10.60
> md1   0.00 0.000.000.00 0.00 0.00 0.00
> 0.000.000.000.00   0.00   0.00
> md0   0.00 0.000.000.00 0.00 0.00 0.00
> 0.000.000.000.00   0.00   0.00
> sdg   0.00 1.50   32.00  386.50  3970.00 12188.0077.22
> 0.150.350.500.34   0.15   6.40
> sdh   0.00 0.006.000.0034.00 0.0011.33
> 0.07   12.67   12.670.00  11.00   6.60
> sdi   0.00 0.501.50   19.50 6.00  8862.00   844.57
> 0.96   45.71   33.33   46.67   6.57  13.80
> sdj   0.00 0.00   67.000.00  8214.00 0.00   245.19
> 0.172.512.510.00   1.88  12.60
> sdk   1.50 2.50   61.00   48.00  6216.00 21020.00   499.74
> 2.01   18.46   11.41   27.42   5.05  55.00
> sdm   0.00 0.00   30.500.00  3576.00 0.00   234.49
> 0.072.432.430.00   1.90   5.80
> sdl   0.00 4.50   25.00   23.50  2092.00 12648.00   607.84
> 1.36   19.425.60   34.13   4.04  19.60
> sdn   0.50 0.00   23.000.00  2670.00 0.00   232.17
> 0.072.962.960.00   2.43   5.60
> 
> Pretty much 10x the latency. I'm seriously impressed with these NVME things.
> 
> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
>> Christian Balzer
>> Sent: 07 July 2016 03:23
>> To: ceph-users@list

Re: [ceph-users] multiple journals on SSD

2016-07-07 Thread Nick Fisk
Hi Christian,

> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: 07 July 2016 12:57
> To: ceph-users@lists.ceph.com
> Cc: Nick Fisk <n...@fisk.me.uk>
> Subject: Re: [ceph-users] multiple journals on SSD
> 
> 
> Hello Nick,
> 
> On Thu, 7 Jul 2016 09:45:58 +0100 Nick Fisk wrote:
> 
> > Just to add if you really want to go with lots of HDD's to Journals
> > then go NVME. They are not a lot more expensive than the equivalent
> > SATA based 3700's, but the latency is low low low. Here is an example
> > of a node I have just commissioned with 12 HDD's to one P3700
> >
> > Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> > avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> > sdb   0.00 0.00   68.000.00  8210.00 0.00
> > 241.47 0.263.853.850.00   2.09  14.20
> > sdd   2.50 0.00  198.50   22.00 24938.00  9422.00
> > 311.66 4.34   27.806.21  222.64   2.45  54.00
> > sdc   0.00 0.00   63.000.00  7760.00 0.00
> > 246.35 0.152.162.160.00   1.56   9.80
> > sda   0.00 0.00   61.50   47.00  7600.00 22424.00
> > 553.44 2.77   25.572.63   55.57   3.82  41.40
> > nvme0n1   0.0022.502.00 2605.00 8.00 139638.00
> > 107.13 0.140.050.000.05   0.03   6.60
> > sdg   0.00 0.00   61.00   28.00  6230.00 12696.00
> > 425.30 3.66   74.795.84  225.00   3.87  34.40
> > sdf   0.00 0.00   34.50   47.00  4108.00 21702.00
> > 633.37 3.56   43.751.51   74.77   2.85  23.20
> > sdh   0.00 0.00   75.00   15.50  9180.00  4984.00
> > 313.02 0.45   12.553.28   57.42   3.51  31.80
> > sdi   1.50 0.50  142.00   48.50 18102.00 21924.00
> > 420.22 3.60   18.924.99   59.71   2.70  51.40
> > sdj   0.50 0.00   74.505.00  9362.00  1832.00
> > 281.61 0.334.103.33   15.60   2.44  19.40
> > sdk   0.00 0.00   54.000.00  6420.00 0.00
> > 237.78 0.122.302.300.00   1.70   9.20
> > sdl   0.00 0.00   21.001.50  2286.0016.00
> > 204.62 0.32   18.13   13.81   78.67   6.67  15.00
> > sde   0.00 0.00   98.000.00 12304.00 0.00
> > 251.10 0.303.103.100.00   2.08  20.40
> >
> Is that a live sample from iostat or the initial/one-shot summary?

First of all, apologies for the formatting - that looked really ugly above,
fixed now. iostat had been running for a while; I just copied one of the
sections, so it's live.

> 
> > 50us latency at 2605 iops!!!
> >
> At less than 5% IOPS or 14% bandwidth capacity running more than twice as 
> slow than the spec sheet says. ^o^ Fast, very much so.
> But not mindnumbingly so.
> 
> The real question here is, how much of that latency improvement do you see in 
> the Ceph clients, VMs?
> 
> I'd venture not so much, given that most latency happens in Ceph.

Admittedly not much, but it's very hard to tell as it's only 1/5th of the
cluster. Looking at graphs in Graphite, I can see the filestore journal latency
is massively lower. The subop latency is somewhere between 1/2 and 3/4 of the
older nodes'. At higher queue depths the NVMe device is always showing at least
1ms lower latency, so it must be having a positive effect.

My new cluster, which should be going live in a couple of weeks, will be
comprised of just these node types, so I will have a better idea then. Also
they will have 4x3.9GHz CPUs, which go a long way towards reducing latency as
well. I'm aiming for ~1ms at the client for a 4kB write.
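For what it's worth, a simple client-side check of that target can be done with
fio's rbd engine along these lines (pool/image/client names below are
placeholders, not my actual setup):

fio --name=4k-write-lat --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=testimg --rw=randwrite --bs=4k \
    --iodepth=1 --runtime=60 --time_based
# the clat numbers in the output are the per-write latency as seen by the client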

> 
> That all said, I'd go for a similar setup as well, if I had a dozen storage 
> nodes or more.
> But at my current cluster sizes that's too many eggs in one basket for me.

Yeah, I'm only at 5 nodes, but decided that having a cold spare on hand 
justified the risk for the intended use (backups)

> My "largest" cluster is now up to node 5, going from 4 journal SSDs for 8 
> HDDs to 2 journal SSDs for 12 HDDs. Woo-Woo!
> 
> > Compared to one of the other nodes with 2 100GB S3700's, 6 disks each
> >
> Well, that's not really fair, is it?
> 
> Those SSDs have a 5 times lower bandwidth, triple the write latency and the 
> SATA bus instead of the PCIe zipway when compared to
> the smallest P 3700.
> 
> And 6 disk are a bit much for that SSD, 4 would be pushing it.
> Whereas 12 HDDs for the P model are a good match, overkill really.

No, good point, but it demonstrates the change in my mindset from 2 years ago 
that I think most newbies to Ceph also go through. Back then I was like "SSD, 
wow fa

Re: [ceph-users] multiple journals on SSD

2016-07-07 Thread Christian Balzer
10.00   1.88  12.60
> sdk   1.50 2.50   61.00   48.00  6216.00 21020.00
> 499.74 2.01   18.46   11.41   27.42   5.05  55.00
> sdm   0.00 0.00   30.500.00  3576.00 0.00
> 234.49 0.072.432.430.00   1.90   5.80
> sdl   0.00 4.50   25.00   23.50  2092.00 12648.00
> 607.84 1.36   19.425.60   34.13   4.04  19.60
> sdn   0.50 0.00   23.000.00  2670.00 0.00
> 232.17 0.072.962.960.00   2.43   5.60
> 
> Pretty much 10x the latency. I'm seriously impressed with these NVME
> things.
> 
> 
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Christian Balzer
> > Sent: 07 July 2016 03:23
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] multiple journals on SSD
> > 
> > 
> > Hello,
> > 
> > I have a multitude of of problems with the benchmarks and conclusions
> here,
> > more below.
> > 
> > But firstly to address the question of the OP, definitely not
> > filesystem
> based
> > journals.
> > Another layer of overhead and delays, something I'd be willing to
> > ignore
> if
> > we're talking about a full SSD as OSD with an inline journal, but not
> > with journal SSDs.
> > Similar with LVM, though with a lower impact.
> > 
> > Partitions really are your best bet.
> > 
> > On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
> > 
> > > Yes.
> > >
> > > On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
> > > SSDSC2BB800G4 (800G, 9 journals)
> > 
> > First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of
> > good journal device, even if it had the performance.
> > If you search in the ML archives there is at least one case where
> > somebody lost a full storage node precisely because their DC S3500s
> > were worn out:
> > https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28083.html
> > 
> > Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
> > price) would be a better deal, at 50% more endurance and only slightly
> lower
> > sequential write speed.
> > 
> > And depending on your expected write volume (which you should
> > know/estimate as close as possible before buying HW), a 400GB DC S3710
> > might be the best deal when it comes to TBW/$.
> > It's 30% more expensive than your 3510, but has the same speed and an
> > endurance that's 5 times greater.
> > 
> > > during random write I got ~90%
> > > utilization of 9 HDD with ~5% utilization of SSD (2.4k IOPS). With
> > > linear writing it somehow worse: I got 250Mb/s on SSD, which
> > > translated to 240Mb of all OSD combined.
> > >
> > This test shows us a lot of things, mostly the failings of filestore.
> > But only partially if a SSD is a good fit for journals or not.
> > 
> > How are you measuring these things on the storage node, iostat, atop?
> > At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should register
> > about/over 50% utilization, given that its top speed is 460MB/s.
> > 
> > With Intel DC SSDs you can pretty much take the sequential write speed
> from
> > their specifications page and roughly expect that to be the speed of
> > your journal.
> > 
> > For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain
> > SATA HDDs will give us this when running "ceph tell osd.nn bench" in
> > parallel against 2 OSDs that share a journal SSD:
> > ---
> > Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> > avgrq-sz
> avgqu-sz
> > await r_await w_await  svctm  %util
> > sdd   0.00 2.000.00  409.50 0.00 191370.75
> 934.66   146.52  356.46
> > 0.00  356.46   2.44 100.00
> > sdl   0.0085.500.50  120.50 2.00 49614.00
> > 820.10
> 2.25   18.51
> > 0.00   18.59   8.20  99.20
> > sdk   0.0089.501.50  119.00 6.00 49348.00
> > 819.15
> 2.04   16.91
> > 0.00   17.13   8.23  99.20
> > ---
> > 
> > Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
> > And the SSD is nearly at 200MB/s (and 100%).
> > For the record, that bench command is good for testing, but the result:
> > ---
> > # ceph tell osd.30 bench
> > {
> > "bytes_written": 1073741824,
> > "blocksize": 4194304,
> > "bytes_per_sec": 100960114.00
> > }
> > ---
> > should be ta

Re: [ceph-users] multiple journals on SSD

2016-07-07 Thread George Shuklin

There are a few problems I found so far:

1) You cannot alter a partition table while it is in use. That means you
need to stop every ceph-osd that uses a journal on the given device before
changing anything on it. Worse: you can make the change, but you cannot
force the kernel to reread the partition table.
2) I found a udev bug with detection of the 5th and later partitions.
Basically, after you create 4 GPT-based partitions and then create a 5th,
udev does not create /dev/sdx5 (6, 7, and so on).
3) When I tried to automate this process (OSD creation) with Ansible, I
found it very prone to timing errors, like 'partition busy', or 'too many
partitions created in a row and not every one is visible at the next
stage'. Worse: even if I add a blockdev --rereadpt step, it fails with a
'device busy' message. I spent a whole day trying to get it right, but at
the end of the day it still failed about 50% of the time when creating 8+
OSDs in a row. (And I can't do it one by one - see point 1.)


The next day I redid the playbook with LVM. It took just 1 hour (with
debugging) and it works perfectly - not a single race condition. The whole
playbook shrank to about a third of its size; all steps are listed below,
with a rough sketch after the list:


All steps:
- Configure udev to change LV owner to ceph
- Create volume group for journals
- Create logical volumes for journals
- Create data partition
- Create XFS filesystem
- Create directory
- temporary mount
- chown for directory
- Create OSD filesystem
- Create symlink for journal
- Add OSD to ceph
- Add auth in ceph
- unmount temp. mount
- Activate OSD via GPT

And that's all.
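
A rough sketch of the manual equivalent (device names, sizes, paths and the
udev rule are examples only - the real playbook differs):

# udev rule (e.g. /etc/udev/rules.d/90-ceph-journal.rules) so the journal LVs
# are owned by ceph:
#   ENV{DM_LV_NAME}=="journal-*", OWNER="ceph", GROUP="ceph", MODE="0660"

pvcreate /dev/sdb                        # SSD holding the journals
vgcreate journals /dev/sdb
lvcreate -L 20G -n journal-3 journals    # one LV per OSD id
mkfs.xfs /dev/sdc1                       # data partition of the HDD OSD
mkdir -p /var/lib/ceph/osd/ceph-3
mount /dev/sdc1 /var/lib/ceph/osd/ceph-3
chown -R ceph:ceph /var/lib/ceph/osd/ceph-3
ln -s /dev/journals/journal-3 /var/lib/ceph/osd/ceph-3/journal
ceph-osd -i 3 --mkfs --mkkey             # create the OSD filesystem and key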

About the performance impact of LVM: I think it is negligible (as long as
we don't play with copy-on-write snapshots and other strange things). For
HDD OSDs with journals on SSD, the main concern is not IOPS or latency on
the journal (the HDD adds a lot of latency anyway), but throughput. A
single SSD is capable of 300-500MB/s of linear writing, and ~10 HDDs
behind it can take up to 1.5GB/s.


The device mapper is a pretty fast thing if it is just doing remapping.


On 07/07/2016 05:22 AM, Christian Balzer wrote:

Hello,

I have a multitude of problems with the benchmarks and conclusions
here, more below.

But firstly to address the question of the OP, definitely not filesystem
based journals.
Another layer of overhead and delays, something I'd be willing to ignore
if we're talking about a full SSD as OSD with an inline journal, but not
with journal SSDs.
Similar with LVM, though with a lower impact.

Partitions really are your best bet.

On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:


Yes.

On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL
SSDSC2BB800G4 (800G, 9 journals)

First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of good
journal device, even if it had the performance.
If you search in the ML archives there is at least one case where somebody
lost a full storage node precisely because their DC S3500s were worn out:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28083.html

Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
price) would be a better deal, at 50% more endurance and only slightly
lower sequential write speed.

And depending on your expected write volume (which you should
know/estimate as close as possible before buying HW), a 400GB DC S3710
might be the best deal when it comes to TBW/$.
It's 30% more expensive than your 3510, but has the same speed and an
endurance that's 5 times greater.


during random writes I got ~90%
utilization of the 9 HDDs with ~5% utilization of the SSD (2.4k IOPS). With
linear writing it is somewhat worse: I got 250Mb/s on the SSD, which translated
to 240Mb/s across all OSDs combined.


This test shows us a lot of things, mostly the failings of filestore.
But only partially if a SSD is a good fit for journals or not.

How are you measuring these things on the storage node, iostat, atop?
At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should register
about/over 50% utilization, given that its top speed is 460MB/s.

With Intel DC SSDs you can pretty much take the sequential write speed
from their specifications page and roughly expect that to be the speed of
your journal.

For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
HDDs will give us this when running "ceph tell osd.nn bench" in
parallel against 2 OSDs that share a journal SSD:
---
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdd   0.00 2.000.00  409.50 0.00 191370.75   934.66   
146.52  356.460.00  356.46   2.44 100.00
sdl   0.0085.500.50  120.50 2.00 49614.00   820.10 
2.25   18.510.00   18.59   8.20  99.20
sdk   0.0089.501.50  119.00 6.00 49348.00   819.15 
2.04   16.910.00   17.13   8.23  99.20
---

Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
And the SSD is nearly at 200MB/s (and 100%).
For the record, that bench command is good for testing, but the result:
---
# ceph tell osd.30 bench

Re: [ceph-users] multiple journals on SSD

2016-07-07 Thread Nick Fisk
Just to add if you really want to go with lots of HDD's to Journals then go
NVME. They are not a lot more expensive than the equivalent SATA based
3700's, but the latency is low low low. Here is an example of a node I have
just commissioned with 12 HDD's to one P3700

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
sdb   0.00 0.00   68.000.00  8210.00 0.00   241.47
0.263.853.850.00   2.09  14.20
sdd   2.50 0.00  198.50   22.00 24938.00  9422.00   311.66
4.34   27.806.21  222.64   2.45  54.00
sdc   0.00 0.00   63.000.00  7760.00 0.00   246.35
0.152.162.160.00   1.56   9.80
sda   0.00 0.00   61.50   47.00  7600.00 22424.00   553.44
2.77   25.572.63   55.57   3.82  41.40
nvme0n1   0.0022.502.00 2605.00 8.00 139638.00   107.13
0.140.050.000.05   0.03   6.60
sdg   0.00 0.00   61.00   28.00  6230.00 12696.00   425.30
3.66   74.795.84  225.00   3.87  34.40
sdf   0.00 0.00   34.50   47.00  4108.00 21702.00   633.37
3.56   43.751.51   74.77   2.85  23.20
sdh   0.00 0.00   75.00   15.50  9180.00  4984.00   313.02
0.45   12.553.28   57.42   3.51  31.80
sdi   1.50 0.50  142.00   48.50 18102.00 21924.00   420.22
3.60   18.924.99   59.71   2.70  51.40
sdj   0.50 0.00   74.505.00  9362.00  1832.00   281.61
0.334.103.33   15.60   2.44  19.40
sdk   0.00 0.00   54.000.00  6420.00 0.00   237.78
0.122.302.300.00   1.70   9.20
sdl   0.00 0.00   21.001.50  2286.0016.00   204.62
0.32   18.13   13.81   78.67   6.67  15.00
sde   0.00 0.00   98.000.00 12304.00 0.00   251.10
0.303.103.100.00   2.08  20.40

50us latency at 2605 iops!!!

Compared to one of the other nodes with 2 100GB S3700's, 6 disks each

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
sda   0.0030.500.00  894.50 0.00 50082.00   111.98
0.360.410.000.41   0.20  17.80
sdb   0.00 9.000.00  551.00 0.00 32044.00   116.31
0.230.420.000.42   0.19  10.40
sdc   0.00 2.006.50   17.50   278.00  8422.00   725.00
1.08   44.92   18.46   54.74   8.08  19.40
sdd   0.00 0.000.000.00 0.00 0.00 0.00
0.000.000.000.00   0.00   0.00
sde   0.00 2.50   27.50   21.50  2112.00  9866.00   488.90
0.59   12.046.91   18.60   6.53  32.00
sdf   0.50 0.00   50.500.00  6170.00 0.00   244.36
0.184.634.630.00   2.10  10.60
md1   0.00 0.000.000.00 0.00 0.00 0.00
0.000.000.000.00   0.00   0.00
md0   0.00 0.000.000.00 0.00 0.00 0.00
0.000.000.000.00   0.00   0.00
sdg   0.00 1.50   32.00  386.50  3970.00 12188.0077.22
0.150.350.500.34   0.15   6.40
sdh   0.00 0.006.000.0034.00 0.0011.33
0.07   12.67   12.670.00  11.00   6.60
sdi   0.00 0.501.50   19.50 6.00  8862.00   844.57
0.96   45.71   33.33   46.67   6.57  13.80
sdj   0.00 0.00   67.000.00  8214.00 0.00   245.19
0.172.512.510.00   1.88  12.60
sdk   1.50 2.50   61.00   48.00  6216.00 21020.00   499.74
2.01   18.46   11.41   27.42   5.05  55.00
sdm   0.00 0.00   30.500.00  3576.00 0.00   234.49
0.072.432.430.00   1.90   5.80
sdl   0.00 4.50   25.00   23.50  2092.00 12648.00   607.84
1.36   19.425.60   34.13   4.04  19.60
sdn   0.50 0.00   23.000.00  2670.00 0.00   232.17
0.072.962.960.00   2.43   5.60

Pretty much 10x the latency. I'm seriously impressed with these NVME things.


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 07 July 2016 03:23
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] multiple journals on SSD
> 
> 
> Hello,
> 
> I have a multitude of of problems with the benchmarks and conclusions
here,
> more below.
> 
> But firstly to address the question of the OP, definitely not filesystem
based
> journals.
> Another layer of overhead and delays, something I'd be willing to ignore
if
> we're talking about a full SSD as OSD with an inline journal, but not with
> journal SSDs.
> Similar with LVM, though with a lower impact.
> 
> Partitions really are your best bet.
> 
> On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:
> 
> > Yes.
> >
> > On m

Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Tu Holmes
I have 12 journals on 1 SSD, but I wouldn't recommend it if you want any
real performance.

I use it in an archive-type environment.

On Wed, Jul 6, 2016 at 9:01 PM Goncalo Borges 
wrote:

> Hi George...
>
>
> On my latest deployment we have set
>
> # grep journ /etc/ceph/ceph.conf
> osd journal size = 2
>
> and configured the OSDs for each device running 'ceph-disk prepare'
>
> # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type xfs
> /dev/sdd /dev/sdb
> # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type xfs
> /dev/sde /dev/sdb
> # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type xfs
> /dev/sdf /dev/sdb
> # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type xfs
> /dev/sdg /dev/sdb
>
> where sdb is an SSD.  Once the previous commands finish, this is the
> partition layout they create for the journal
>
> # parted -s /dev/sdb p
> Model: DELL PERC H710P (scsi)
> Disk /dev/sdb: 119GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
> Disk Flags:
>
> Number  Start   End SizeFile system  Name  Flags
>  1  1049kB  21.0GB  21.0GB   ceph journal
>  2  21.0GB  41.9GB  21.0GB   ceph journal
>  3  41.9GB  62.9GB  21.0GB   ceph journal
>  4  62.9GB  83.9GB  21.0GB   ceph journal
>
>
> However, never tested more than 4 journals per ssd.
>
> Cheers
> G.
>
>
> On 07/06/2016 10:03 PM, George Shuklin wrote:
>
> Hello.
>
> I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
> I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
> sometimes do not appear after partition creation). And I'm thinking that
> partitions are not that useful for OSD management, because Linux does not
> allow rereading the partition table while it contains volumes that are in use.
>
> So my question: how do you store many journals on an SSD? My initial thoughts:
>
> 1) a filesystem with file-based journals
> 2) LVM with volumes
>
> Anything else? Best practice?
>
> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
>
>
> --
> Goncalo Borges
> Research Computing
> ARC Centre of Excellence for Particle Physics at the Terascale
> School of Physics A28 | University of Sydney, NSW  2006
> T: +61 2 93511937
>


Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Goncalo Borges

Hi George...


On my latest deployment we have set

   # grep journ /etc/ceph/ceph.conf
   osd journal size = 2

and configured the OSDs for each device running 'ceph-disk prepare'

   # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type
   xfs /dev/sdd /dev/sdb
   # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type
   xfs /dev/sde /dev/sdb
   # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type
   xfs /dev/sdf /dev/sdb
   # ceph-disk -v prepare --cluster ceph --cluster-uuid XXX --fs-type
   xfs /dev/sdg /dev/sdb

where sdb is an SSD.  Once the previous commands finish, this is the 
partition layout they create for the journal


   # parted -s /dev/sdb p
   Model: DELL PERC H710P (scsi)
   Disk /dev/sdb: 119GB
   Sector size (logical/physical): 512B/512B
   Partition Table: gpt
   Disk Flags:

   Number  Start   End SizeFile system  Name  Flags
 1  1049kB  21.0GB  21.0GB   ceph journal
 2  21.0GB  41.9GB  21.0GB   ceph journal
 3  41.9GB  62.9GB  21.0GB   ceph journal
 4  62.9GB  83.9GB  21.0GB   ceph journal


However, I have never tested more than 4 journals per SSD.

Cheers
G.

On 07/06/2016 10:03 PM, George Shuklin wrote:

Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
sometimes do not appear after partition creation). And I'm thinking that
partitions are not that useful for OSD management, because Linux does not
allow rereading the partition table while it contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.


--
Goncalo Borges
Research Computing
ARC Centre of Excellence for Particle Physics at the Terascale
School of Physics A28 | University of Sydney, NSW  2006
T: +61 2 93511937



Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Christian Balzer

Hello,

I have a multitude of problems with the benchmarks and conclusions
here, more below.

But firstly to address the question of the OP, definitely not filesystem
based journals. 
Another layer of overhead and delays, something I'd be willing to ignore
if we're talking about a full SSD as OSD with an inline journal, but not
with journal SSDs.
Similar with LVM, though with a lower impact.

Partitions really are your best bet.

On Wed, 6 Jul 2016 18:20:43 +0300 George Shuklin wrote:

> Yes.
> 
> On my lab (not production yet) with 9 7200 SATA (OSD) and one INTEL 
> SSDSC2BB800G4 (800G, 9 journals) 

First and foremost, a DC 3510 with 1 DWPD endurance is not my idea of good
journal device, even if it had the performance.
If you search in the ML archives there is at least one case where somebody
lost a full storage node precisely because their DC S3500s were worn out:
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28083.html

Unless you have a read-mostly cluster, a 400GB DC S3610 (same or lower
price) would be a better deal, at 50% more endurance and only slightly
lower sequential write speed.

And depending on your expected write volume (which you should
know/estimate as close as possible before buying HW), a 400GB DC S3710
might be the best deal when it comes to TBW/$.
It's 30% more expensive than your 3510, but has the same speed and an
endurance that's 5 times greater.

> during random writes I got ~90%
> utilization of the 9 HDDs with ~5% utilization of the SSD (2.4k IOPS). With
> linear writing it is somewhat worse: I got 250Mb/s on the SSD, which
> translated to 240Mb/s across all OSDs combined.
> 
This test shows us a lot of things, mostly the failings of filestore.
But only partially if a SSD is a good fit for journals or not.

How are you measuring these things on the storage node, iostat, atop?
At 250MB/s (Mb would be mega-bit) your 800 GB DC S3500 should register
about/over 50% utilization, given that its top speed is 460MB/s.

With Intel DC SSDs you can pretty much take the sequential write speed
from their specifications page and roughly expect that to be the speed of
your journal.

For example a 100GB DC S3700 (200MB/s) doing journaling for 2 plain SATA
HDDs will give us this when running "ceph tell osd.nn bench" in
parallel against 2 OSDs that share a journal SSD:
---
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sdd   0.00 2.000.00  409.50 0.00 191370.75   934.66   
146.52  356.460.00  356.46   2.44 100.00
sdl   0.0085.500.50  120.50 2.00 49614.00   820.10 
2.25   18.510.00   18.59   8.20  99.20
sdk   0.0089.501.50  119.00 6.00 49348.00   819.15 
2.04   16.910.00   17.13   8.23  99.20
---

Where sdd is the journal SSD and sdl/sdk are the OSD HDDs.
And the SSD is nearly at 200MB/s (and 100%).
For the record, that bench command is good for testing, but the result:
---
# ceph tell osd.30 bench 
{
"bytes_written": 1073741824,
"blocksize": 4194304,
"bytes_per_sec": 100960114.00
}
---
should be taken with a grain of salt, realistically those OSDs can do
about 50MB/s sustained.

On another cluster I have 200GB DC S3700s (360MB/s), holding 3 journals
for 4 disk RAID10 (4GB HW cache Areca controller) OSDs.
Thus the results are more impressive:
---
Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz 
avgqu-sz   await r_await w_await  svctm  %util
sda   0.00   381.000.00  485.00 0.00 200374.00   826.28 
3.166.490.006.49   1.53  74.20
sdb   0.00   350.501.00  429.00 4.00 177692.00   826.49 
2.786.464.006.46   1.53  65.60
sdg   0.00 1.000.00  795.00 0.00 375514.50   944.69   
143.68  180.430.00  180.43   1.26 100.00
---

Where sda/sdb are the OSD RAIDs and sdg is the journal SSD.
Again, a near perfect match to the Intel specifications and also an
example where the journal is the bottleneck (never mind that his cluster
is all about IOPS, not throughput).

As for the endurance mentioned above, these 200GB DC 3700s are/were
overkill:
---
233 Media_Wearout_Indicator 0x0032   098   098   000Old_age   Always   
-   0
241 Host_Writes_32MiB   0x0032   100   100   000Old_age   Always   
-   4818100
242 Host_Reads_32MiB0x0032   100   100   000Old_age   Always   
-   84403
---

Again, this cluster is all about (small) IOPS, it only sees about 5MB/s
sustained I/O. 
So a 3610 might been a better fit, but not only didn't they exist back
then, it would have to be the 400GB model to match the speed, which is
more expensive.
A DC S3510 would be down 20% in terms of wearout (assuming same size) and
of course significantly slower.
With a 480GB 3510 (similar speed) it would still be about 10% worn out and
thus still no match for the expected life time of this cluster.

The 

Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread George Shuklin

Yes.

In my lab (not production yet), with 9 7200 RPM SATA drives (OSDs) and one
INTEL SSDSC2BB800G4 (800G, 9 journals), during random writes I got ~90%
utilization of the 9 HDDs with ~5% utilization of the SSD (2.4k IOPS). With
linear writing it is somewhat worse: I got 250Mb/s on the SSD, which
translated to 240Mb/s across all OSDs combined.


Obviously, it sucked with cold randread too (as expected).

Just for comparison, my baseline benchmark (fio/librbd, 4k, iodepth=32,
randwrite) for a single OSD in a pool with size=1:


Intel 53x and Pro 2500 Series SSDs - 600 IOPS
Intel 730 and DC S35x0/3610/3700 Series SSDs - 6605 IOPS
Samsung SSD 840 Series - 739 IOPS
EDGE Boost Pro Plus 7mm - 1000 IOPS

(so the 3500 is the clear winner)
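
For reference, a fio/librbd run matching the parameters above would look
something like this (pool and image names are placeholders, not the exact
commands used):

# pool with replication size=1 for the baseline
ceph osd pool create bench-size1 128 128
ceph osd pool set bench-size1 size 1
rbd create bench-size1/bench --size 10240

fio --name=osd-baseline --ioengine=rbd --clientname=admin \
    --pool=bench-size1 --rbdname=bench --rw=randwrite --bs=4k \
    --iodepth=32 --runtime=60 --time_based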

On 07/06/2016 03:22 PM, Alwin Antreich wrote:

Hi George,

Interesting result for your benchmark. Could you please supply some more
numbers? We didn't get as good a result in our tests.

Thanks.

Cheers,
Alwin


On 07/06/2016 02:03 PM, George Shuklin wrote:

Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
sometimes do not appear after partition creation). And I'm thinking that
partitions are not that useful for OSD management, because Linux does not
allow rereading the partition table while it contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.


Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Iban Cabrillo
Hi George,

  We have several journal partitions on our SSDs too, using the ceph-deploy
utility (as Dan mentioned before); I think it is the best way:

 ceph-deploy osd create HOST:DISK[:JOURNAL] [HOST:DISK[:JOURNAL]...]

  where JOURNAL is the path to the journal disk (not to a partition):

  ceph-deploy osd create osdserver01:sda:/dev/sdm osdserver01:sdb:/dev/sdm
  # this will create sda1 (with journal sdm1) and sdb1 (with journal sdm2)

  Be sure the disks are empty, using "ceph-deploy osd list" and "ceph-deploy
osd zap".

regards, I

2016-07-06 14:11 GMT+02:00 Dan van der Ster :

> We have 5 journal partitions per SSD. Works fine (on el6 and el7).
>
> Best practice is to use ceph-disk:
>
>   ceph-disk prepare /dev/sde /dev/sdc # where e is the osd, c is an SSD.
>
> -- Dan
>
>
> On Wed, Jul 6, 2016 at 2:03 PM, George Shuklin 
> wrote:
> > Hello.
> >
> > I've been testing an Intel 3500 as a journal store for a few HDD-based
> > OSDs. I stumbled on issues with multiple partitions (>4) and udev (sda5,
> > sda6, etc. sometimes do not appear after partition creation). And I'm
> > thinking that partitions are not that useful for OSD management, because
> > Linux does not allow rereading the partition table while it contains
> > volumes that are in use.
> >
> > So my question: how do you store many journals on an SSD? My initial thoughts:
> >
> > 1) a filesystem with file-based journals
> > 2) LVM with volumes
> >
> > Anything else? Best practice?
> >
> > P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.
>



-- 

Iban Cabrillo Bartolome
Instituto de Fisica de Cantabria (IFCA)
Santander, Spain
Tel: +34942200969
PGP PUBLIC KEY:
http://pgp.mit.edu/pks/lookup?op=get=0xD9DF0B3D6C8C08AC

Bertrand Russell:
*"El problema con el mundo es que los estúpidos están seguros de todo y los
inteligentes están llenos de dudas*"


Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Alwin Antreich
Hi George,

Interesting result for your benchmark. Could you please supply some more
numbers? We didn't get as good a result in our tests.

Thanks.

Cheers,
Alwin


On 07/06/2016 02:03 PM, George Shuklin wrote:
> Hello.
> 
> I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
> I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
> sometimes do not appear after partition creation). And I'm thinking that
> partitions are not that useful for OSD management, because Linux does not
> allow rereading the partition table while it contains volumes that are in use.
> 
> So my question: how do you store many journals on an SSD? My initial thoughts:
> 
> 1) a filesystem with file-based journals
> 2) LVM with volumes
> 
> Anything else? Best practice?
> 
> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.


Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Dan van der Ster
We have 5 journal partitions per SSD. Works fine (on el6 and el7).

Best practice is to use ceph-disk:

  ceph-disk prepare /dev/sde /dev/sdc # where e is the osd, c is an SSD.

-- Dan


On Wed, Jul 6, 2016 at 2:03 PM, George Shuklin  wrote:
> Hello.
>
> I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
> I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
> sometimes do not appear after partition creation). And I'm thinking that
> partitions are not that useful for OSD management, because Linux does not
> allow rereading the partition table while it contains volumes that are in use.
>
> So my question: how do you store many journals on an SSD? My initial thoughts:
>
> 1) a filesystem with file-based journals
> 2) LVM with volumes
>
> Anything else? Best practice?
>
> P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.


Re: [ceph-users] multiple journals on SSD

2016-07-06 Thread Tuomas Juntunen
Hi

Maybe the easiest way would be to just create files on the SSD and use those
as journals. I don't know if this creates too much overhead, but at least it
would be simple.
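
A minimal sketch of that, assuming an XFS filesystem on the SSD (paths are
just examples; note that other replies in this thread argue against the
extra filesystem layer):

mkfs.xfs /dev/sdb
mkdir -p /srv/journals && mount /dev/sdb /srv/journals
mkdir -p /srv/journals/osd-0

# in ceph.conf:
#   [osd]
#   osd journal = /srv/journals/osd-$id/journal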

Br,
T



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
George Shuklin
Sent: 6. heinäkuuta 2016 15:04
To: ceph-users@lists.ceph.com
Subject: [ceph-users] multiple journals on SSD

Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
sometimes do not appear after partition creation). And I'm thinking that
partitions are not that useful for OSD management, because Linux does not
allow rereading the partition table while it contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.


[ceph-users] multiple journals on SSD

2016-07-06 Thread George Shuklin

Hello.

I've been testing an Intel 3500 as a journal store for a few HDD-based OSDs.
I stumbled on issues with multiple partitions (>4) and udev (sda5, sda6, etc.
sometimes do not appear after partition creation). And I'm thinking that
partitions are not that useful for OSD management, because Linux does not
allow rereading the partition table while it contains volumes that are in use.

So my question: how do you store many journals on an SSD? My initial thoughts:

1) a filesystem with file-based journals
2) LVM with volumes

Anything else? Best practice?

P.S. I've done benchmarking: the 3500 can support up to 16 10k-RPM HDDs.