Re: [ceph-users] Ceph RBD latencies

2016-03-06 Thread Christian Balzer

Hello,

On Mon, 7 Mar 2016 00:38:46 + Adrian Saul wrote:

> > >The Samsungs are the 850 2TB
> > > (MZ-75E2T0BW).  Chosen primarily on price.
> >
> > These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5
> > years). Unless you have a read-only cluster, you will wind up spending
> > MORE on replacing them (and/or losing data when 2 fail at the same
> > time) than going with something more sensible like Samsung's DC models
> > or the Intel DC ones (S3610s come to mind for "normal" use).
> > See also the current "List of SSDs" thread in this ML.
> 
> This was a metric I struggled to find and would have been useful in
> comparison.  I am sourcing prices on the SM863s anyway.  That SSD thread
> has been good to follow as well.
> 
Yeah, they are most likely a better fit and if they are doing OK with sync
writes you could most likely get away with having their journals on the
same SSD.

> > Fast, reliable, cheap. Pick any 2.
> 
> Yup - unfortunately cheap is fixed, reliable is the reason we are doing
> this, however fast is now a must-have.  The normal
> engineering/management dilemma.
> 
Indeed.

> > On your test setup or even better the Solaris one, have a look at
> > their media wearout, or  Wear_Leveling_Count as Samsung calls it.
> > I bet that makes for some scary reading.
> 
> For the Evos we found no tools we could use on Solaris - also, because we
> have cheap, nasty SAS interposers in that setup, most tools don't work
> anyway.  Until we pull a disk and put it into a Windows box we can't do
> any sort of diagnostics on it.  It would be useful to see because we
> have those disks taking a fair brunt of our performance workload now.
> 
smartmontools aka smartctl not working for you, presumably because of the
intermediate SAS shenanigans?
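
If the interposers pass ATA commands through at all, forcing the SAT
pass-through is worth a try, something along these lines (device name is a
placeholder, no guarantee the interposer cooperates):

smartctl -a -d sat /dev/sdX        # force SCSI-to-ATA translation
smartctl -a -d sat,12 /dev/sdX     # fall back to the 12-byte pass-through
smartctl -A -d sat /dev/sdX | grep -i wear   # just the wear levelling attribute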

> > Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for
> > "long" distance replication due to the incurred latencies.
> >
> > That's unless your replication is happening "above" Ceph in the iSCSI
> > bits with something that's more optimized for this.
> >
> > Something along the lines of the DRBD proxy has been suggested for
> > Ceph, but from what I gather it is a backburner project at best, if
> > that.
> 
> We can fairly easily do low latency links (telco) but are looking at the
> architecture to try and limit that sort of long replication - doing
> replication at application and database levels instead.  The site to
> site replication would be limited to some clusters or applications that
> need sync replication for availability.
> 
Yeah, I figured the Telco part, but for our planned DC move I ran
some numbers and definitely want to stay below 10km between them
(Infiniband here).

Note that you can of course create CRUSH rules that will give you either
location replicated or only locally replicated OSDs and thus pools, but it
may be a bit daunting at first.
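
The usual workflow is to pull the CRUSH map, add a rule per locality, push it
back and point a pool at it; a rough sketch (the site-a bucket name is made
up, adjust it to your own hierarchy):

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
# add something like this to crushmap.txt for a "local only" pool:
#   rule site-a-only {
#       ruleset 2
#       type replicated
#       min_size 1
#       max_size 10
#       step take site-a
#       step chooseleaf firstn 0 type host
#       step emit
#   }
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new
ceph osd pool set <pool> crush_ruleset 2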

> > There are some ways around this, which may or may not be suitable for
> > your use case.
> > EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
> > Of course this comes at a performance penalty, which you can offset
> > again with for example fast RAID controllers with HW cache to some
> > extent. But it may well turn out to be a zero-sum game.
> 
> I modelled an EC setup but that was at a multi-site level with local
> cache tiers in front, and it was going to be too big a challenge to do
> as a new untested platform with too many latency questions.  Within a
> site EC was not going to be cost effective, as to do it properly I would
> need to up the number of hosts and that pushed the pricing up too far,
> even if I went with smaller, less configured hosts.
> 
Yes, the per-node basic cost can be an issue, but Ceph really likes many
smallish things over a few largish ones for the same total capacity.
 
> I thought about hardware RAID as well, but as I would need to do host
> level redundancy anyway it was not gaining any efficiency - less risk
> but I would still need to replicate anyway so why not just go disk to
> disk.  More than likely I would quietly work in higher protection as we
> go live and deal with it later as a capacity expansion.
> 
The latter sounds like a plan.
For the former consider this simple example:
4 storage nodes, each with 4 RAID6 OSDs, Ceph size=2 and min_size=1,
mon_osd_down_out_subtree_limit = host.

In this scenario you can lose any 2 disks w/o an OSD going down, up to 4
disks w/o data loss, and a whole node as well w/o the cluster stopping.
The mon_osd_down_out_subtree_limit will also stop things from rebalancing
in case of a node crash/reboot, until you decide so otherwise manually.
The idea here is that it's likely a lot quicker to get a node back up than
to reshuffle all that data.
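
For reference, that scenario amounts to little more than this (pool name
invented):

ceph osd pool set rbd-sata size 2
ceph osd pool set rbd-sata min_size 1
# plus in ceph.conf on the mons:
[mon]
mon osd down out subtree limit = host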

With normal, size 2 replication and single-disk OSDs, any
simultaneous/overlapping loss of 2 disks is going to lose you data,
potentially affecting many if not ALL of your VM images.

There have been a lot of discussions about reliability with various
replication 

Re: [ceph-users] Ceph RBD latencies

2016-03-06 Thread Adrian Saul
> >The Samsungs are the 850 2TB
> > (MZ-75E2T0BW).  Chosen primarily on price.
>
> These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5 years).
> Unless you have a read-only cluster, you will wind up spending MORE on
> replacing them (and/or losing data when 2 fail at the same time) than going
> with something more sensible like Samsung's DC models or the Intel DC ones
> (S3610s come to mind for "normal" use).
> See also the current "List of SSDs" thread in this ML.

This was a metric I struggled to find and would have been useful in comparison. 
 I am sourcing prices on the SM863s anyway.  That SSD thread has been good to 
follow as well.

> Fast, reliable, cheap. Pick any 2.

Yup - unfortunately cheap is fixed, reliable is the reason we are doing this, 
however fast is now a must-have.  The normal engineering/management dilemma.

> On your test setup or even better the Solaris one, have a look at their media
> wearout, or  Wear_Leveling_Count as Samsung calls it.
> I bet that makes for some scary reading.

For the Evos we found no tools we could use on Solaris - also, because we have 
cheap, nasty SAS interposers in that setup, most tools don't work anyway.  Until 
we pull a disk and put it into a Windows box we can't do any sort of 
diagnostics on it.  It would be useful to see because we have those disks 
taking a fair brunt of our performance workload now.

> Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for "long"
> distance replication due to the incurred latencies.
>
> That's unless your replication is happening "above" Ceph in the iSCSI bits
> with something that's more optimized for this.
>
> Something along the lines of the DRBD proxy has been suggested for Ceph,
> but from what I gather it is a backburner project at best, if that.

We can fairly easily do low latency links (telco) but are looking at the 
architecture to try and limit that sort of long replication - doing replication 
at application and database levels instead.  The site to site replication would 
be limited to some clusters or applications that need sync replication for 
availability.

> There are some ways around this, which may or may not be suitable for your
> use case.
> EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
> Of course this comes at a performance penalty, which you can offset again
> with for example fast RAID controllers with HW cache to some extent.
> But it may well turn out to be a zero-sum game.

I modelled an EC setup but that was at a multi-site level with local cache 
tiers in front, and it was going to be too big a challenge to do as a new 
untested platform with too many latency questions.  Within a site EC was not 
going to be cost effective, as to do it properly I would need to up the number 
of hosts and that pushed the pricing up too far, even if I went with smaller, 
less configured hosts.

I thought about hardware RAID as well, but as I would need to do host level 
redundancy anyway it was not gaining any efficiency - less risk but I would 
still need to replicate anyway so why not just go disk to disk.  More than 
likely I would quietly work in higher protection as we go live and deal with it 
later as a capacity expansion.

> Another thing is to use a cache pool (with top of the line SSDs), this is of
> course only a sensible course of action if your hot objects will fit in there.
> In my case they do (about 10-20% of the 2.4TB raw pool capacity) and
> everything is as fast as can be expected and the VMs (their time
> critical/sensitive application to be precise) are happy campers.

This is the model I am working to - our "fast" workloads using SSD caches  in 
front of bulk SATA, sizing the SSDs at around 25% of the capacity we require 
for "fast" storage.

For the "bulk" storage I would still use the SSD cache but sized to 10% of the 
SATA usable capacity.   I figure once we get live we can adjust numbers as 
required - expand with more cache hosts if needed.

> There's a counter in Ceph (counter-filestore_journal_bytes) that you can
> graph for journal usage.
> The highest I have ever seen is about 100MB for HDD based OSDs, less than
> 8MB for SSD based ones with default(ish) Ceph parameters.
>
> Since you seem to have experience with ZFS (I don't really, but I read a lot
> ^o^), consider the Ceph journal equivalent to the ZIL.
> It is a write-only journal; it never gets read from unless there is a crash.
> That is why sequential, sync write speed is the utmost criterion for a Ceph
> journal device.
>
> If I recall correctly you were testing with 4MB block streams, thus pretty
> much filling the pipe to capacity, atop on your storage nodes will give a good
> insight.
>
> The journal is great to cover some bursts, but the Ceph OSD is flushing things
> from RAM to the backing storage on configurable time limits and once these
> are exceeded and/or you run out of RAM (pagecache), you are limited to what
> your backing storage can sustain.
>
> Now in real 

Re: [ceph-users] Ceph RBD latencies

2016-03-04 Thread Christian Balzer

Hello,

On Thu, 3 Mar 2016 23:26:13 + Adrian Saul wrote:

> 
> > Samsung EVO...
> > Which exact model, I presume this is not a DC one?
> >
> > If you had put your journals on those, you would already be pulling
> > your hairs out due to abysmal performance.
> >
> > Also with Evo ones, I'd be worried about endurance.
> 
> No,  I am using the P3700DCs for journals.  

Yup, that's why I wrote "If you had...". ^o^

>The Samsungs are the 850 2TB
> (MZ-75E2T0BW).  Chosen primarily on price.  

These are spec'ed at 150TBW, or an amazingly low 0.04 DWPD (over 5 years).
Unless you have a read-only cluster, you will wind up spending MORE on
replacing them (and/or losing data when 2 fail at the same time) than
going with something more sensible like Samsung's DC models or the Intel
DC ones (S3610s come to mind for "normal" use). 
See also the current "List of SSDs" thread in this ML.

>We already built a system
> using the 1TB models with Solaris+ZFS and I have little faith in them.
> Certainly their write performance is erratic and not ideal.  We have
> other vendor options which are what they call "Enterprise Value" SSDs,
> but still 4x the price.   I would prefer a higher grade drive but
> unfortunately cost is being driven from above me.
>
Fast, reliable, cheap. Pick any 2. 

On your test setup or even better the Solaris one, have a look at their
media wearout, or  Wear_Leveling_Count as Samsung calls it.
I bet that makes for some scary reading.

> > > On the ceph side each disk in the OSD servers are setup as an
> > > individual OSD, with a 12G journal created on the flash mirror.   I
> > > setup the SSD servers into one root, and the SATA servers into
> > > another and created pools using hosts as fault boundaries, with the
> > > pools set for 2 copies.
> > Risky. If you have very reliable and well monitored SSDs you can get
> > away with 2 (I do so), but with HDDs and the combination of their
> > reliability and recovery time it's asking for trouble.
> > I realize that this is testbed, but if your production has a
> > replication of 3 you will be disappointed by the additional latency.
> 
> Again, cost - the end goal will be we build metro based dual site pools
> which will be 2+2 replication.  
Note that Ceph (RBD/RADOS to be precise) isn't particularly suited for
"long" distance replication due to the incurred latencies. 

That's unless your replication is happening "above" Ceph in the iSCSI bits
with something that's more optimized for this. 

Something along the lines of the DRBD proxy has been suggested for Ceph,
but from what I gather it is a backburner project at best, if that.


> I am aware of the risks but already
> presenting numbers based on buying 4x the disk we are able to use gets
> questioned hard.
> 
There are some ways around this, which may or may not be suitable for your
use case.
EC pools (or RAID'ed OSDs, which I prefer) for HDD based pools.
Of course this comes at a performance penalty, which you can offset again
with for example fast RAID controllers with HW cache to some extent.
But it may well turn out to be a zero-sum game.

Another thing is to use a cache pool (with top of the line SSDs), this is
of course only a sensible course of action if your hot objects will fit in
there.
In my case they do (about 10-20% of the 2.4TB raw pool capacity) and
everything is as fast as can be expected and the VMs (their time
critical/sensitive application to be precise) are happy campers.

> > This smells like garbage collection on your SSDs, especially since it
> > matches time wise what you saw on them below.
> 
> I concur.   I am just not sure why that impacts back to the client when
> from the client perspective the journal should hide this.   If the
> journal is struggling to keep up and has to flush constantly then
> perhaps, but  on the current steady state IO rate I am testing with I
> don't think the journal should be that saturated.
>
There's a counter in Ceph (counter-filestore_journal_bytes) that you can
graph for journal usage. 
The highest I have ever seen is about 100MB for HDD based OSDs, less than
8MB for SSD based ones with default(ish) Ceph parameters. 
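
You can pull it straight off the OSD admin socket, e.g. (OSD id is a
placeholder, exact counter names vary a bit between versions):

ceph daemon osd.0 perf dump | grep -i journal
# or via the socket path directly:
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perf dump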

Since you seem to have experience with ZFS (I don't really, but I read
a lot ^o^), consider the Ceph journal equivalent to the ZIL.
It is a write-only journal; it never gets read from unless there is a
crash.
That is why sequential, sync write speed is the utmost criterion for a Ceph
journal device.
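
That's also what you'd measure when qualifying a journal device, e.g. a
single-threaded O_DSYNC write test with fio (the file path is just a
placeholder, don't point it at anything holding data):

fio --name=journal-test --filename=/srv/test/journal.bin --size=1G \
    --rw=write --bs=4k --direct=1 --sync=1 --numjobs=1 --iodepth=1 \
    --runtime=60 --time_based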

If I recall correctly you were testing with 4MB block streams, thus pretty
much filling the pipe to capacity, atop on your storage nodes will give a
good insight. 

The journal is great to cover some bursts, but the Ceph OSD is flushing
things from RAM to the backing storage on configurable time limits and
once these are exceeded and/or you run out of RAM (pagecache), you are
limited to what your backing storage can sustain.

Now in real life, you would want a cluster and especially OSDs that are
lightly to medium loaded on average and 

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Adrian Saul

> Samsung EVO...
> Which exact model, I presume this is not a DC one?
>
> If you had put your journals on those, you would already be pulling your hairs
> out due to abysmal performance.
>
> Also with Evo ones, I'd be worried about endurance.

No,  I am using the P3700DCs for journals.  The Samsungs are the 850 2TB 
(MZ-75E2T0BW).  Chosen primarily on price.  We already built a system using the 
1TB models with Solaris+ZFS and I have little faith in them.  Certainly their 
write performance is erratic and not ideal.  We have other vendor options which 
are what they call "Enterprise Value" SSDs, but still 4x the price.   I would 
prefer a higher grade drive but unfortunately cost is being driven from above 
me.

> > On the ceph side each disk in the OSD servers are setup as an individual
> > OSD, with a 12G journal created on the flash mirror.   I setup the SSD
> > servers into one root, and the SATA servers into another and created
> > pools using hosts as fault boundaries, with the pools set for 2
> > copies.
> Risky. If you have very reliable and well monitored SSDs you can get away
> with 2 (I do so), but with HDDs and the combination of their reliability and
> recovery time it's asking for trouble.
> I realize that this is a testbed, but if your production has a replication
> of 3 you will be disappointed by the additional latency.

Again, cost - the end goal will be we build metro based dual site pools which 
will be 2+2 replication.  I am aware of the risks but already presenting 
numbers based on buying 4x the disk we are able to use gets questioned hard.

> This smells like garbage collection on your SSDs, especially since it matches
> time wise what you saw on them below.

I concur.   I am just not sure why that impacts back to the client when from 
the client perspective the journal should hide this.   If the journal is 
struggling to keep up and has to flush constantly then perhaps, but  on the 
current steady state IO rate I am testing with I don't think the journal should 
be that saturated.

> Have you tried the HDD based pool and did you see similar, consistent
> interval, spikes?

To be honest I have been focusing on the SSD numbers but that would be a good 
comparison.

> Or alternatively, configured 2 of your NVMEs as OSDs?

That was what I was thinking of doing - move the NVMEs to the frontends, make 
them OSDs and configure them as a read-forward cache tier for the other pools, 
and just have the SSDs and SATA journal by default on a first partition.
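
Roughly what I have in mind, pool names invented and exact flags may differ a
bit by release:

ceph osd tier add rbd-sata nvme-cache
ceph osd tier cache-mode nvme-cache readforward
ceph osd tier set-overlay rbd-sata nvme-cache
ceph osd pool set nvme-cache hit_set_type bloom
ceph osd pool set nvme-cache target_max_bytes 400000000000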

> No, not really. The journal can only buffer so much.
> There are several threads about this in the archives.
>
> You could tune it but that will only go so far if your backing storage
> can't keep up.
>
> Regards,
>
> Christian


Agreed - Thanks for your help.


Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Nick Fisk
You can also dump the historic ops from the OSD admin socket. It will give a 
brief overview of each step and how long each one is taking.
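
Something like this on one of the OSD nodes (the OSD id is a placeholder):

ceph daemon osd.3 dump_historic_ops
# each op lists per-step timestamps (queued_for_pg, reached_pg,
# commit_sent, done, ...) so you can see where the time goes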

But generally what you are seeing is not unusual. Currently the best case for 
an RBD on a replicated pool will be somewhere between 200-500 IOPS. The Ceph 
code path is a lot more complex than a 30cm SAS cable.

CPU speed (i.e. GHz, not cores) is a large factor in write latency. You may find 
that you can improve performance by setting the max C-state to 1 and enabling 
idle=poll, which stops the cores entering power-saving states. I found that on 
systems with a large number of cores, unless you drive the whole box really 
hard, a lot of the cores clock themselves down, which hurts latency.
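
On CentOS 7 that boils down to a couple of kernel command line parameters,
roughly (then rebuild the grub config and reboot):

# appended to GRUB_CMDLINE_LINUX in /etc/default/grub:
intel_idle.max_cstate=1 processor.max_cstate=1 idle=poll
# then:
grub2-mkconfig -o /boot/grub2/grub.cfg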

Also disable all logging in your ceph.conf; this can have quite a big effect 
as well.
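
For example (a partial list, there are more debug subsystems):

[global]
debug ms = 0/0
debug osd = 0/0
debug filestore = 0/0
debug journal = 0/0
debug monc = 0/0
debug auth = 0/0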


> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Jan Schermer
> Sent: 03 March 2016 14:38
> To: RDS <rs3...@me.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Ceph RBD latencies
> 
> I think the latency comes from journal flushing
> 
> Try tuning
> 
> filestore min sync interval = .1
> filestore max sync interval = 5
> 
> and also
> /proc/sys/vm/dirty_bytes (I suggest 512MB)
> /proc/sys/vm/dirty_background_bytes (I suggest 256MB)
> 
> See if that helps
> 
> It would be useful to see the job you are running to know what exactly it
> does. I'm afraid your latency is not really that bad; it will scale
> horizontally (with number of clients) rather than vertically (higher IOPS
> for single blocking writes), and there's not much that can be done about
> that.
> 
> 
> > On 03 Mar 2016, at 14:33, RDS <rs3...@me.com> wrote:
> >
> > A couple of suggestions:
> > 1)   # of pgs per OSD should be 100-200
> > 2)  When dealing with SSD or Flash, performance of these devices hinge on
> how you partition them and how you tune linux:
> > a)   if using partitions, did you align the partitions on a 4k 
> > boundary? I
> start at sector 2048 using either fdisk or sfdisk
> 
> On SSD you should align at 8MB boundary (usually the erase block is quite
> large, though it doesn't matter that much), and the write block size is 
> actually
> something like 128k
> 2048 aligns at 1MB which is completely fine
> 
> > b)   There are quite a few Linux settings that benefit SSD/Flash and
> they are: Deadline io scheduler only when using the deadline associated
> settings, up  QDepth to 512 or 1024, set rq_affinity=2 if OS allows it, 
> setting
> read ahead if doing majority of reads, and other
> 
> those don't matter that much, higher queue depths mean larger throughput
> but at the expense of latency, the default are usually fine
> 
> > 3)   mount options:  noatime, delaylog,inode64,noquota, etc…
> 
> defaults work fine (noatime is a relic, relatime is what filesystems use by
> default nowadays)
> 
> >
> > I have written some papers/blogs on this subject if you are interested in
> seeing them.
> > Rick
> >> On Mar 3, 2016, at 2:41 AM, Adrian Saul
> <adrian.s...@tpgtelecom.com.au> wrote:
> >>
> >> Hi Ceph-users,
> >>
> >> TL;DR - I can't seem to pin down why an unloaded system with flash based
> OSD journals has higher than desired write latencies for RBD devices.  Any
> ideas?
> >>
> >>
> >> I am developing a storage system based on Ceph and an SCST+pacemaker
> cluster.   Our initial testing showed promising results even with mixed
> available hardware and we proceeded to order a more designed platform for
> developing into production.   The hardware is:
> >>
> >> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients
> using RBD - they present iSCSI to other systems).
> >> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB
> >> Samsung Evo SSDs each
> >> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
> >>
> >> As part of the research and planning we opted to put a pair of Intel
> PC3700DC 400G NVME cards in each OSD server.  These are configured
> mirrored and setup as the journals for the OSD disks, the aim being to
> improve write latencies.  All the machines have 128G RAM and dual E5-
> 2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of
> switches.   All machines are running Centos 7, with the frontends using the
> 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
> >>
> >> On the ceph side each disk in the OSD servers are setup as an individual
> OSD, with a 12G journal created on the flash mirror.   I setup the SSD servers
>

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Jan Schermer
I think the latency comes from journal flushing

Try tuning

filestore min sync interval = .1
filestore max sync interval = 5

and also
/proc/sys/vm/dirty_bytes (I suggest 512MB)
/proc/sys/vm/dirty_background_bytes (I suggest 256MB)

See if that helps
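
To apply without restarting anything, something along these lines should do
(same values as above):

sysctl -w vm.dirty_bytes=536870912
sysctl -w vm.dirty_background_bytes=268435456
ceph tell osd.* injectargs '--filestore_min_sync_interval 0.1 --filestore_max_sync_interval 5'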

It would be useful to see the job you are running to know what exactly it does. 
I'm afraid your latency is not really that bad; it will scale horizontally 
(with number of clients) rather than vertically (higher IOPS for single 
blocking writes), and there's not much that can be done about that.


> On 03 Mar 2016, at 14:33, RDS  wrote:
> 
> A couple of suggestions:
> 1)   # of pgs per OSD should be 100-200
> 2)  When dealing with SSD or Flash, performance of these devices hinge on how 
> you partition them and how you tune linux:
>   a)   if using partitions, did you align the partitions on a 4k 
> boundary? I start at sector 2048 using either fdisk or sfdisk

On SSDs you should align at an 8MB boundary (usually the erase block is quite 
large, though it doesn't matter that much), and the write block size is 
actually something like 128k.
Sector 2048 aligns at 1MB, which is completely fine.
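
For reference, both of these end up 1MiB-aligned out of the box (device name
is a placeholder):

sgdisk --new=1:2048:0 /dev/sdX
parted -a optimal /dev/sdX mkpart primary 1MiB 100%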

>   b)   There are quite a few Linux settings that benefit SSD/Flash and 
> they are: Deadline io scheduler only when using the deadline associated 
> settings, up  QDepth to 512 or 1024, set rq_affinity=2 if OS allows it, 
> setting read ahead if doing majority of reads, and other

Those don't matter that much; higher queue depths mean larger throughput but at 
the expense of latency. The defaults are usually fine.

> 3)   mount options:  noatime, delaylog,inode64,noquota, etc…

defaults work fine (noatime is a relic, relatime is what filesystems use by 
default nowadays)

> 
> I have written some papers/blogs on this subject if you are interested in 
> seeing them.
> Rick
>> On Mar 3, 2016, at 2:41 AM, Adrian Saul  
>> wrote:
>> 
>> Hi Ceph-users,
>> 
>> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
>> journals has higher than desired write latencies for RBD devices.  Any ideas?
>> 
>> 
>> I am developing a storage system based on Ceph and an SCST+pacemaker 
>> cluster.   Our initial testing showed promising results even with mixed 
>> available hardware and we proceeded to order a more designed platform for 
>> developing into production.   The hardware is:
>> 
>> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using 
>> RBD - they present iSCSI to other systems).
>> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
>> SSDs each
>> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
>> 
>> As part of the research and planning we opted to put a pair of Intel 
>> PC3700DC 400G NVME cards in each OSD server.  These are configured mirrored 
>> and setup as the journals for the OSD disks, the aim being to improve write 
>> latencies.  All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 
>> 4 aggregated 10G NICs back to a common pair of switches.   All machines are 
>> running Centos 7, with the frontends using the 4.4.1 elrepo-ml kernel to get 
>> a later RBD kernel module.
>> 
>> On the ceph side each disk in the OSD servers are setup as an individual 
>> OSD, with a 12G journal created on the flash mirror.   I setup the SSD 
>> servers into one root, and the SATA servers into another and created pools 
>> using hosts as fault boundaries, with the pools set for 2 copies.   I 
>> created the pools with the pg_num and pgp_num set to 32x the number of OSDs 
>> in the pool.   On the frontends we create RBD devices and present them as 
>> iSCSI LUNs using SCST to clients - in this test case a Solaris host.
>> 
>> The problem I have is that even with a lightly loaded system the service 
>> times for the LUNs for writes is just not getting down to where we want it, 
>> and they are not very stable - with 5 LUNs doing around 200 32K IOPS 
>> consistently the service times sit at around 3-4ms, but regularly (every 
>> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5 
>> minutes.  I fully expected we would have some latencies due to the 
>> distributed and networked nature of Ceph, but in this instance I just cannot 
>> find where these latencies are coming from, especially with the SSD based 
>> pool and having flash based journaling.
>> 
>> - The RBD devices show relatively low service times, but high queue times.  
>> These are in line with what Solaris sees so I don't think SCST/iSCSI is 
>> adding much latency.
>> - The journals are reporting 0.02ms service times, and seem to cope fine 
>> with any bursts
>> - The SSDs do show similar latency variations with writes - bursting up to 
>> 12ms or more whenever there is high write workloads.
>> - I have tried applying what tuning I can to the SSD block devices (noop 
>> scheduler etc) - no difference
>> - I have removed any sort of smarts around IO grouping 

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread RDS
A couple of suggestions:
1)   # of pgs per OSD should be 100-200
2)  When dealing with SSD or Flash, the performance of these devices hinges on 
how you partition them and how you tune Linux:
a)   if using partitions, did you align the partitions on a 4k 
boundary? I start at sector 2048 using either fdisk or sfdisk
b)   There are quite a few Linux settings that benefit SSD/Flash, 
namely: the deadline IO scheduler (only when using the deadline-associated 
settings), upping QDepth to 512 or 1024, setting rq_affinity=2 if the OS 
allows it, setting readahead if doing a majority of reads, and others
3)   mount options:  noatime, delaylog,inode64,noquota, etc…

I have written some papers/blogs on this subject if you are interested in 
seeing them.
Rick
> On Mar 3, 2016, at 2:41 AM, Adrian Saul  wrote:
> 
> Hi Ceph-users,
> 
> TL;DR - I can't seem to pin down why an unloaded system with flash based OSD 
> journals has higher than desired write latencies for RBD devices.  Any ideas?
> 
> 
>  I am developing a storage system based on Ceph and an SCST+pacemaker 
> cluster.   Our initial testing showed promising results even with mixed 
> available hardware and we proceeded to order a more designed platform for 
> developing into production.   The hardware is:
> 
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients using 
> RBD - they present iSCSI to other systems).
> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo 
> SSDs each
> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
> 
> As part of the research and planning we opted to put a pair of Intel PC3700DC 
> 400G NVME cards in each OSD server.  These are configured mirrored and setup 
> as the journals for the OSD disks, the aim being to improve write latencies.  
> All the machines have 128G RAM and dual E5-2630v3 CPUs, and use 4 aggregated 
> 10G NICs back to a common pair of switches.   All machines are running Centos 
> 7, with the frontends using the 4.4.1 elrepo-ml kernel to get a later RBD 
> kernel module.
> 
> On the ceph side each disk in the OSD servers are setup as an individual OSD, 
> with a 12G journal created on the flash mirror.   I setup the SSD servers 
> into one root, and the SATA servers into another and created pools using 
> hosts as fault boundaries, with the pools set for 2 copies.   I created the 
> pools with the pg_num and pgp_num set to 32x the number of OSDs in the pool.  
>  On the frontends we create RBD devices and present them as iSCSI LUNs using 
> SCST to clients - in this test case a Solaris host.
> 
> The problem I have is that even with a lightly loaded system the service 
> times for the LUNs for writes is just not getting down to where we want it, 
> and they are not very stable - with 5 LUNs doing around 200 32K IOPS 
> consistently the service times sit at around 3-4ms, but regularly (every 
> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5 
> minutes.  I fully expected we would have some latencies due to the 
> distributed and networked nature of Ceph, but in this instance I just cannot 
> find where these latencies are coming from, especially with the SSD based 
> pool and having flash based journaling.
> 
> - The RBD devices show relatively low service times, but high queue times.  
> These are in line with what Solaris sees so I don't think SCST/iSCSI is 
> adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine with 
> any bursts
> - The SSDs do show similar latency variations with writes - bursting up to 
> 12ms or more whenever there is high write workloads.
> - I have tried applying what tuning I can to the SSD block devices (noop 
> scheduler etc) - no difference
> - I have removed any sort of smarts around IO grouping in SCST - no major 
> impact
> - I have tried tuning up filestore queue and wbthrottle values but could 
> not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait and 
> I can do benchmarks up over 1GB/s in some tests.  Write throughput can also 
> be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI client 
> block sizes (i.e 32K, 128K instead of 4M) but it seemed to make things worse. 
>  I would have thought better alignment would reduce latency but is that 
> offset by the extra overhead in object work?
> 
> What I am looking for is what other areas do I need to look or diagnostics do 
> I need to work this out?  We would really like to use ceph across a mixed 
> workload that includes some DB systems that are fairly latency sensitive, but 
> as it stands its hard to be confident in the performance when a fairly quiet 
> unloaded system seems to struggle, even with all this hardware behind it.   I 
> get the impression that the SSD write latencies might be coming into play as 
> they are similar to the numbers I see, but really for 

Re: [ceph-users] Ceph RBD latencies

2016-03-03 Thread Christian Balzer

Hello,

On Thu, 3 Mar 2016 07:41:09 + Adrian Saul wrote:

> Hi Ceph-users,
> 
> TL;DR - I can't seem to pin down why an unloaded system with flash based
> OSD journals has higher than desired write latencies for RBD devices.
> Any ideas?
> 
> 
>   I am developing a storage system based on Ceph and an SCST+pacemaker
> cluster.   Our initial testing showed promising results even with mixed
> available hardware and we proceeded to order a more designed platform
> for developing into production.   The hardware is:
> 
> 2x 1RU servers as "frontends" (SCST+pacemaker - ceph mons and clients
> using RBD - they present iSCSI to other systems).
> 3x 2RU OSD SSD servers (24 bay 2.5" SSD) - currently with 4 2TB Samsung Evo
> SSDs each
> 3x 4RU OSD SATA servers (36 bay) - currently with 6 8TB Seagate each
>
Samsung EVO... 
Which exact model, I presume this is not a DC one?
 
If you had put your journals on those, you would already be pulling your
hairs out due to abysmal performance.

Also with Evo ones, I'd be worried about endurance.

>  As part of the research and planning we opted to put a pair of Intel
> PC3700DC 400G NVME cards in each OSD server.  These are configured
> mirrored and setup as the journals for the OSD disks, the aim being to
> improve write latencies.  All the machines have 128G RAM and dual
> E5-2630v3 CPUs, and use 4 aggregated 10G NICs back to a common pair of
> switches.   All machines are running Centos 7, with the frontends using
> the 4.4.1 elrepo-ml kernel to get a later RBD kernel module.
> 
> On the ceph side each disk in the OSD servers are setup as an individual
> OSD, with a 12G journal created on the flash mirror.   I setup the SSD
> servers into one root, and the SATA servers into another and created
> pools using hosts as fault boundaries, with the pools set for 2
> copies.   
Risky. If you have very reliable and well monitored SSDs you can get away
with 2 (I do so), but with HDDs and the combination of their reliability
and recovery time it's asking for trouble.
I realize that this is a testbed, but if your production has a replication
of 3 you will be disappointed by the additional latency.

> I created the pools with the pg_num and pgp_num set to 32x the
> number of OSDs in the pool.   On the frontends we create RBD devices and
> present them as iSCSI LUNs using SCST to clients - in this test case a
> Solaris host.
> 
> The problem I have is that even with a lightly loaded system the service
> times for the LUNs for writes is just not getting down to where we want
> it, and they are not very stable - with 5 LUNs doing around 200 32K IOPS
> consistently the service times sit at around 3-4ms, but regularly (every
> 20-30 seconds) up to above 12-15ms which puts the average at 6ms over 5
> minutes.  

This smells like garbage collection on your SSDs, especially since it
matches time wise what you saw on them below.

>I fully expected we would have some latencies due to the
> distributed and networked nature of Ceph, but in this instance I just
> cannot find where these latencies are coming from, especially with the
> SSD based pool and having flash based journaling.
> 
> - The RBD devices show relatively low service times, but high queue
> times.  These are in line with what Solaris sees so I don't think
> SCST/iSCSI is adding much latency.
> - The journals are reporting 0.02ms service times, and seem to cope fine
> with any bursts
> - The SSDs do show similar latency variations with writes - bursting up
> to 12ms or more whenever there is high write workloads.
This.

Have you tried the HDD based pool and did you see similar, consistent
interval, spikes?

Or alternatively, configured 2 of your NVMEs as OSDs?

As for monitoring, I like atop for instant feedback.
For more in-depth analysis (and for when you're not watching), collectd
with graphite serve me well.
 
> - I have tried applying what tuning I can to the SSD block devices (noop
> scheduler etc) - no difference
> - I have removed any sort of smarts around IO grouping in SCST - no
> major impact
> - I have tried tuning up filestore queue and wbthrottle values but
> could not find much difference from that.
> - Read performance is excellent, the RBD devices show little to no rwait
> and I can do benchmarks up over 1GB/s in some tests.  Write throughput
> can also be good (~700MB/s).
> - I have tried using different RBD orders more in line with the iSCSI
> client block sizes (i.e 32K, 128K instead of 4M) but it seemed to make
> things worse.  I would have thought better alignment would reduce
> latency but is that offset by the extra overhead in object work?
> 
> What I am looking for is what other areas do I need to look or
> diagnostics do I need to work this out?  We would really like to use
> ceph across a mixed workload that includes some DB systems that are
> fairly latency sensitive, but as it stands its hard to be confident in
> the performance when a fairly quiet unloaded system seems to struggle,
> even with all