Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
On Thu, Aug 21, 2014 at 07:40:45AM +, Dan Van Der Ster wrote:
> On 20 Aug 2014, at 17:54, Hugo Mills  wrote:
> >> Does your hardware provide enough IOPS for what your users need?
> >> (e.g. what is the op/s from ceph -w)
> > 
> >   Not really an answer to your question, but: Before the ceph cluster
> > went in, we were running the system on two 5-year-old NFS servers for
> > a while. We have about half the total number of spindles that we used
> > to, but more modern drives.
> 
> NFS exported async or sync? If async, it can’t be compared to
> CephFS. Also, if those NFS servers had RAID cards with a wb-cache,
> it can’t really be compared.

   Hmm. Yes, async. Probably wouldn't have been my choice... (I only
started working with this system recently -- about the same time that
the ceph cluster was deployed to replace the older machines. I haven't
had much of a say in what's implemented here, but I have to try to
support it.)

   I'm tempted to put the users' home directories back on an NFS
server, and keep ceph for the research data. That at least should give
us more in the way of interactivity (which is the main thing I'm
getting complaints about).

> >   I'll look at how the op/s values change when we have the problem.
> > At the moment (with what I assume to be normal desktop usage from the
> > 3-4 users in the lab), they're flapping wildly somewhere around a
> > median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> > read and write.

> Another tunable to look at is the filestore max sync interval — in
> my experience the colocated journal/OSD setup suffers with the
> default (5s, IIRC), especially when an OSD is getting a constant
> stream of writes. When this happens, the disk heads are constantly
> seeking back and forth between synchronously writing to the journal
> and flushing the outstanding writes. If there were a dedicated
> (spinning) disk for the journal, then the synchronous writes (to the
> journal) could be done sequentially (and thus quickly) and the flushes
> would also be quick(er). SSD journals can obviously also help with
> this.

   Not sure what you mean about colocated journal/OSD. The journals
aren't on the same device as the OSDs. However, all three journals on
each machine are on the same SSD.

> For a short test I would try increasing filestore max sync interval
> to 30s or maybe even 60s to see if it helps. (I know that at least
> one of the Inktank experts advises against changing the filestore max
> sync interval — but in my experience 5s is much too short for the
> colocated journal setup.) You need to make sure your journals are
> large enough to store 30/60s of writes, but when you have
> predominantly small writes even a few GB of journal ought to be
> enough.

   I'll have a play with that.
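
   For the record, this is roughly what I intend to try -- a sketch
only, with the injectargs form being my assumption of the quickest way
to test it on a live cluster:

# in ceph.conf on the OSD hosts, followed by an OSD restart
[osd]
    filestore max sync interval = 30

# or poke it into the running OSDs first to see what happens
ceph tell osd.* injectargs '--filestore-max-sync-interval 30'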

   Thanks for all the help so far -- it's been useful. I'm learning
what the right kind of questions are.

   Hugo.

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers :: x6943 :: R07 Harry Pitt Building


Re: [ceph-users] Serious performance problems with small file writes

2014-08-21 Thread Hugo Mills
   Just to fill in some of the gaps from yesterday's mail:

On Wed, Aug 20, 2014 at 04:54:28PM +0100, Hugo Mills wrote:
>Some questions below I can't answer immediately, but I'll spend
> tomorrow morning irritating people by triggering these events (I think
> I have a reproducer -- unpacking a 1.2 GiB tarball with 25 small
> files in it) and giving you more details. 

   Yes, the tarball with the 25 small files in it is definitely a
reproducer.
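
   The test itself is nothing clever: unpack the tarball onto CephFS
from one of the lab desktops and watch the clock (the paths below are
purely illustrative, not our real layout):

cd /ceph/scratch/test
time tar xf ~/problem-tarball.tar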

[snip]
> > What about iostat on the OSDs — are your OSD disks busy reading or
> > writing during these incidents?
> 
>Not sure. I don't think so, but I'll try to trigger an incident and
> report back on this one.

   Mostly writing. I'm seeing figures of up to about 2-3 MB/s of writes
and 200-300 kB/s of reads on all three OSD disks, but it fluctuates a
lot (sampled at 5-second intervals). Sample data at the end of the
email.
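
   For reference, those numbers come from watching something along
these lines on the OSD host (device names are just an example of our
layout, not gospel):

# extended per-device stats at 5-second intervals, OSD data disks only
iostat -x 5 sdb sdc sdd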

> > What are you using for OSD journals?
> 
>On each machine, the three OSD journals live on the same ext4
> filesystem on an SSD, which is also the root filesystem of the
> machine.
> 
> > Also check the CPU usage for the mons and osds...
> 
>The mons are doing pretty much nothing in terms of CPU, as far as I
> can see. I will double-check during an incident.

   The mons are just ticking over with a <1% CPU usage.

> > Does your hardware provide enough IOPS for what your users need?
> > (e.g. what is the op/s from ceph -w)
> 
>Not really an answer to your question, but: Before the ceph cluster
> went in, we were running the system on two 5-year-old NFS servers for
> a while. We have about half the total number of spindles that we used
> to, but more modern drives.
> 
>I'll look at how the op/s values change when we have the problem.
> At the moment (with what I assume to be normal desktop usage from the
> 3-4 users in the lab), they're flapping wildly somewhere around a
> median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
> read and write.

   With minimal users and one machine running the tar unpacking
process, I'm getting somewhere around 100-200 op/s on the ceph
cluster, but interactivity on the desktop machine I'm logged in on is
horrible -- I'm frequently getting tens of seconds of latency. Compare
that to the (relatively) comfortable 350-400 op/s we had yesterday
with what was probably a workload of larger files.

> > If disabling deep scrub helps, then it might be that something else
> > is reading the disks heavily. One thing to check is updatedb — we
> > had to disable it from indexing /var/lib/ceph on our OSDs.
> 
>I haven't seen that running at all during the day, but I'll look
> into it.

   No, it's not anything like that -- iotop reports that pretty much
the only things doing IO are ceph-osd and the occasional xfsaild.
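
   For the archives, though: the exclusion Dan describes would
presumably be done by adding /var/lib/ceph to the prune list in
/etc/updatedb.conf, merged with whatever paths are already there --
something like:

# keep mlocate/updatedb away from the OSD data directories
PRUNEPATHS="/tmp /var/spool /media /var/lib/ceph"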

   Hugo.

>Hugo.
> 
> > Best Regards,
> > Dan
> > 
> > -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> > 
> > 
> > On 20 Aug 2014, at 16:39, Hugo Mills  wrote:
> > 
> > >   We have a ceph system here, and we're seeing performance regularly
> > > descend into unusability for periods of minutes at a time (or longer).
> > > This appears to be triggered by writing large numbers of small files.
> > > 
> > >   Specifications:
> > > 
> > > ceph 0.80.5
> > > 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> > > 2 machines running primary and standby MDS
> > > 3 monitors on the same machines as the OSDs
> > > Infiniband to about 8 CephFS clients (headless, in the machine room)
> > > Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
> > >   machines, in the analysis lab)
> > > 
> > >   The cluster stores home directories of the users and a larger area
> > > of scientific data (approx 15 TB) which is being processed and
> > > analysed by the users of the cluster.
> > > 
> > >   We have a relatively small number of concurrent users (typically
> > > 4-6 at most), who use GUI tools to examine their data, and then
> > > complex sets of MATLAB scripts to process it, with processing often
> > > being distributed across all the machines using Condor.
> > > 
> > >   It's not unusual to see the analysis scripts write out large
> > > numbers (thousands, possibly tens or hundreds of thousands) of small
> > > files, often from many client machines at once in parallel. When this
> > > happens, the ceph cluster becomes almost completely unresponsive for
> > > tens of seconds (or even for minutes) at a time, until the writes are
> > > flushed through the system.

Re: [ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
> conservative, and there are options to cache more entries,
> etc…)

   Not much. We have:

[global]
mds inline_data = true
mds shutdown check = 2
mds cache size = 75

[mds]
mds client prealloc inos = 1
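
   If it'd help, I can pull the full effective set of MDS options off
the admin socket -- something like the following, where the socket name
is just an example rather than our real one:

ceph --admin-daemon /var/run/ceph/ceph-mds.mds01.asok config show | grep mds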

> What about iostat on the OSDs — are your OSD disks busy reading or
> writing during these incidents?

   Not sure. I don't think so, but I'll try to trigger an incident and
report back on this one.

> What are you using for OSD journals?

   On each machine, the three OSD journals live on the same ext4
filesystem on an SSD, which is also the root filesystem of the
machine.
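
   Concretely -- and the paths here are the stock layout, so treat the
exact names as an assumption -- each OSD's journal is just a file on
the SSD, reached via a symlink:

# one journal per OSD; all three end up on the same SSD filesystem
ls -l /var/lib/ceph/osd/ceph-*/journal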

> Also check the CPU usage for the mons and osds...

   The mons are doing pretty much nothing in terms of CPU, as far as I
can see. I will double-check during an incident.

> Does your hardware provide enough IOPS for what your users need?
> (e.g. what is the op/s from ceph -w)

   Not really an answer to your question, but: Before the ceph cluster
went in, we were running the system on two 5-year-old NFS servers for
a while. We have about half the total number of spindles that we used
to, but more modern drives.

   I'll look at how the op/s values change when we have the problem.
At the moment (with what I assume to be normal desktop usage from the
3-4 users in the lab), they're flapping wildly somewhere around a
median of 350-400, with peaks up to 800. Somewhere around 15-20 MB/s
read and write.

> If disabling deep scrub helps, then it might be that something else
> is reading the disks heavily. One thing to check is updatedb — we
> had to disable it from indexing /var/lib/ceph on our OSDs.

   I haven't seen that running at all during the day, but I'll look
into it.

   Hugo.

> Best Regards,
> Dan
> 
> -- Dan van der Ster || Data & Storage Services || CERN IT Department --
> 
> 
> On 20 Aug 2014, at 16:39, Hugo Mills  wrote:
> 
> >   We have a ceph system here, and we're seeing performance regularly
> > descend into unusability for periods of minutes at a time (or longer).
> > This appears to be triggered by writing large numbers of small files.
> > 
> >   Specifications:
> > 
> > ceph 0.80.5
> > 6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
> > 2 machines running primary and standby MDS
> > 3 monitors on the same machines as the OSDs
> > Infiniband to about 8 CephFS clients (headless, in the machine room)
> > Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
> >   machines, in the analysis lab)
> > 
> >   The cluster stores home directories of the users and a larger area
> > of scientific data (approx 15 TB) which is being processed and
> > analysed by the users of the cluster.
> > 
> >   We have a relatively small number of concurrent users (typically
> > 4-6 at most), who use GUI tools to examine their data, and then
> > complex sets of MATLAB scripts to process it, with processing often
> > being distributed across all the machines using Condor.
> > 
> >   It's not unusual to see the analysis scripts write out large
> > numbers (thousands, possibly tens or hundreds of thousands) of small
> > files, often from many client machines at once in parallel. When this
> > happens, the ceph cluster becomes almost completely unresponsive for
> > tens of seconds (or even for minutes) at a time, until the writes are
> > flushed through the system. Given the nature of modern GUI desktop
> > environments (often reading and writing small state files in the
> > > user's home directory), this means that desktop interactivity and
> > responsiveness for all the other users of the cluster suffer.
> > 
> >   1-minute load on the servers typically peaks at about 8 during
> > these events (on 4-core machines). Load on the clients also peaks
> > high, because of the number of processes waiting for a response from
> > the FS. The MDS shows little sign of stress -- it seems to be entirely
> > down to the OSDs. ceph -w shows requests blocked for more than 10
> > seconds, and in bad cases, ceph -s shows up to many hundreds of
> > requests blocked for more than 32s.
> > 
> >   We've had to turn off scrubbing and deep scrubbing completely --
> > except between 01.00 and 04.00 every night -- because it triggers the
> > exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
> > up to 7 PGs being scrubbed, as it did on Monday, it's completely
> > unusable.
> > 
> >   Is this problem something that's often seen? If so, what are the
> > best options for mitigation or elimination of the problem? I've found
> > a few references to issue #6278 [1], but that seems to be referencing
> > scrub specifically, not ordinary (if possibly pathological) writes.

[ceph-users] Serious performance problems with small file writes

2014-08-20 Thread Hugo Mills
   We have a ceph system here, and we're seeing performance regularly
descend into unusability for periods of minutes at a time (or longer).
This appears to be triggered by writing large numbers of small files.

   Specifications:

ceph 0.80.5
6 machines running 3 OSDs each (one 4 TB rotational HD per OSD, 2 threads)
2 machines running primary and standby MDS
3 monitors on the same machines as the OSDs
Infiniband to about 8 CephFS clients (headless, in the machine room)
Gigabit ethernet to a further 16 or so CephFS clients (Linux desktop
   machines, in the analysis lab)

   The cluster stores home directories of the users and a larger area
of scientific data (approx 15 TB) which is being processed and
analysed by the users of the cluster.

   We have a relatively small number of concurrent users (typically
4-6 at most), who use GUI tools to examine their data, and then
complex sets of MATLAB scripts to process it, with processing often
being distributed across all the machines using Condor.

   It's not unusual to see the analysis scripts write out large
numbers (thousands, possibly tens or hundreds of thousands) of small
files, often from many client machines at once in parallel. When this
happens, the ceph cluster becomes almost completely unresponsive for
tens of seconds (or even for minutes) at a time, until the writes are
flushed through the system. Given the nature of modern GUI desktop
environments (often reading and writing small state files in the
user's home directory), this means that desktop interactivity and
responsiveness for all the other users of the cluster suffer.

   1-minute load on the servers typically peaks at about 8 during
these events (on 4-core machines). Load on the clients also peaks
high, because of the number of processes waiting for a response from
the FS. The MDS shows little sign of stress -- it seems to be entirely
down to the OSDs. ceph -w shows requests blocked for more than 10
seconds, and in bad cases, ceph -s shows up to many hundreds of
requests blocked for more than 32s.

   We've had to turn off scrubbing and deep scrubbing completely --
except between 01.00 and 04.00 every night -- because it triggers the
exact same symptoms, even with only 2-3 PGs being scrubbed. If it gets
up to 7 PGs being scrubbed, as it did on Monday, it's completely
unusable.
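
   For anyone else fighting the same thing: one way to get that sort of
schedule is to toggle the noscrub / nodeep-scrub cluster flags from
cron. A sketch (not necessarily our exact setup):

# /etc/crontab on one of the mon hosts: allow scrubbing 01.00-04.00
0 1 * * * root ceph osd unset noscrub; ceph osd unset nodeep-scrub
0 4 * * * root ceph osd set noscrub;   ceph osd set nodeep-scrub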

   Is this problem something that's often seen? If so, what are the
best options for mitigation or elimination of the problem? I've found
a few references to issue #6278 [1], but that seems to be referencing
scrub specifically, not ordinary (if possibly pathological) writes.

   What are the sorts of things I should be looking at to work out
where the bottleneck(s) are? I'm a bit lost about how to drill down
into the ceph system for identifying performance issues. Is there a
useful guide to tools somewhere?
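
   So far the only things I've found to poke at are along these lines
(largely guesses on my part, so corrections welcome; osd.0 and the
socket path are just examples):

# cluster-wide state and blocked-request warnings
ceph -s
ceph -w

# per-OSD commit/apply latencies
ceph osd perf

# recent slow ops on one OSD, via its admin socket
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops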

   Is an upgrade to 0.84 likely to be helpful? How "development" are
the development releases, from a stability / dangerous bugs point of
view?

   Thanks,
   Hugo.

[1] http://tracker.ceph.com/issues/6278

-- 
Hugo Mills :: IT Services, University of Reading
Specialist Engineer, Research Servers