Thanks Pranith, will do. Sunday night we put some things in place that
seem to be mitigating it, and thankfully we haven't seen it again, but
if we do I'll send the profile info to the list. In the meantime I was
able to collect some profile info under normal load.
We added some caching for some files we noticed had become really
popular, and when that didn't entirely stop the problem, we also stopped
the most recently added gluster volume. It's odd that that volume would
have any impact, as it was only used to archive backups and was almost
never active, but several times during the month we'd stop it, just
because it was the most recently added, and the issue would go away;
start it back up and it would come back. Since then it's been quiet.
On Thu, Feb 5, 2015 at 5:14 AM, Pranith Kumar Karampuri
<pkara...@redhat.com> wrote:
On 02/03/2015 11:16 AM, Matt wrote:
Hello List,
So I've been frustrated by intermittent performance problems
throughout January. The problem occurs on a two node setup running
3.4.5, 16 gigs of RAM with a bunch of local disk. Sometimes for an
hour, sometimes for weeks at a time (I have extensive graphs in
OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat
they'll show extremely high numbers of context switches and
interrupts. Eventually things calm down. During this time, memory
usage actually drops: overall usage on the box goes from between
6-10 gigs to right around 4 gigs, and stays there. That's what
really puzzles me.
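For completeness, the vmstat I'm watching is just the plain periodic
mode, something like this (the interval is arbitrary); the columns I
mean are "in" and "cs":

    # print system stats every 5 seconds; interrupts show up under "in",
    # context switches under "cs"
    vmstat 5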
When performance is problematic, sar shows one device, the device
backing the problem glusterfsd brick, using all the CPU and doing lots
of little reads, sometimes 70k/second, with a very small average
request size, say 10-12. Afraid I don't have any saved output handy,
but I can try to capture some next time it happens. I have tons of
information, frankly, but am trying to keep this reasonably brief.
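If it helps, the sar invocation is roughly the following (interval and
count are just what I happened to use); the columns I'm describing are
rd_sec/s and avgrq-sz:

    # per-device activity, one sample per second, ten samples;
    # -p prints real device names instead of devM-N
    sar -d -p 1 10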
There are more than a dozen volumes on this two node setup. The CPU
usage is pretty much entirely contained to one volume, a 1.5 TB
volume that is just shy of 70% full. It stores uploaded files for a
web app. What I hate about this app, and so am always suspicious of,
is that it stores a directory for every user at one level, so under
the /data directory in the volume there are 450,000 subdirectories
at this point.
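If anyone wants to sanity-check a number like that, I counted straight
on a brick with something along these lines; the brick path is of
course a placeholder:

    # count the immediate subdirectories of the volume's data directory
    find /path/to/brick/data -mindepth 1 -maxdepth 1 -type d | wc -l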
The only real mitigation step that's been taken so far was to turn
off the self-heal daemon on the volume, as I thought maybe crawling
that large directory was getting expensive. This doesn't seem to
have done anything as the problem still occurs.
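For reference, that was done with the ordinary volume option, roughly
as follows; "myvol" stands in for the real volume name:

    # turn off the self-heal daemon for this volume
    gluster volume set myvol cluster.self-heal-daemon off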
At this point I figure one of two sorts of things is happening, really
broadly: one, we're running into some sort of bug or performance
problem in Gluster that we should fix, perhaps by upgrading or tuning
around it; or two, some process we're running but aren't aware of is
hammering the file system and causing problems.
If it's the latter option, can anyone give me any tips on figuring
out what might be hammering the system? I can use volume top to see
what a brick is doing, but I can't figure out how to tell which
clients are doing what.
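For what it's worth, the brick-side view I do have comes from commands
along these lines ("myvol" and the brick path are placeholders):

    # highest open and read counts seen on one brick of the volume
    gluster volume top myvol open brick server1:/bricks/myvol list-cnt 10
    gluster volume top myvol read brick server1:/bricks/myvol list-cnt 10

But nothing in that output tells me which client is generating the load.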
Apologies for the somewhat broad nature of the question; any input or
thoughts would be much appreciated. I can certainly provide more
info about some things if it would help, but I've tried not to write
a novel here.
Thanks,
Could you enable profiling with 'gluster volume profile <volname>
start' for this volume?
The next time this issue happens, keep collecting 'gluster volume
profile <volname> info' outputs. Mail them and let's see what is
happening.
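Something like this on one of the nodes should be enough; the volume
name, interval and output path are only examples:

    # start server-side profiling for the volume
    gluster volume profile myvol start

    # while the problem is live, snapshot the per-brick stats periodically
    while true; do
        gluster volume profile myvol info >> /var/tmp/myvol-profile.log
        sleep 300
    done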
Pranith
-Matt
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users