Thanks Pranith, will do. Sunday night we put some things in place that
seem to be mitigating it, and thankfully we haven't seen it again, but
if we do I'll send the profile info to the list. In the meantime I was
able to collect some profile info under normal load.
We added some caching for some files we noticed had become really
popular, and when that didn't entirely stop the problem, we also stopped
the most recently added gluster volume. It's odd that that volume would
have any impact, as it was only used to archive backups and was almost
never active, but several times during the month we'd stop it, just
because it was the most recently added, and the issue would go away;
start it back up and it would come back. Since then it's been quiet.
On Thu, Feb 5, 2015 at 5:14 AM, Pranith Kumar Karampuri
<pkara...@redhat.com> wrote:
On 02/03/2015 11:16 AM, Matt wrote:
Hello List,
So I've been frustrated by intermittent performance problems
throughout January. The problem occurs on a two node setup running
3.4.5, 16 gigs of RAM with a bunch of local disk. Sometimes for an
hour, sometimes for weeks at a time (I have extensive graphs in
OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat
they'll show extremely high numbers of context switches and
interrupts. Eventually things calm down. During this time, memory
usage actually drops: overall usage on the box goes from between
6-10 gigs to right around 4 gigs, and stays there. That's what
really puzzles me.
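For completeness, the vmstat I'm watching is just the plain periodic
mode, something like this (the interval is arbitrary); the columns I
mean are "in" and "cs":

    # print system stats every 5 seconds; interrupts show up under "in",
    # context switches under "cs"
    vmstat 5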
When performance is problematic, sar shows one device, the device
backing the problem glusterfsd brick, using all the CPU and doing lots
of little reads, sometimes 70k/second, with a very small average
request size, say 10-12. Afraid I don't have any saved output handy,
but I can try to capture some next time it happens. I have tons of
information, frankly, but am trying to keep this reasonably brief.
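If it helps, the sar invocation is roughly the following (interval and
count are just what I happened to use); the columns I'm describing are
rd_sec/s and avgrq-sz:

    # per-device activity, one sample per second, ten samples;
    # -p prints real device names instead of devM-N
    sar -d -p 1 10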
There are more than a dozen volumes on this two node setup. The CPU
usage is pretty much entirely contained to one volume, a 1.5 TB
volume that is just shy of 70% full. It stores uploaded files for a
web app. What I hate about this app, and so am always suspicious of,
is that it stores a directory for every user at one level, so under
the /data directory in the volume there are 450,000 subdirectories
at this point.
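If anyone wants to sanity-check a number like that, I counted straight
on a brick with something along these lines; the brick path is of
course a placeholder:

    # count the immediate subdirectories of the volume's data directory
    find /path/to/brick/data -mindepth 1 -maxdepth 1 -type d | wc -l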
The only real mitigation step that's been taken so far was to turn
off the self-heal daemon on the volume, as I thought maybe crawling
that large directory was getting expensive. This doesn't seem to
have done anything as the problem still occurs.
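For reference, that was done with the ordinary volume option, roughly
as follows; "myvol" stands in for the real volume name:

    # turn off the self-heal daemon for this volume
    gluster volume set myvol cluster.self-heal-daemon off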
At this point I figure one of two sorts of things is happening, really
broadly: one, we're running into some sort of bug or performance
problem in Gluster that we should fix, perhaps by upgrading or tuning
around it; or two, some process we're running but aren't aware of is
hammering the file system and causing problems.
If it's the latter option, can anyone give me any tips on figuring
out what might be hammering the system? I can use volume top to see
what a brick is doing, but I can't figure out how to tell which
clients are doing what.
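For what it's worth, the brick-side view I do have comes from commands
along these lines ("myvol" and the brick path are placeholders):

    # highest open and read counts seen on one brick of the volume
    gluster volume top myvol open brick server1:/bricks/myvol list-cnt 10
    gluster volume top myvol read brick server1:/bricks/myvol list-cnt 10

But nothing in that output tells me which client is generating the load.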
Apologies for the somewhat broad nature of the question; any input or
thoughts would be much appreciated. I can certainly provide more
info about some things if it would help, but I've tried not to write
a novel here.
Thanks,
Could you enable profiling with 'gluster volume profile <volname>
start' for this volume?
The next time this issue happens, keep collecting 'gluster volume
profile <volname> info' outputs. Mail them and let's see what is
happening.
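Something like this on one of the nodes should be enough; the volume
name, interval and output path are only examples:

    # start server-side profiling for the volume
    gluster volume profile myvol start

    # while the problem is live, snapshot the per-brick stats periodically
    while true; do
        gluster volume profile myvol info >> /var/tmp/myvol-profile.log
        sleep 300
    done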
Pranith
-Matt
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users