Re: [Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

2015-02-05 Thread Matt
Thanks Pranith, will do. Sunday night we put some things in place that seem 
to be mitigating it, and thankfully we haven't seen it again, but if we do 
I'll send the profile info to the list. I was able to collect some 
profile info under normal load.


We added some caching for some files we noticed had become really 
popular, and when that didn't entirely stop the problem, we also stopped 
the most recently added gluster volume. It's odd that this volume would have 
any impact, as it was only used to archive backups and was almost never 
active, but several times during the month we'd stop it just because it 
was the most recently added, the issue would go away, and when we started 
it back up the issue would come back. Since then it's been quiet.


On Thu, Feb 5, 2015 at 5:14 AM, Pranith Kumar Karampuri wrote:


On 02/03/2015 11:16 AM, Matt wrote:

Hello List,

So I've been frustrated by intermittent performance problems throughout January. The problem occurs on a two-node setup running 3.4.5, with 16 gigs of RAM and a bunch of local disk. For sometimes an hour, sometimes weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat they'll show extremely high numbers of context switches and interrupts. Eventually things calm down. During this time, memory usage actually drops. Overall usage on the box goes from between 6 and 10 gigs to right around 4 gigs, and stays there. That's what really puzzles me.


When performance is problematic, sar shows one device, the device corresponding to the problematic glusterfsd, using all the CPU doing lots of little reads, sometimes 70k/second with a very small average request size, say 10-12. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. I have tons of information, frankly, but am trying to keep this reasonably brief.


There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory for every user at a single level, so under the /data directory in the volume there are 450,000 subdirectories at this point.


The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, as I thought maybe crawling that large directory was getting expensive. This doesn't seem to have done anything, as the problem still occurs.


At this point I figure, broadly, that one of two sorts of things is happening: one, we're running into some sort of bug or performance problem with gluster that we should fix, perhaps by upgrading or tuning around it; or two, some process we're running but aren't aware of is hammering the file system and causing problems.


If it's the latter, can anyone give me any tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.


Apologies for the somewhat broad nature of the question; any input or thoughts would be much appreciated. I can certainly provide more info about some things if it would help, but I've tried not to write a novel here.


Thanks,
Could you enable 'gluster volume profile <volname> start' for this 
volume?
The next time this issue happens, keep collecting 'gluster volume 
profile <volname> info' outputs. Mail them and let's see what is 
happening.


Pranith


-Matt


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users



Re: [Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

2015-02-05 Thread Pranith Kumar Karampuri


On 02/03/2015 11:16 AM, Matt wrote:

Hello List,

So I've been frustrated by intermittent performance problems throughout January. The problem occurs on a two-node setup running 3.4.5, with 16 gigs of RAM and a bunch of local disk. For sometimes an hour, sometimes weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat they'll show extremely high numbers of context switches and interrupts. Eventually things calm down. During this time, memory usage actually drops. Overall usage on the box goes from between 6 and 10 gigs to right around 4 gigs, and stays there. That's what really puzzles me.


When performance is problematic, sar shows one device, the device corresponding to the problematic glusterfsd, using all the CPU doing lots of little reads, sometimes 70k/second with a very small average request size, say 10-12. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. I have tons of information, frankly, but am trying to keep this reasonably brief.


There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory for every user at a single level, so under the /data directory in the volume there are 450,000 subdirectories at this point.


The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, as I thought maybe crawling that large directory was getting expensive. This doesn't seem to have done anything, as the problem still occurs.


At this point I figure, broadly, that one of two sorts of things is happening: one, we're running into some sort of bug or performance problem with gluster that we should fix, perhaps by upgrading or tuning around it; or two, some process we're running but aren't aware of is hammering the file system and causing problems.


If it's the latter, can anyone give me any tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.


Apologies for the somewhat broad nature of the question; any input or thoughts would be much appreciated. I can certainly provide more info about some things if it would help, but I've tried not to write a novel here.


Thanks,

Could you enable 'gluster volume profile <volname> start' for this volume?
The next time this issue happens, keep collecting 'gluster volume 
profile <volname> info' outputs. Mail them and let's see what is happening.
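
Something like this works for collecting it (the volume name "myvolume" and the one-minute interval below are just placeholders; adjust for your setup):

  # enable the io-stats profile counters on the suspect volume
  gluster volume profile myvolume start

  # while the problem is happening, append a timestamped snapshot every minute
  while true; do
      date >> profile-$(hostname).log
      gluster volume profile myvolume info >> profile-$(hostname).log
      sleep 60
  done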


Pranith


-Matt


___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users



Re: [Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

2015-02-03 Thread Matt


I've been trying for weeks to reproduce the performance problems in 
our preproduction environments, but I can't. As a result, selling the idea 
of just upgrading to 3.6.x and hoping it goes away might be tricky. 3.6 is 
perceived as a little too bleeding edge, and we've actually had some 
other not fully explained issues with this cluster recently that make 
us hesitate. I don't think they're related.


On Tue, Feb 3, 2015 at 4:58 AM, Justin Clift wrote:

- Original Message -

 Hello List,

 So I've been frustrated by intermittent performance problems throughout January. The problem occurs on a two-node setup running 3.4.5, with 16 gigs of RAM and a bunch of local disk. For sometimes an hour, sometimes weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat they'll show extremely high numbers of context switches and interrupts. Eventually things calm down. During this time, memory usage actually drops. Overall usage on the box goes from between 6 and 10 gigs to right around 4 gigs, and stays there. That's what really puzzles me.

 When performance is problematic, sar shows one device, the device corresponding to the problematic glusterfsd, using all the CPU doing lots of little reads, sometimes 70k/second with a very small average request size, say 10-12. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. I have tons of information, frankly, but am trying to keep this reasonably brief.

 There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory for every user at a single level, so under the /data directory in the volume there are 450,000 subdirectories at this point.

 The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, as I thought maybe crawling that large directory was getting expensive. This doesn't seem to have done anything, as the problem still occurs.

 At this point I figure, broadly, that one of two sorts of things is happening: one, we're running into some sort of bug or performance problem with gluster that we should fix, perhaps by upgrading or tuning around it; or two, some process we're running but aren't aware of is hammering the file system and causing problems.

 If it's the latter, can anyone give me any tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.

 Apologies for the somewhat broad nature of the question; any input or thoughts would be much appreciated. I can certainly provide more info about some things if it would help, but I've tried not to write a novel here.


Out of curiosity, are you able to test using GlusterFS 3.6.2? We've done
a fair amount of in-depth upstream testing at decent scale (100+ nodes)
from 3.5.x onwards, with lots of performance issues identified and fixed
along the way.

So, I'm kinda hopeful the problem you're describing is fixed in newer
releases. :D

Regards and best wishes,

Justin Clift

--
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users

Re: [Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

2015-02-03 Thread Justin Clift
- Original Message -
> Hello List,
> 
> So I've been frustrated by intermittent performance problems throughout January. The problem occurs on a two-node setup running 3.4.5, with 16 gigs of RAM and a bunch of local disk. For sometimes an hour, sometimes weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat they'll show extremely high numbers of context switches and interrupts. Eventually things calm down. During this time, memory usage actually drops. Overall usage on the box goes from between 6 and 10 gigs to right around 4 gigs, and stays there. That's what really puzzles me.
> 
> When performance is problematic, sar shows one device, the device corresponding to the problematic glusterfsd, using all the CPU doing lots of little reads, sometimes 70k/second with a very small average request size, say 10-12. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. I have tons of information, frankly, but am trying to keep this reasonably brief.
> 
> There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory for every user at a single level, so under the /data directory in the volume there are 450,000 subdirectories at this point.
> 
> The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, as I thought maybe crawling that large directory was getting expensive. This doesn't seem to have done anything, as the problem still occurs.
> 
> At this point I figure, broadly, that one of two sorts of things is happening: one, we're running into some sort of bug or performance problem with gluster that we should fix, perhaps by upgrading or tuning around it; or two, some process we're running but aren't aware of is hammering the file system and causing problems.
> 
> If it's the latter, can anyone give me any tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.
> 
> Apologies for the somewhat broad nature of the question; any input or thoughts would be much appreciated. I can certainly provide more info about some things if it would help, but I've tried not to write a novel here.

Out of curiosity, are you able to test using GlusterFS 3.6.2? We've done
a fair amount of in-depth upstream testing at decent scale (100+ nodes)
from 3.5.x onwards, with lots of performance issues identified and fixed
along the way.

So, I'm kinda hopeful the problem you're describing is fixed in newer
releases. :D

Regards and best wishes,

Justin Clift

-- 
GlusterFS - http://www.gluster.org

An open source, distributed file system scaling to several
petabytes, and handling thousands of clients.

My personal twitter: twitter.com/realjustinclift
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Diagnosing Intermittent Performance Problems Possibly Caused by Gremlins

2015-02-02 Thread Matt

Hello List,

So I've been frustrated by intermittent performance problems throughout January. The problem occurs on a two-node setup running 3.4.5, with 16 gigs of RAM and a bunch of local disk. For sometimes an hour, sometimes weeks at a time (I have extensive graphs in OpenNMS), our Gluster boxes will get their CPUs pegged, and in vmstat they'll show extremely high numbers of context switches and interrupts. Eventually things calm down. During this time, memory usage actually drops. Overall usage on the box goes from between 6 and 10 gigs to right around 4 gigs, and stays there. That's what really puzzles me.


When performance is problematic, sar shows one device, the device corresponding to the problematic glusterfsd, using all the CPU doing lots of little reads, sometimes 70k/second with a very small average request size, say 10-12. Afraid I don't have any saved output handy, but I can try to capture some next time it happens. I have tons of information, frankly, but am trying to keep this reasonably brief.
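
For what it's worth, next time it flares up I plan to capture something like the following for a few minutes (the 5-second interval and sample counts are arbitrary, and the grep assumes the brick processes show up as glusterfsd):

  # 5-second samples for ~5 minutes, all three running in parallel
  vmstat 5 60 > vmstat-$(date +%F-%H%M).log &                        # context switches / interrupts
  sar -d -p 5 60 > sar-disk-$(date +%F-%H%M).log &                   # per-device reads/s and avg request size
  top -b -d 5 -n 60 | grep glusterfsd > top-$(date +%F-%H%M).log &   # which glusterfsd is pegging the CPU
  wait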


There are more than a dozen volumes on this two-node setup. The CPU usage is pretty much entirely contained to one volume, a 1.5 TB volume that is just shy of 70% full. It stores uploaded files for a web app. What I hate about this app, and why I'm always suspicious of it, is that it stores a directory for every user at a single level, so under the /data directory in the volume there are 450,000 subdirectories at this point.


The only real mitigation step taken so far was to turn off the self-heal daemon on the volume, as I thought maybe crawling that large directory was getting expensive. This doesn't seem to have done anything, as the problem still occurs.
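
(That was done with the usual volume set toggle; "myvolume" below is a stand-in for the real volume name:)

  # stop the self-heal daemon crawl on the suspect volume
  gluster volume set myvolume cluster.self-heal-daemon off

  # verify it shows up under "Options Reconfigured"
  gluster volume info myvolume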


At this point I figure, broadly, that one of two sorts of things is happening: one, we're running into some sort of bug or performance problem with gluster that we should fix, perhaps by upgrading or tuning around it; or two, some process we're running but aren't aware of is hammering the file system and causing problems.


If it's the latter, can anyone give me any tips on figuring out what might be hammering the system? I can use volume top to see what a brick is doing, but I can't figure out how to tell which clients are doing what.
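
For reference, this is roughly what I've been looking at so far; the volume and brick names are placeholders. It tells me which files are hot on a given brick, but nothing about which client is driving the traffic:

  # most frequently opened files and open fd counts on one brick
  gluster volume top myvolume open brick server1:/bricks/myvolume list-cnt 25

  # files with the highest read call counts across the volume
  gluster volume top myvolume read list-cnt 25

  # files with the highest write call counts
  gluster volume top myvolume write list-cnt 25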


Apologies for the somewhat broad nature of the question; any input or thoughts would be much appreciated. I can certainly provide more info about some things if it would help, but I've tried not to write a novel here.


Thanks,

-Matt
___
Gluster-users mailing list
Gluster-users@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-users