Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Vijay Bellur

On 08/11/2012 10:16 PM, Harry Mangalam wrote:



On Sat, Aug 11, 2012 at 9:41 AM, Brian Candler wrote:



Maybe worth trying an strace (strace -f -p <pid> 2>strace.out) on the
glusterfsd process, or whatever it is which is causing the high load,
during such a burst, just for a few seconds. The output might give some clues.


Good idea.  I'll watch, and when it goes wacko I'll post the filtered
results.



It might also be useful to turn on volume profiling and capture IO
statistics when the load is high. 'volume profile <volname> info'
provides both cumulative and interval statistics (the interval being the
window between two consecutive 'volume profile info' invocations).
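
For example (just a sketch - substitute the real volume name for <volname>):

   # enable profiling on the volume, then sample it while the load is high
   gluster volume profile <volname> start
   gluster volume profile <volname> info

   # turn it off again once you have what you need
   gluster volume profile <volname> stop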


Thanks,
Vijay


Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Joe Julian
Check your client logs. I have seen that with network issues causing 
disconnects. 
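
(By default the client mount logs live under /var/log/glusterfs/, named
after the mount point, so something along these lines should turn up any
disconnects - adjust the path if your install logs elsewhere:

   grep -i disconnect /var/log/glusterfs/*.log
)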

Harry Mangalam  wrote:

>Thanks for your comments.
>
>I use mdadm on many servers and I've seen md numbering like this a fair
>bit. Usually it occurs after another RAID has been created and the
>numbering shifts.  Neil Brown (mdadm's author) seems to think it's fine,
>so I don't think that's the problem.  And you're right - this is a
>Frankengluster made from a variety of chassis and controllers, and normally
>it's fine.  As Brian noted, it's all the same to gluster, modulo some small
>local differences in IO performance.
>
>Re the size difference, I'll explicitly rebalance the brick after the
>fix-layout finishes, but I'm even more worried about this fantastic
>increase in CPU usage and its effect on user performance.
>
>In the fix-layout routines (still running), I've seen CPU usage of
>glusterfsd rise to ~400% and loadavg go up to >15 on all the servers
>(except pbs3, the one that originally had the problem).  That high load
>does not last long though - maybe a few minutes; we've just installed
>nagios on these nodes and I'm getting a ton of emails about load increasing
>and then decreasing on all the nodes (except pbs3).  When the load goes
>very high on a server node, the user-end performance drops appreciably.
>
>hjm
>
>
>
>On Sat, Aug 11, 2012 at 4:20 AM, Brian Candler  wrote:
>
>> On Sat, Aug 11, 2012 at 12:11:39PM +0100, Nux! wrote:
>> > On 10.08.2012 22:16, Harry Mangalam wrote:
>> > >pbs3:/dev/md127  8.2T  5.9T  2.3T  73% /bducgl  <---
>> >
>> > Harry,
>> >
>> > The name of that md device (127) indicates there may be something
>> > dodgy going on there. A device shouldn't be named 127 unless some
>> > problems occurred. Are you sure your drives are OK?
>>
>> I have systems with /dev/md127 all the time, and there's no problem. It
>> seems to number downwards from /dev/md127 - if I create another md array
>> on the same system it becomes /dev/md126.
>>
>> However, this does suggest that the nodes are not configured identically:
>> two are /dev/sda or /dev/sdb, which suggests either plain disk or hardware
>> RAID, while two are /dev/md0 or /dev/md127, which is software RAID.
>>
>> Although this could explain performance differences between the nodes, this
>> is transparent to gluster and doesn't explain why the files are unevenly
>> balanced - unless there is one huge file which happens to have been
>> allocated to this node.
>>
>> Regards,
>>
>> Brian.
>>
>>
>
>
>
>-- 
>Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
>[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
>415 South Circle View Dr, Irvine, CA, 92697 [shipping]
>MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
>


Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Harry Mangalam
On Sat, Aug 11, 2012 at 9:41 AM, Brian Candler  wrote:

> On Sat, Aug 11, 2012 at 08:31:51AM -0700, Harry Mangalam wrote:
> >Re the size difference, I'll explicitly rebalance the brick after the
> >fix-layout finishes, but I'm even more worried about this fantastic
> >increase in CPU usage and its effect on user performance.
>
> This presumably means you were originally running the cluster with fewer
> nodes, and then added some later?
>

No, but the current imbalance suggests that at some point it got out of
balance anyway.


>
> >In the fix-layout routines (still running), I've seen CPU usage of
> >glusterfsd rise to ~400% and loadavg go up to >15 on all the servers
> >(except pbs3, the one that originally had the problem).  That high load
> >does not last long though - maybe a few minutes; we've just installed
> >nagios on these nodes and I'm getting a ton of emails about load
> >increasing and then decreasing on all the nodes (except pbs3).  When
> >the load goes very high on a server node, the user-end performance
> >drops appreciably.
>
> Maybe worth trying an strace (strace -f -p <pid> 2>strace.out) on the
> glusterfsd process, or whatever it is which is causing the high load,
> during such a burst, just for a few seconds. The output might give some
> clues.
>

Good idea.  I'll watch, and when it goes wacko I'll post the filtered
results.

Thanks
Harry



-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)


Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Brian Candler
On Sat, Aug 11, 2012 at 08:31:51AM -0700, Harry Mangalam wrote:
>Re the size difference, I'll explicitly rebalance the brick after the
>fix-layout finishes, but I'm even more worried about this fantastic
>increase in CPU usage and its effect on user performance.

This presumably means you were originally running the cluster with fewer
nodes, and then added some later?

>In the fix-layout routines (still running), I've seen CPU usage of
>glusterfsd rise to ~400% and loadavg go up to >15 on all the servers
>(except pbs3, the one that originally had the problem).  That high load
>does not last long though - maybe a few minutes; we've just installed
>nagios on these nodes and I'm getting a ton of emails about load
>increasing and then decreasing on all the nodes (except pbs3).  When
>the load goes very high on a server node, the user-end performance
>drops appreciably.

Maybe worth trying an strace (strace -f -p <pid> 2>strace.out) on the
glusterfsd process, or whatever it is which is causing the high load,
during such a burst, just for a few seconds. The output might give some clues.
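
For example, something like this (just a sketch - if pgrep finds more than
one glusterfsd, pick the PID of the hot one from top instead):

   PID=$(pgrep -o glusterfsd)
   timeout 10 strace -f -p "$PID" 2> strace.out

   # 'strace -c' gives a per-syscall count/time summary instead of the
   # raw trace, which is often easier to skim:
   timeout 10 strace -c -f -p "$PID" 2> strace-summary.out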


Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Harry Mangalam
Thanks for your comments.

I use mdadm on many servers and I've seen md numbering like this a fair
bit. Usually it occurs after another RAID has been created and the
numbering shifts.  Neil Brown (mdadm's author) seems to think it's fine,
so I don't think that's the problem.  And you're right - this is a
Frankengluster made from a variety of chassis and controllers, and normally
it's fine.  As Brian noted, it's all the same to gluster, modulo some small
local differences in IO performance.

Re the size difference, I'll explicitly rebalance the brick after the
fix-layout finishes, but I'm even more worried about this fantastic
increase in CPU usage and its effect on user performance.

In the fix-layout routines (still running), I've seen CPU usage of
glusterfsd rise to ~400% and loadavg go up to >15 on all the servers
(except pbs3, the one that originally had the problem).  That high load
does not last long though - maybe a few minutes; we've just installed
nagios on these nodes and I'm getting a ton of emails about load increasing
and then decreasing on all the nodes (except pbs3).  When the load goes
very high on a server node, the user-end performance drops appreciably.

hjm



On Sat, Aug 11, 2012 at 4:20 AM, Brian Candler  wrote:

> On Sat, Aug 11, 2012 at 12:11:39PM +0100, Nux! wrote:
> > On 10.08.2012 22:16, Harry Mangalam wrote:
> > >pbs3:/dev/md127  8.2T  5.9T  2.3T  73% /bducgl  <---
> >
> > Harry,
> >
> > The name of that md device (127) indicates there may be something
> > dodgy going on there. A device shouldn't be named 127 unless some
> > problems occurred. Are you sure your drives are OK?
>
> I have systems with /dev/md127 all the time, and there's no problem. It
> seems to number downwards from /dev/md127 - if I create another md array
> on the same system it becomes /dev/md126.
>
> However, this does suggest that the nodes are not configured identically:
> two are /dev/sda or /dev/sdb, which suggests either plain disk or hardware
> RAID, while two are /dev/md0 or /dev/md127, which is software RAID.
>
> Although this could explain performance differences between the nodes, this
> is transparent to gluster and doesn't explain why the files are unevenly
> balanced - unless there is one huge file which happens to have been
> allocated to this node.
>
> Regards,
>
> Brian.
>
>



-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)


Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Brian Candler
On Sat, Aug 11, 2012 at 12:11:39PM +0100, Nux! wrote:
> On 10.08.2012 22:16, Harry Mangalam wrote:
> >pbs3:/dev/md127  8.2T  5.9T  2.3T  73% /bducgl  <---
> 
> Harry,
> 
> The name of that md device (127) indicates there may be something
> dodgy going on there. A device shouldn't be named 127 unless some
> problems occurred. Are you sure your drives are OK?

I have systems with /dev/md127 all the time, and there's no problem. It
seems to number downwards from /dev/md127 - if I create another md array on
the same system it becomes /dev/md126.
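
If you want to double-check that the arrays are healthy regardless of the
numbering, something along these lines should show all members present and
the array in a clean state:

   cat /proc/mdstat
   mdadm --detail /dev/md127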

However, this does suggest that the nodes are not configured identically:
two are /dev/sda or /dev/sdb, which suggests either plain disk or hardware
RAID, while two are /dev/md0 or /dev/md127, which is software RAID.

Although this could explain performance differences between the nodes, this
is transparent to gluster and doesn't explain why the files are unevenly
balanced - unless there is one huge file which happens to have been
allocated to this node.

Regards,

Brian.



Re: [Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-11 Thread Nux!

On 10.08.2012 22:16, Harry Mangalam wrote:

pbs3:/dev/md127  8.2T  5.9T  2.3T  73% /bducgl  <---


Harry,

The name of that md device (127) indicates there may be something dodgy
going on there. A device shouldn't be named 127 unless some problems
occurred. Are you sure your drives are OK?


--
Sent from the Delta quadrant using Borg technology!

Nux!
www.nux.ro


[Gluster-users] 1/4 glusterfsd's runs amok; performance suffers;

2012-08-10 Thread Harry Mangalam
Running 3.3 distributed on IPoIB on 4 nodes, 1 brick per node.  Any idea
why, on one of those nodes, glusterfsd would go berserk, running up to 370%
CPU and driving load to >30 (file performance on the clients slows to a
crawl)? While very slow, it continued to serve out files. This is the
second time this has happened in about a week. I had turned on the gluster
NFS service, but wasn't using it when this happened.  It's now off.
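
(For reference, the gluster NFS server can be switched off per volume with
something like the following - 'gl' here is just a stand-in for the actual
volume name:

   gluster volume set gl nfs.disable on
)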

kill -HUP did nothing to either glusterd or glusterfsd, so I had to kill
both and restart glusterd. That solved the overload on glusterfsd, and
performance is back to near normal. I'm now doing a rebalance/fix-layout,
which is running as expected but will take the weekend to complete.  I did
notice that the affected node (pbs3) has more files than the others, though
I'm not sure that this is significant.

Filesystem       Size  Used Avail Use% Mounted on
pbs1:/dev/sdb    6.4T  1.9T  4.6T  29% /bducgl
pbs2:/dev/md0    8.2T  2.4T  5.9T  30% /bducgl
pbs3:/dev/md127  8.2T  5.9T  2.3T  73% /bducgl  <---
pbs4:/dev/sda    6.4T  1.8T  4.6T  29% /bducgl
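
(The fix-layout mentioned above can be kicked off and watched with
something like the following - 'gl' standing in for the actual volume name:

   gluster volume rebalance gl fix-layout start   # fix the layout without moving data
   gluster volume rebalance gl status             # per-node progress
)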


-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)