Re: [Gluster-users] single problematic node (brick)

2014-05-20 Thread Franco Broi

Are you running out of memory? How much memory are the gluster daemons
using?

On Tue, 2014-05-20 at 11:16 -0700, Doug Schouten wrote: 
> Hello,
> 
>   I have a rather simple Gluster configuration that consists of 85TB 
> distributed across six nodes. There is one particular node that seems to 
> fail on a ~ weekly basis, and I can't figure out why.
> 
> I have attached my Gluster configuration and a recent log file from the 
> problematic node. For a user, when the failure occurs, the symptom is 
> that any attempt to access the Gluster volume from the problematic node 
> fails with a "transport endpoint is not connected" error.
> 
> Restarting the Gluster daemons and remounting the volume on the failed 
> node always fixes the problem. But by that point some jobs in our batch 
> queue have usually already failed because of this issue, and it's 
> becoming a headache.
> 
> It could be a FUSE issue, since I see many related error messages in the 
> Gluster log, but I can't disentangle the various errors. The relevant 
> line in my /etc/fstab file is
> 
> server:global /global glusterfs defaults,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log 0 0
> 
> Any ideas on the source of the problem? Could it be a hardware (network) 
> glitch? The fact that it only happens on one node that is configured 
> identically (with the same hardware) to the other nodes points to 
> something like that.
> 
> thanks! Doug


___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


Re: [Gluster-users] Troubles with syscall 'lstat'

2014-05-20 Thread Franco Broi

If you are trying to use lstat+mkdir as a locking mechanism so that you
can run multiple instances of the same program, it will probably fail
more often on a FUSE filesystem than on a local one. It should probably
use flock() instead, or open a lock file with O_CREAT|O_EXCL.
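
For what it's worth, here is a minimal sketch of the open(O_CREAT|O_EXCL)
variant in C (the lock-file path and names are only an example, not taken
from your program); exactly one instance wins the create and the others
back off:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *lockpath = "/tmp/tilde.lock";   /* hypothetical lock file */

    /* O_CREAT|O_EXCL makes the create atomic: it fails with EEXIST
     * if another instance already created the file. */
    int fd = open(lockpath, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd == -1) {
        if (errno == EEXIST) {
            fprintf(stderr, "another instance holds the lock\n");
            return 1;
        }
        perror("open");
        return 2;
    }

    /* ... critical section: do the single-instance work here ... */

    close(fd);
    unlink(lockpath);   /* release the lock */
    return 0;
}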


On Tue, 2014-05-20 at 11:58 +0200, Nicolas Greneche wrote: 
> Hello,
> 
> I am running GlusterFS version 3.3.1:
> 
> 
> # glusterd -V
> glusterfs 3.3.1 built on Apr 29 2013 15:17:28
> Repository revision: git://git.gluster.com/glusterfs.git
> Copyright (c) 2006-2011 Gluster Inc. 
> GlusterFS comes with ABSOLUTELY NO WARRANTY.
> You may redistribute copies of GlusterFS under the terms of the GNU 
> General Public License.
> 
> I have an odd problem when I run a piece of software. When I run it from 
> the local filesystem it works, but when I copy it to a GlusterFS share it 
> produces errors.
> 
> Both instances of the program share the same environment (they run on the 
> same instance of the operating system with the same user).
> 
> The only difference I noticed is in the syscall sequence. When it 
> works, I see this sequence:
> 
> write(1, "HRType:  esh_gain \n", 19)= 19
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> write(1, "TILDE\n", 6)  = 6
> stat64("/home/ngreneche/ubuntu1204/usr/local/ACE-ilProlog-1.2.20/linux/bin/tilde",
>  
> {st_mode=S_IFDIR|0750, st_size=4096, ...}) = 0
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> write(1, "Discretization busy...\n", 23) = 23
> 
> And when it doesn't work (running from a GlusterFS share), I see this 
> sequence:
> 
> write(1, "HRType:  esh_gain \n", 19)= 19
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> write(1, "TILDE\n", 6)  = 6
> stat64("/home/dist/db/ubuntu1204/usr/local/ACE-ilProlog-1.2.20/linux/bin/tilde",
>  
> {st_mode=S_IFDIR|0750, st_size=16384, ...}) = 0
> mkdir("/home/dist/db/ubuntu1204/usr/local/ACE-ilProlog-1.2.20/linux/bin/tilde",
>  
> 0755) = -1 EEXIST (File exists)
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
> ...}) = 0
> write(1, "An error occurred during the exe"..., 49) = 49
> 
> The only difference is that a mkdir is performed just after the stat, 
> whereas the stat should check whether the directory exists and only 
> trigger the mkdir if it does not.
> 
> My underlying filesystem on the brick is ext4.
> 
> Do you know of any issues with stat in this version of GlusterFS?
> 
> Regards,
> 


___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users


[Gluster-users] Gluster crashes

2014-05-20 Thread Jarsulic, Michael [BSD] - CRI
I have been having issues with Gluster on my scratch server for the past 
couple of weeks (mostly stability problems). Today, Gluster keeps crashing and 
will only stay up for a few seconds at a time. The NFS log file contains these 
messages:

READDIRP(40)) called at 2014-05-20 14:56:58.933291
[2014-05-20 15:01:51.568960] E [client3_1-fops.c:1937:client3_1_readdirp_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.568994] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) 
[0x7f193f7b0729] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) 
[0x7f193f7afeee] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f193f7afe5e]))) 0-hpcscratch-client-0: forced unwinding frame 
type(GlusterFS 3.1) op(READDIRP(40)) called at 2014-05-20 14:56:58.933348
[2014-05-20 15:01:51.569007] E [client3_1-fops.c:1937:client3_1_readdirp_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569043] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) 
[0x7f193f7b0729] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) 
[0x7f193f7afeee] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f193f7afe5e]))) 0-hpcscratch-client-0: forced unwinding frame 
type(GlusterFS 3.1) op(READDIRP(40)) called at 2014-05-20 14:56:58.933405
[2014-05-20 15:01:51.569057] E [client3_1-fops.c:1937:client3_1_readdirp_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569484] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) 
[0x7f193f7b0729] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) 
[0x7f193f7afeee] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f193f7afe5e]))) 0-hpcscratch-client-0: forced unwinding frame 
type(GlusterFS 3.1) op(READDIRP(40)) called at 2014-05-20 14:56:58.933484
[2014-05-20 15:01:51.569487] W [rpc-clnt.c:1417:rpc_clnt_submit] 
0-hpcscratch-client-0: failed to submit rpc-request (XID: 0x289032x Program: 
GlusterFS 3.1, ProgVers: 310, Proc: 20) to rpc-transport (hpcscratch-client-0)
[2014-05-20 15:01:51.569513] E [client3_1-fops.c:1937:client3_1_readdirp_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569538] E [client3_1-fops.c:2132:client3_1_opendir_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569570] W [glusterfsd.c:727:cleanup_and_exit] 
(-->/lib64/libc.so.6(clone+0x6d) [0x30e3ce5ccd] (-->/lib64/libpthread.so.0() 
[0x30e40077f1] 
(-->/opt/glusterfs/3.2.6/sbin/glusterfs(glusterfs_sigwaiter+0x17c) 
[0x40477c]))) 0-: received signum (15), shutting down
[2014-05-20 15:01:51.569585] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) 
[0x7f193f7b0729] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) 
[0x7f193f7afeee] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f193f7afe5e]))) 0-hpcscratch-client-0: forced unwinding frame 
type(GlusterFS 3.1) op(READDIRP(40)) called at 2014-05-20 14:56:58.933615
[2014-05-20 15:01:51.569631] E [client3_1-fops.c:1937:client3_1_readdirp_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569660] W [rpc-clnt.c:1417:rpc_clnt_submit] 
0-hpcscratch-client-0: failed to submit rpc-request (XID: 0x289033x Program: 
GlusterFS 3.1, ProgVers: 310, Proc: 20) to rpc-transport (hpcscratch-client-0)
[2014-05-20 15:01:51.569677] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) 
[0x7f193f7b0729] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) 
[0x7f193f7afeee] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f193f7afe5e]))) 0-hpcscratch-client-0: forced unwinding frame 
type(GlusterFS 3.1) op(READDIRP(40)) called at 2014-05-20 14:56:58.933694
[2014-05-20 15:01:51.569707] E [client3_1-fops.c:1937:client3_1_readdirp_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569688] E [client3_1-fops.c:2132:client3_1_opendir_cbk] 
0-hpcscratch-client-0: remote operation failed: Transport endpoint is not 
connected
[2014-05-20 15:01:51.569775] E [rpc-clnt.c:341:saved_frames_unwind] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_notify+0xb9) 
[0x7f193f7b0729] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x7e) 
[0x7f193f7afeee] 
(-->/opt/glusterfs/3.2.6/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) 
[0x7f193f7afe5e]))

[Gluster-users] single problematic node (brick)

2014-05-20 Thread Doug Schouten

Hello,

	I have a rather simple Gluster configuration that consists of 85TB 
distributed across six nodes. There is one particular node that seems to 
fail on a ~ weekly basis, and I can't figure out why.


I have attached my Gluster configuration and a recent log file from the 
problematic node. For a user, when the failure occurs, the symptom is 
that any attempt to access the Gluster volume from the problematic node 
fails with a "transport endpoint is not connected" error.


Restarting the Gluster daemons and remounting the volume on the failed 
node always fixes the problem. But by that point some jobs in our batch 
queue have usually already failed because of this issue, and it's 
becoming a headache.


It could be a FUSE issue, since I see many related error messages in the 
Gluster log, but I can't disentangle the various errors. The relevant 
line in my /etc/fstab file is


server:global /global glusterfs defaults,direct-io-mode=disable,log-level=WARNING,log-file=/var/log/gluster.log 0 0


Any ideas on the source of the problem? Could it be a hardware (network) 
glitch? The fact that it only happens on one node that is configured 
identically (with the same hardware) to the other nodes points to 
something like that.


thanks! Doug


gluster.log.gz
Description: application/gzip


gluster.cfg.gz
Description: application/gzip
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users

[Gluster-users] Troubles with syscall 'lstat'

2014-05-20 Thread Nicolas Greneche

Hello,

I am running GlusterFS version 3.3.1:


# glusterd -V
glusterfs 3.3.1 built on Apr 29 2013 15:17:28
Repository revision: git://git.gluster.com/glusterfs.git
Copyright (c) 2006-2011 Gluster Inc. 
GlusterFS comes with ABSOLUTELY NO WARRANTY.
You may redistribute copies of GlusterFS under the terms of the GNU 
General Public License.


I have an odd problem when I run a piece of software. When I run it from 
the local filesystem it works, but when I copy it to a GlusterFS share it 
produces errors.


Both instances of the program share the same environment (they run on the 
same instance of the operating system with the same user).


The only difference I noticed is in the syscall sequence. When it 
works, I see this sequence:


write(1, "HRType:  esh_gain \n", 19)= 19
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0

write(1, "TILDE\n", 6)  = 6
stat64("/home/ngreneche/ubuntu1204/usr/local/ACE-ilProlog-1.2.20/linux/bin/tilde", 
{st_mode=S_IFDIR|0750, st_size=4096, ...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0

write(1, "Discretization busy...\n", 23) = 23

And when it doesn't work (running from a GlusterFS share), I see this 
sequence:


write(1, "HRType:  esh_gain \n", 19)= 19
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0

write(1, "TILDE\n", 6)  = 6
stat64("/home/dist/db/ubuntu1204/usr/local/ACE-ilProlog-1.2.20/linux/bin/tilde", 
{st_mode=S_IFDIR|0750, st_size=16384, ...}) = 0
mkdir("/home/dist/db/ubuntu1204/usr/local/ACE-ilProlog-1.2.20/linux/bin/tilde", 
0755) = -1 EEXIST (File exists)
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo 
...}) = 0

write(1, "An error occurred during the exe"..., 49) = 49

The only difference is that a mkdir is performed just after the stat, 
whereas the stat should check whether the directory exists and only 
trigger the mkdir if it does not.
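
For reference, a pattern that would not depend on the stat result at all is
to call mkdir() unconditionally and treat EEXIST as success. A minimal
sketch in C (the helper name ensure_dir and the example path are
hypothetical, not taken from the program above):

#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Create path if it is missing; succeed quietly if it already exists. */
int ensure_dir(const char *path, mode_t mode)
{
    if (mkdir(path, mode) == 0)
        return 0;                         /* created it */
    if (errno == EEXIST) {
        struct stat st;
        /* Already present: confirm it really is a directory. */
        if (stat(path, &st) == 0 && S_ISDIR(st.st_mode))
            return 0;
    }
    return -1;                            /* genuine failure */
}

int main(void)
{
    /* Example path only: replace with the directory the program needs. */
    return ensure_dir("/tmp/tilde", 0755) == 0 ? 0 : 1;
}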


My underlying filesystem on the brick is ext4.

Do you know of any issues with stat in this version of GlusterFS?

Regards,

--
Nicolas Grenèche

URL : http://blog.etcshadow.fr
Tel : 01 49 40 40 35
Fax : 01 48 22 81 50
___
Gluster-users mailing list
Gluster-users@gluster.org
http://supercolony.gluster.org/mailman/listinfo/gluster-users