Re: [Gluster-users] : On breaking the connection between replicated volumes certain files return -ENOTCONN

2014-02-06 Thread Anirban Ghoshal

We migrated to the stable version 3.4.2 and confirmed that the error occurs 
there as well. I have reported this as bug 1062287.

Thanks again,
Anirban



--
On Tue 4 Feb, 2014 2:27 PM MST Anirban Ghoshal wrote:

>Hi everyone,
>
>Here's a strange issue. I am using glusterfs 3.4.0 alpha. We need to move to a 
>stable version ASAP, but I am telling you this on the off chance that it might 
>be interesting for somebody from the glusterfs development team. Please excuse 
>the sheer length of this mail; I am new to browsing such a massive code base, 
>and not good at presenting my ideas very clearly.
>
>
>Here's a set of observations:
>
>1. You have a replica 2 volume (testvol) on server1 and server2. Assume that 
>on each server it is also mounted locally via mount.glusterfs at /testvol.
>2. You have a large number of soft-linked files within the volume.
>3. You check heal info (all its facets) to ensure not a single file is out of 
>sync (also, verify md5sum or such, if possible).
>4. You abruptly take down the ethernet device over which the servers are 
>connected (ip link set  down).
>5. On one of the servers (say, server1 for definiteness), if you do an 'ls -l', 
>readlink returns 'Transport endpoint is not connected'.
>6. The error resolves all by itself once you bring the eth-link back up.
>
>Here's some additional detail:
>7. The error is intermittent, and not all soft-linked files have the issue.
>8. If you take a directory containing soft-linked files, and if you do a ls -l 
>_on_the_directory, like so,
>
>server1$ ls -l /testvol/somedir/bin/
>
>ls: cannot read symbolic link /testvol/somedir/bin/reset: Transport endpoint 
>is not connected
>ls: cannot read symbolic link /testvol/somedir/bin/bzless: Transport endpoint 
>is not connected
>ls: cannot read symbolic link /testvol/somedir/bin/i386: Transport endpoint is 
>not connected
>ls: cannot read symbolic link /testvol/somedir/bin/kill: Transport endpoint is 
>not connected
>ls: cannot read symbolic link /testvol/somedir/bin/linux32: Transport endpoint 
>is not connected
>ls: cannot read symbolic link /testvol/somedir/bin/linux64: Transport endpoint 
>is not connected
>ls: cannot read symbolic link /testvol/somedir/bin/logger: Transport endpoint 
>is not connected
>ls: cannot read symbolic link /testvol/somedir/bin/x86_64: Transport endpoint 
>is not connected
>ls: cannot read symbolic link /testvol/somedir/bin/python2: Transport endpoint 
>is not connected
>
>
>9. If, however, you take a faulty soft-link and do an ls -l on it directly, 
>then it rights itself immediately.
>
>server1$ ls -l /testvol/somedir/bin/x86_64
>lrwxrwxrwx 1 root root 7 May  7 23:11 /testvol/somedir/bin/x86_64 -> setarch
>
>
>I tried raising the client log level to 'trace'. Here's what I saw:
>
>Upon READLINK failures, (ls -l /testvol/somedir/bin/):
>
>[2010-05-09 01:13:28.140265] T [fuse-bridge.c:2453:fuse_readdir_cbk] 
>0-glusterfs-fuse: 2783484: READDIR => 23/4096,1380
>[2010-05-09 01:13:28.140444] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
>0-fuse: return value inode_path 45
>[2010-05-09 01:13:28.140477] T [fuse-bridge.c:708:fuse_getattr_resume] 
>0-glusterfs-fuse: 2783485: GETATTR 140299577689176 (/testvol/somedir/bin)
>[2010-05-09 01:13:28.140618] T [fuse-bridge.c:641:fuse_attr_cbk] 
>0-glusterfs-fuse: 2783485: STAT() /testvol/somedir/bin => -5626802993936595428
>[2010-05-09 01:13:28.140722] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
>0-fuse: return value inode_path 52
>[2010-05-09 01:13:28.140737] T [fuse-bridge.c:506:fuse_lookup_resume] 
>0-glusterfs-fuse: 2783486: LOOKUP 
>/testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
>[2010-05-09 01:13:28.140851] T [fuse-bridge.c:376:fuse_entry_cbk] 
>0-glusterfs-fuse: 2783486: LOOKUP() /testvol/somedir/bin/x86_64 => 
>-4857810743645185021
>[2010-05-09 01:13:28.140954] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
>0-fuse: return value inode_path 52
>[2010-05-09 01:13:28.140975] T [fuse-bridge.c:1296:fuse_readlink_resume] 
>0-glusterfs-fuse: 2783487 READLINK 
>/testvol/somedir/bin/x86_64/025d1c57-865f-4f1f-bc95-96ddcef3dc03 
>[2010-05-09 01:13:28.141090] D [afr-common.c:760:afr_get_call_child] 
>0-_testvol-replicate-0: Returning -107, call_child: -1, last_index: -1
>[2010-05-09 01:13:28.141120] W [fuse-bridge.c:1271:fuse_readlink_cbk] 
>0-glusterfs-fuse: 2783487: /testvol/somedir/bin/x86_64 => -1 (Transport 
>endpoint is not connected)
>
>Upon successful readlink (ls -l /testvol/somedir/bin/x86_64):
>
>[2010-05-09 01:13:37.717904] T [fuse-bridge.c:376:fuse_entry_cbk] 
>0-glusterfs-fuse: 2790073: LOOKUP() /testvol/somedir/bin => 
>-5626802993936595428
>[2010-05-09 01:13:37.718070] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
>0-fuse: return value inode_path 52
>[2010-05-09 01:13:37.718127] T [fuse-bridge.c:506:fuse_lookup_resume] 
>0-glusterfs-fuse: 2790074: LOOKUP 
>/testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
>[2010-05-09 01:13:37.718306] D [afr-com

[Gluster-users] On breaking the connection between replicated volumes certain files return -ENOTCONN

2014-02-04 Thread Anirban Ghoshal
Hi everyone,

Here's a strange issue. I am using glusterfs 3.4.0 alpha. We need to move to a 
stable version ASAP, but I am telling you this on the off chance that it might 
be interesting for somebody from the glusterfs development team. Please excuse 
the sheer length of this mail; I am new to browsing such a massive code base, 
and not good at presenting my ideas very clearly.


Here's a set of observations:

1. You have a replica 2 volume (testvol) on server1 and server2. Assume that 
on each server it is also mounted locally via mount.glusterfs at /testvol.
2. You have a large number of soft-linked files within the volume.
3. You check heal info (all its facets) to ensure not a single file is out of 
sync (also, verify md5sum or such, if possible).
4. You abruptly take down the ethernet device over which the servers are 
connected (ip link set  down).
5. On one of the servers (say, server1 for definiteness), if you do an 'ls -l', 
readlink returns 'Transport endpoint is not connected'.
6. The error resolves all by itself once you bring the eth-link back up.
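
For the checksum verification mentioned in step 3, something along the lines 
of the sketch below could be used. It is only an illustration: the function 
names tree_digests and out_of_sync are made up for this example, and it assumes 
both trees being compared are readable from one machine (for instance, copies 
of the two bricks), which the setup above does not guarantee. Soft links are 
hashed by their target string rather than followed, since they are the files 
of interest here.

```python
import hashlib
import os


def tree_digests(root):
    """Map each relative path under root to an md5 hex digest.

    Symlinks are hashed by their target string and are not followed.
    Plain directories themselves are skipped; only their entries count.
    """
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Symlinks that point at directories show up in dirnames.
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if os.path.islink(full):
                target = os.readlink(full).encode()
                digests[rel] = hashlib.md5(target).hexdigest()
            elif os.path.isfile(full):
                md5 = hashlib.md5()
                with open(full, "rb") as fh:
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        md5.update(chunk)
                digests[rel] = md5.hexdigest()
    return digests


def out_of_sync(root_a, root_b):
    """Return sorted relative paths whose digests differ or exist on one side only."""
    a, b = tree_digests(root_a), tree_digests(root_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty list from out_of_sync(path_a, path_b) would mean the two trees agree. 
If this is pointed at brick directories directly, the internal .glusterfs 
directory would be walked as well, so it is probably better filtered out or 
avoided.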

Here's some additional detail:
7. The error is intermittent, and not all soft-linked files have the issue.
8. If you take a directory containing soft-linked files, and if you do a ls -l 
_on_the_directory, like so,

server1$ ls -l /testvol/somedir/bin/

ls: cannot read symbolic link /testvol/somedir/bin/reset: Transport endpoint is 
not connected
ls: cannot read symbolic link /testvol/somedir/bin/bzless: Transport endpoint 
is not connected
ls: cannot read symbolic link /testvol/somedir/bin/i386: Transport endpoint is 
not connected
ls: cannot read symbolic link /testvol/somedir/bin/kill: Transport endpoint is 
not connected
ls: cannot read symbolic link /testvol/somedir/bin/linux32: Transport endpoint 
is not connected
ls: cannot read symbolic link /testvol/somedir/bin/linux64: Transport endpoint 
is not connected
ls: cannot read symbolic link /testvol/somedir/bin/logger: Transport endpoint 
is not connected
ls: cannot read symbolic link /testvol/somedir/bin/x86_64: Transport endpoint 
is not connected
ls: cannot read symbolic link /testvol/somedir/bin/python2: Transport endpoint 
is not connected


9. If, however, you take a faulty soft-link and do an ls -l on it directly, 
then it rights itself immediately.

server1$ ls -l /testvol/somedir/bin/x86_64
lrwxrwxrwx 1 root root 7 May  7 23:11 /testvol/somedir/bin/x86_64 -> setarch
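
To put numbers on observations 7 to 9, a small script can mimic what 'ls -l' 
does on a directory (list the entries, lstat each one, then readlink the soft 
links) and record which links fail and with which errno. The following is a 
minimal sketch in Python; the function name failing_symlinks is hypothetical, 
and the path in the usage note is simply the one from the listing above.

```python
import errno
import os


def failing_symlinks(directory):
    """Return {name: errno symbol} for symlinks in directory that readlink() cannot read."""
    failures = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            # islink() does an lstat(); readlink() is the call that fails above.
            if os.path.islink(path):
                os.readlink(path)
        except OSError as exc:
            failures[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    return failures
```

Calling failing_symlinks("/testvol/somedir/bin") while the link is down, and 
again a little later, would show whether the set of affected links changes 
between runs and whether every failure is ENOTCONN.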


I tried raising the client log level to 'trace'. Here's what I saw:

Upon READLINK failures, (ls -l /testvol/somedir/bin/):

[2010-05-09 01:13:28.140265] T [fuse-bridge.c:2453:fuse_readdir_cbk] 
0-glusterfs-fuse: 2783484: READDIR => 23/4096,1380
[2010-05-09 01:13:28.140444] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
0-fuse: return value inode_path 45
[2010-05-09 01:13:28.140477] T [fuse-bridge.c:708:fuse_getattr_resume] 
0-glusterfs-fuse: 2783485: GETATTR 140299577689176 (/testvol/somedir/bin)
[2010-05-09 01:13:28.140618] T [fuse-bridge.c:641:fuse_attr_cbk] 
0-glusterfs-fuse: 2783485: STAT() /testvol/somedir/bin => -5626802993936595428
[2010-05-09 01:13:28.140722] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
0-fuse: return value inode_path 52
[2010-05-09 01:13:28.140737] T [fuse-bridge.c:506:fuse_lookup_resume] 
0-glusterfs-fuse: 2783486: LOOKUP 
/testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
[2010-05-09 01:13:28.140851] T [fuse-bridge.c:376:fuse_entry_cbk] 
0-glusterfs-fuse: 2783486: LOOKUP() /testvol/somedir/bin/x86_64 => 
-4857810743645185021
[2010-05-09 01:13:28.140954] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
0-fuse: return value inode_path 52
[2010-05-09 01:13:28.140975] T [fuse-bridge.c:1296:fuse_readlink_resume] 
0-glusterfs-fuse: 2783487 READLINK 
/testvol/somedir/bin/x86_64/025d1c57-865f-4f1f-bc95-96ddcef3dc03 
[2010-05-09 01:13:28.141090] D [afr-common.c:760:afr_get_call_child] 
0-_testvol-replicate-0: Returning -107, call_child: -1, last_index: -1
[2010-05-09 01:13:28.141120] W [fuse-bridge.c:1271:fuse_readlink_cbk] 
0-glusterfs-fuse: 2783487: /testvol/somedir/bin/x86_64 => -1 (Transport 
endpoint is not connected)
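
The -107 returned by afr_get_call_child in the trace above is a negated errno, 
and on Linux 107 is ENOTCONN, which matches the message that fuse_readlink_cbk 
prints. The sketch below shows one way to pull that value out of a log line 
(with the mail's line wrapping undone) and decode it. The helper names and the 
regular expression are assumptions made for this example and are not part of 
glusterfs; errno numbers are platform-specific, so the decoding is only 
meaningful on the same kind of system that wrote the log.

```python
import errno
import os
import re

# Matches the debug message format seen in the trace above.
RETURN_RE = re.compile(r"Returning (-?\d+), call_child: (-?\d+)")


def describe_errno(value):
    """Turn a (possibly negated) errno into 'NAME (message)'."""
    code = abs(value)
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


def decode_call_child_line(line):
    """Return (errno description, call_child) for a matching log line, else None."""
    match = RETURN_RE.search(line)
    if match is None:
        return None
    return describe_errno(int(match.group(1))), int(match.group(2))
```

Applied to the afr_get_call_child line in the failing trace, this gives the 
symbolic error name alongside call_child: -1, which suggests that no child 
was selected for the READLINK at all.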

Upon successful readlink (ls -l /testvol/somedir/bin/x86_64):

[2010-05-09 01:13:37.717904] T [fuse-bridge.c:376:fuse_entry_cbk] 
0-glusterfs-fuse: 2790073: LOOKUP() /testvol/somedir/bin => -5626802993936595428
[2010-05-09 01:13:37.718070] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 
0-fuse: return value inode_path 52
[2010-05-09 01:13:37.718127] T [fuse-bridge.c:506:fuse_lookup_resume] 
0-glusterfs-fuse: 2790074: LOOKUP 
/testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
[2010-05-09 01:13:37.718306] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 
0-_testvol-replicate-0: /testvol/somedir/bin/x86_64: failed to get the gfid 
from dict
[2010-05-09 01:13:37.718355] T [rpc-clnt.c:1301:rpc_clnt_record] 
0-_testvol-client-1: Auth Info: pid: 3343, uid: 0, gid: 0, owner: 

[2010-05-09 01:13:37.718383] T [rpc-clnt.c:1181:rpc_clnt_recor