Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
Ok, no problem. The issue is very rare, even with our setup - we have seen it only once on one site, even though we have been in production for several months now. For now, we can live with that, IMO. And, thanks again.

Anirban
Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
It is possible, yes, because these are actually a kind of log file. I suppose that, like in other logging frameworks, these files can remain open for a considerable period and then get renamed to support log-rotate semantics. That said, I might need to check with the team that actually manages the logging framework to be sure; I only take care of the file-system stuff. I can tell you for sure on Monday. If it is the same race that you mention, is there a fix for it?

Thanks,
Anirban
Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
I see. Thanks a tonne for the thorough explanation! :) I can see that our setup would be vulnerable here, because the logger on one server is not generally aware of the state of the replica on the other server. So it is possible that the log files were renamed before heal had a chance to kick in. Could I also request the bug ID (should there be one) against which you are coding up the fix, so that we get a notification once it is resolved? Also, as an aside, is O_DIRECT supposed to prevent this from occurring, if one were to make allowance for the performance hit?

Thanks again,
Anirban
Re: [Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
Hi,

Yes, they do, and considerably. I'd forgotten to mention that in my last email. Their mtimes, however, as far as I could tell on the separate servers, seemed to coincide.

Thanks,
Anirban
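A quick way to compare mtimes across replicas is to stat each server's brick copy of the file from the original report; the brick path below is an assumption for illustration, not something given in the thread:

  # run on each server against its local brick copy
  stat -c '%y %n' /brick/testvol/SECLOG/20140908.d/SECLOG_00427425_.log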
Re: [Gluster-users] NFS not start on localhost
Maybe share the last 15-20 lines of your /var/log/glusterfs/nfs.log for the consideration of everyone on the list? Thanks.
Re: [Gluster-users] NFS not start on localhost
It happens to me sometimes. Try `tail -n 20 /var/log/glusterfs/nfs.log`; you will probably find something there that will help your cause. In general, if you just wish to start the thing up without going into the why of it, try `gluster volume set engine nfs.disable on` followed by `gluster volume set engine nfs.disable off`. It does the trick quite often for me, because it is a polite way to ask mgmt/glusterd to try and respawn the NFS server process if need be. But keep in mind that this will cause a (albeit small) service interruption for all clients accessing volume engine over NFS.

Thanks,
Anirban

On Saturday, 18 October 2014 1:03 AM, Demeter Tibor tdeme...@itsmart.hu wrote:

Hi,

I have set up glusterfs with NFS support. I don't know why, but after a reboot NFS does not listen on localhost, only on gs01.

[root@node0 ~]# gluster volume info engine

Volume Name: engine
Type: Replicate
Volume ID: 2ea009bf-c740-492e-956d-e1bca76a0bd3
Status: Started
Number of Bricks: 1 x 2 = 2
Transport-type: tcp
Bricks:
Brick1: gs00.itsmart.cloud:/gluster/engine0
Brick2: gs01.itsmart.cloud:/gluster/engine1
Options Reconfigured:
storage.owner-uid: 36
storage.owner-gid: 36
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
performance.stat-prefetch: off
cluster.eager-lock: enable
network.remote-dio: enable
cluster.quorum-type: auto
cluster.server-quorum-type: server
auth.allow: *
nfs.disable: off

[root@node0 ~]# gluster volume status engine
Status of volume: engine
Gluster process                                 Port    Online  Pid
-------------------------------------------------------------------
Brick gs00.itsmart.cloud:/gluster/engine0       50158   Y       3250
Brick gs01.itsmart.cloud:/gluster/engine1       50158   Y       5518
NFS Server on localhost                         N/A     N       N/A
Self-heal Daemon on localhost                   N/A     Y       3261
NFS Server on gs01.itsmart.cloud                2049    Y       5216
Self-heal Daemon on gs01.itsmart.cloud          N/A     Y       5223

Can anybody help me? Thanks in advance.

Tibor
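A minimal sketch of the respawn-and-verify sequence suggested above; the rpcinfo/showmount checks are an addition for verification, not from the original mail:

  # toggle nfs.disable to nudge glusterd into respawning the gluster NFS server
  gluster volume set engine nfs.disable on
  gluster volume set engine nfs.disable off

  # verify the NFS service is registered and exporting on localhost
  rpcinfo -p localhost      # look for 'nfs' on port 2049
  showmount -e localhost    # should list /engine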
[Gluster-users] Split-brain seen with [0 0] pending matrix and io-cache page errors
Hi everyone,

I have this really confusing split-brain here that's bothering me. I am running glusterfs 3.4.2 over linux 2.6.34. I have a replica 2 volume 'testvol'. It seems I cannot read/stat/edit the file in question, and `gluster volume heal testvol info split-brain` shows nothing. Here are the logs from the fuse-mount for the volume:

[2014-09-29 07:53:02.867111] W [fuse-bridge.c:1172:fuse_err_cbk] 0-glusterfs-fuse: 4560969: FLUSH() ERR => -1 (Input/output error)
[2014-09-29 07:54:16.007799] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8529d20 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.007854] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561103: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008018] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8607ee0 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.008056] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561104: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008233] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c8066f30 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.008269] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561105: READ => -1 (Input/output error)
[2014-09-29 07:54:16.008800] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c860bcf0 & waitq = 0x7fd5c863b1f0
[2014-09-29 07:54:16.008839] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561107: READ => -1 (Input/output error)
[2014-09-29 07:54:16.009365] W [page.c:991:__ioc_page_error] 0-testvol-io-cache: page error for page = 0x7fd5c85fd120 & waitq = 0x7fd5c8067d40
[2014-09-29 07:54:16.009413] W [fuse-bridge.c:2089:fuse_readv_cbk] 0-glusterfs-fuse: 4561109: READ => -1 (Input/output error)
[2014-09-29 07:54:16.040549] W [afr-open.c:213:afr_open] 0-testvol-replicate-0: failed to open as split brain seen, returning EIO
[2014-09-29 07:54:16.040594] W [fuse-bridge.c:915:fuse_fd_cbk] 0-glusterfs-fuse: 4561142: OPEN() /SECLOG/20140908.d/SECLOG_00427425_.log => -1 (Input/output error)

Could somebody please give me some clue on where to begin? I checked the xattrs on /SECLOG/20140908.d/SECLOG_00427425_.log, and it seems the changelogs are [0, 0] on both replicas, and the gfids match. Thank you very much for any help on this.

Anirban
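For reference, the xattr check described above can be done on each brick with getfattr; the brick path here is an assumption for illustration:

  # on each server, against the brick copy of the file (run as root)
  getfattr -d -m . -e hex /brick/testvol/SECLOG/20140908.d/SECLOG_00427425_.log
  # with [0 0] pending changelogs, trusted.afr.testvol-client-0 and
  # trusted.afr.testvol-client-1 should both read 0x000000000000000000000000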
[Gluster-users] Need help: Mount -t glusterfs hangs
Hello,

I am facing an intermittent issue with the mount.glusterfs program freezing up. Investigation reveals that it is not exactly the mount system call, but the 'stat' call from within mount.glusterfs, that is actually hung. Other commands such as df or ls also hang. I opened a Red Hat Bugzilla bug against this; it has more details on the precise observations and the steps followed. What intrigues me is that as soon as I set a volume option, such as one for diagnostics or maybe the statedump path, the problem goes away. Any advice would be greatly appreciated, as this is causing problems with some server-replacement procedures I am trying out.

Thanks a lot,
Anirban
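A minimal sketch of the workaround described above, assuming a volume named testvol (the volume name is not given in the thread); both options shown are standard gluster volume options:

  # setting (almost) any volume option appears to clear the hang, e.g.:
  gluster volume set testvol diagnostics.client-log-level INFO
  # or:
  gluster volume set testvol server.statedump-path /var/run/gluster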
Re: [Gluster-users] Need help: Mount -t glusterfs hangs
Yes, very sorry. Actually, I kind of missed adding the bug ID, which made it all look so incomplete. Fact is, I am totally caught up in work today, so am more absent-minded than usual. Thanks for pointing that out! As to the other details:

Bug #1141940
glusterfs 3.4.2
Linux 2.6.34

Also, I was kind of hoping that the fact that setting a volume option clears the hang might ring a bell with someone who knows precisely what these options do.

Thanks again,
Anirban
[Gluster-users] Enforce direct-io for volume rather than mount
Suppose I have a replica 2 setup using XFS for my volume, and that I also export my files over NFS, and write access is expected. Now, if I wish to avoid split-brains at all cost, even during server crashes and such, I am given to understand that direct-io mode would help. I know that there is a direct-io option with my mount.glusterfs program, but I wish for a direct-io enforcement for my entire volume, so as to enforce uncached writes to the underlying XFS. Any ideas on how I could achieve that? Thanks for your replies.

Anirban
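For comparison, the client-side option mentioned above, plus the volume-level knobs that come closest to volume-wide enforcement; the server name is made up, and whether each volume option is honored on 3.4.x is an assumption worth verifying:

  # per-mount (client-side) enforcement:
  mount -t glusterfs -o direct-io-mode=enable server1:/testvol /testvol

  # volume-wide: disable the caching translators and ask write-behind
  # to honor O_DIRECT strictly (option availability varies by release)
  gluster volume set testvol performance.quick-read off
  gluster volume set testvol performance.io-cache off
  gluster volume set testvol performance.read-ahead off
  gluster volume set testvol performance.strict-o-direct on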
Re: [Gluster-users] : On breaking the connection between replicated volumes certain files return -ENOTCONN
We migrated to stable version 3.4.2 and confirmed that the error occurs with that as well. I reported this over bug 1062287.

Thanks again,
Anirban
[Gluster-users] On breaking the connection between replicated volumes certain files return -ENOTCONN
Hi everyone,

Here's a strange issue. I am using glusterfs 3.4.0 alpha. We need to move to a stable version ASAP, but I am telling you this just on the off chance that it might be interesting for somebody from the glusterfs development team. Please excuse the sheer length of this mail, but I am new to browsing such massive code, and not good at presenting my ideas very clearly. Here's a set of observations:

1. You have a replica 2 volume (testvol) on server1 and server2. Assume that on either server it is also locally mounted via mount.glusterfs at /testvol.
2. You have a large number of soft-linked files within the volume.
3. You check heal info (all its facets) to ensure not a single file is out of sync (also verify md5sums or such, if possible).
4. You abruptly take down the ethernet device over which the servers are connected (ip link set eth-dev down).
5. On one of the servers (say, server1 for definiteness), if you do an 'ls -l', readlink returns 'Transport endpoint is not connected'.
6. The error resolves all by itself if you bring the eth link back up.

Here's some additional detail:

7. The error is intermittent, and not all soft-linked files have the issue.
8. If you take a directory containing soft-linked files, and you do an ls -l _on_the_directory_, like so:

server1$ ls -l /testvol/somedir/bin/
ls: cannot read symbolic link /testvol/somedir/bin/reset: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/bzless: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/i386: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/kill: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/linux32: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/linux64: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/logger: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/x86_64: Transport endpoint is not connected
ls: cannot read symbolic link /testvol/somedir/bin/python2: Transport endpoint is not connected

9. If, however, you take a faulty soft-link and do an ls -l on it directly, then it rights itself immediately:

server1$ ls -l /testvol/somedir/bin/x86_64
lrwxrwxrwx 1 root root 7 May 7 23:11 /testvol/somedir/bin/x86_64 -> setarch

I tried raising the client log level to 'trace'. Here's what I saw.

Upon READLINK failures (ls -l /testvol/somedir/bin/):

[2010-05-09 01:13:28.140265] T [fuse-bridge.c:2453:fuse_readdir_cbk] 0-glusterfs-fuse: 2783484: READDIR => 23/4096,1380
[2010-05-09 01:13:28.140444] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 45
[2010-05-09 01:13:28.140477] T [fuse-bridge.c:708:fuse_getattr_resume] 0-glusterfs-fuse: 2783485: GETATTR 140299577689176 (/testvol/somedir/bin)
[2010-05-09 01:13:28.140618] T [fuse-bridge.c:641:fuse_attr_cbk] 0-glusterfs-fuse: 2783485: STAT() /testvol/somedir/bin => -5626802993936595428
[2010-05-09 01:13:28.140722] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:28.140737] T [fuse-bridge.c:506:fuse_lookup_resume] 0-glusterfs-fuse: 2783486: LOOKUP /testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
[2010-05-09 01:13:28.140851] T [fuse-bridge.c:376:fuse_entry_cbk] 0-glusterfs-fuse: 2783486: LOOKUP() /testvol/somedir/bin/x86_64 => -4857810743645185021
[2010-05-09 01:13:28.140954] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:28.140975] T [fuse-bridge.c:1296:fuse_readlink_resume] 0-glusterfs-fuse: 2783487 READLINK /testvol/somedir/bin/x86_64/025d1c57-865f-4f1f-bc95-96ddcef3dc03
[2010-05-09 01:13:28.141090] D [afr-common.c:760:afr_get_call_child] 0-_testvol-replicate-0: Returning -107, call_child: -1, last_index: -1
[2010-05-09 01:13:28.141120] W [fuse-bridge.c:1271:fuse_readlink_cbk] 0-glusterfs-fuse: 2783487: /testvol/somedir/bin/x86_64 => -1 (Transport endpoint is not connected)

Upon successful readlink (ls -l /testvol/somedir/bin/x86_64):

[2010-05-09 01:13:37.717904] T [fuse-bridge.c:376:fuse_entry_cbk] 0-glusterfs-fuse: 2790073: LOOKUP() /testvol/somedir/bin => -5626802993936595428
[2010-05-09 01:13:37.718070] T [fuse-resolve.c:51:fuse_resolve_loc_touchup] 0-fuse: return value inode_path 52
[2010-05-09 01:13:37.718127] T [fuse-bridge.c:506:fuse_lookup_resume] 0-glusterfs-fuse: 2790074: LOOKUP /testvol/somedir/bin/x86_64(025d1c57-865f-4f1f-bc95-96ddcef3dc03)
[2010-05-09 01:13:37.718306] D [afr-common.c:131:afr_lookup_xattr_req_prepare] 0-_testvol-replicate-0: /testvol/somedir/bin/x86_64: failed to get the gfid from dict
[2010-05-09 01:13:37.718355] T [rpc-clnt.c:1301:rpc_clnt_record] 0-_testvol-client-1: Auth Info: pid: 3343, uid: 0, gid: 0, owner:
[2010-05-09 01:13:37.718383] T
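Not from the original thread, but given observation 9 above, a blunt stopgap until the link comes back up would be to force a direct lookup on every symlink under the mount; a sketch, assuming the volume is mounted at /testvol:

  # stat each symlink directly; per observation 9 this clears the
  # per-link 'Transport endpoint is not connected' state
  find /testvol -type l -print0 | xargs -0 -n1 ls -l > /dev/null 2>&1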
Re: [Gluster-users] File (setuid) permission changes during volume heal - possible bug?
Hi Ravi,

Many thanks for the super-quick turnaround on this! Didn't know about this one quirk of chown, so thanks for that as well.

Anirban

On Thursday, 30 January 2014 9:22 AM, Ravishankar N ravishan...@redhat.com wrote:

Hi Anirban,
Thanks for taking the time off to file the Bugzilla bug report. The fix has been sent for review upstream (http://review.gluster.org/#/c/6862/). Once it is merged, I will backport it to 3.4 as well.
Regards,
Ravi

On 01/28/2014 02:07 AM, Chalcogen wrote:

Hi,

I am working on a twin-replicated setup (server1 and server2) with glusterfs 3.4.0. I perform the following steps:

1. Create a distributed volume 'testvol' with the XFS brick server1:/brick/testvol on server1, and mount it using the glusterfs native client at /testvol.

2. I copy the following file to /testvol:

server1:~$ ls -l /bin/su
-rwsr-xr-x 1 root root 84742 Jan 17 2014 /bin/su
server1:~$ cp -a /bin/su /testvol

3. Within /testvol, if I list out the file I just copied, I find its attributes intact.

4. Now, I add the XFS brick server2:/brick/testvol:

server2:~$ gluster volume add-brick testvol replica 2 server2:/brick/testvol

At this point, heal kicks in and the file is replicated on server2.

5. If I list out su in testvol on either server now, this is what I see:

server1:~$ ls -l /testvol/su
-rwsr-xr-x 1 root root 84742 Jan 17 2014 /testvol/su
server2:~$ ls -l /testvol/su
-rwxr-xr-x 1 root root 84742 Jan 17 2014 /testvol/su

That is, the 's' file mode gets changed to plain 'x'; meaning, not all the attributes are preserved upon heal completion. Would you consider this a bug? Is the behavior different on a higher release?

Thanks a lot.
Anirban
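The chown quirk referred to above is kernel behavior, not gluster-specific: since Linux 2.2.13, chown(2) clears the set-user-ID and set-group-ID bits of an executable even when invoked by root, which is presumably what bites the heal path here. A quick local demonstration, with an illustrative /tmp path:

  # copy a setuid binary and watch chown strip the bit
  cp -a /bin/su /tmp/su
  ls -l /tmp/su            # -rwsr-xr-x ... setuid bit present
  chown root:root /tmp/su
  ls -l /tmp/su            # -rwxr-xr-x ... setuid bit cleared by chown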
[Gluster-users] Mounting soft-linked paths over nfs
Dear gluster users,

I am facing a tiny bit of an issue with the glusterfs NFS server. Suppose I export a volume testvol, and within testvol I have a path, say, dir1/dir2. If dir1 and dir2 are actual directories, then one can simply mount testvol/dir1/dir2 over NFS. However, if either dir1 or dir2 is a soft-link, then mount.nfs returns -EINVAL. Would you say that this is normal behavior with this NFS server? Also, I am using the 3.4.0 release. Would it help if I upgrade?

Thanks a lot,
Anirban
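To make the failure mode concrete, a sketch of the two cases described; the hostname, paths, and the mount-the-root workaround are assumptions, not from the original mail (gluster NFS speaks NFSv3, hence vers=3):

  # works: every component of the subdirectory export is a real directory
  mount -t nfs -o vers=3 server1:/testvol/dir1/dir2 /mnt

  # fails with EINVAL when dir1 (or dir2) is a symlink inside the volume;
  # mounting the volume root and traversing the link client-side avoids it
  mount -t nfs -o vers=3 server1:/testvol /mnt
  ls /mnt/dir1/dir2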
Re: [Gluster-users] Passing noforget option to glusterfs native client mounts
Hi, and thanks a lot, Anand! I was initially searching for a good answer to why the glusterfs site lists knfsd as NOT compatible with glusterfs. So, now I know. :)

Funnily enough, we didn't have a problem with the failover during our testing. We passed constant fsid's (fsid=xxx) while exporting our mounts, and NFS mounts on client applications haven't called any of the file handles out stale while migrating the NFS service from one server to the other. Not sure why this happens. Do nodeids and generation numbers remain invariant across storage servers in glusterfs-3.4.0?

We, for our part, have a pretty small amount of data in our filesystem (that is, compared with the petabyte-sized volumes glusterfs commonly manages). Our total volume size would be somewhere around 4 GB, and some 50,000 files is all they contain. Each server has around 16 GB of RAM, so space is not at a premium for this project. However, saying that, if the glusterfs NFS server does maintain identical file handles across all its servers and does not alter file handles upon failover, then in the long run it might be prudent to switch to glusterfs NFS as the cleaner solution.

Thanks again!
Anirban

On Tuesday, 24 December 2013 1:58 PM, Anand Avati av...@gluster.org wrote:

Hi,
Allowing the noforget option to FUSE will not help for your cause. Gluster presents the address of the inode_t as the nodeid to FUSE. In turn, FUSE creates a filehandle using this nodeid for knfsd to export to the NFS client. When knfsd fails over to another server, FUSE will decode the handle encoded by the other NFS server and try to use the nodeid of the other server - which will obviously not work, as the virtual address of the glusterfs process on the other server is not valid here. Short version: the file handle generated through FUSE is not durable.

The noforget option in FUSE is a hack to avoid ESTALE messages because of dcache pruning. If you have enough inodes in your volume, your system will go OOM at some point. noforget is NOT a solution for providing NFS failover to a different server.

For reasons such as these, we ended up implementing our own NFS server, where we encode a filehandle using the GFID (which is durable across reboots and server failovers). I would strongly recommend NOT using knfsd with any FUSE-based filesystem (not just glusterfs) for serious production use, and it will just not work if you are designing for NFS high availability/fail-over.

Thanks,
Avati
Re: [Gluster-users] Passing noforget option to glusterfs native client mounts
Thanks, Harshavardhana and Anand, for the tips! I checked out parts of the linux-2.6.34 (the one we are using down here) knfsd/fuse code. I understood (hopefully rightly) that when we export a fuse directory over NFS and specify an fsid, the handle is constructed somewhat like this:

fh_size (4 bytes) -> fh_version and stuff (4 bytes) -> fsid, from export parms (4 bytes) -> nodeid (8 bytes) -> generation number (4 bytes) -> parent nodeid (8 bytes) -> parent generation (4 bytes)

So, since Anand mentions that nodeids for glusterfs are just the inode_t addresses on the servers, I can now relate to the fact that the file handles might not survive failovers in each and every case, even with the fsid constant. That's why I was so confused: I never faced an issue with stale file handles during failover yet! Maybe it is something to do with the order in which files were created on the replica server following heal commencement (our data is quite static, btw) - like, if you malloc identical things on two identical platforms by running the same executable on each, you get allocations at the exact same virtual addresses.

However, now that I understand at least in part how this works, glusterfs NFS does seem a lot cleaner. Will also try out Ganesha. Thanks!

On Tuesday, 24 December 2013 11:04 PM, Harshavardhana har...@harshavardhana.net wrote:

On Tue, Dec 24, 2013 at 8:21 AM, Anirban Ghoshal chalcogen_eg_oxy...@yahoo.com wrote:
> Hi, and thanks a lot, Anand! I was initially searching for a good answer to why the glusterfs site lists knfsd as NOT compatible with glusterfs. So, now I know. :)
> Funnily enough, we didn't have a problem with the failover during our testing. We passed constant fsid's (fsid=xxx) while exporting our mounts, and NFS mounts on client applications haven't called any of the file handles out stale while migrating the NFS service from one server to the other. Not sure why this happens.

Using fsid is just a workaround, always used to solve ESTALE on file handles. The device major/minor numbers are embedded in the NFS file handle; the problem when an NFS export is failed over or moved to another node is that these numbers change when the resource is exported on the new node, resulting in the client seeing a 'Stale NFS file handle' error. We need to make sure the embedded number stays the same; that is where the fsid export option comes in - allowing us to specify a coherent number across various clients. The Gluster NFS server is a way cleaner solution for such consistency.

Another thing would be to take the next step - give 'NFS-Ganesha' and 'GlusterFS' integration a go?

https://forge.gluster.org/nfs-ganesha-and-glusterfs-integration
http://www.gluster.org/2013/09/gluster-ganesha-nfsv4-initial-impressions/

Cheers
--
Religious confuse piety with mere ritual, the virtuous confuse regulation with outcomes
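The fsid workaround discussed above boils down to a single export option; a sketch of an /etc/exports line for a FUSE-mounted gluster volume, with the path, network, and fsid value assumed for illustration:

  # /etc/exports on both servers: an identical fsid keeps the embedded
  # filesystem id stable across an NFS service failover
  /testvol 192.168.1.0/24(rw,sync,fsid=10,no_subtree_check)

  # apply without restarting the NFS server
  exportfs -ra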
Re: [Gluster-users] Passing noforget option to glusterfs native client mounts
If somebody has an idea on how this could be done, could you please help out? I am still stuck on this, apparently.

Thanks,
Anirban

On Thursday, 19 December 2013 1:40 AM, Chalcogen chalcogen_eg_oxy...@yahoo.com wrote:

P.s. I think I need to clarify this: I am only reading from the mounts, and not modifying anything on the server, and so the commonest causes of stale file handles do not apply.

Anirban

On Thursday 19 December 2013 01:16 AM, Chalcogen wrote:

Hi everybody,

A few months back I joined a project where people want to replace their legacy fuse-based (twin-server) replicated file-system with GlusterFS. They also have high-availability NFS server code tied to the kernel NFSD that they would wish to retain (the nfs-kernel-server, I mean). The reason they wish to retain the kernel NFS, and not use the NFS server that comes with GlusterFS, is mainly that there's a bit of code that allows NFS IPs to be migrated from one host server to the other in case one happens to go down, and tweaks on the export server configuration allow the file handles to remain identical on the new host server.

The solution was to mount gluster volumes using the mount.glusterfs native client program and then export the directories over the kernel NFS server. This seems to work most of the time, but on rare occasions 'stale file handle' is reported off certain clients, which really puts a damper over the 'high-availability' thing. After suitably instrumenting the nfsd/fuse code in the kernel, it seems that decoding of the file handle fails on the server because the inode record corresponding to the nodeid in the handle cannot be looked up. Combining this with the fact that a second attempt by the client to execute lookup on the same file passes, one might suspect that the problem is identical to the one many people attempting to export fuse mounts over the kernel's NFS server are facing; viz, fuse 'forgets' the inode records, thereby causing ilookup5() to fail. Miklos and other fuse developers/hackers would point towards '-o noforget' while mounting their fuse file-systems.

I tried passing '-o noforget' to mount.glusterfs, but it does not seem to recognize it. Could somebody help me out with the correct syntax to pass noforget to gluster volumes? Or, something we could pass to glusterfs that would instruct fuse to allocate a bigger cache for our inodes? Additionally, should you think that something else might be behind our problems, please do let me know.

Here's my configuration:

Linux kernel version: 2.6.34.12
GlusterFS version: 3.4.0
nfs.disable option for volumes: OFF on all volumes

Thanks a lot for your time!
Anirban

P.s. I found quite a few pages on the web that admonish users that GlusterFS is not compatible with the kernel NFS server, but do not really give much detail. Is this one of the reasons for saying so?