Public bug reported:

Consider an fstab entry that uses both volfile-server and
backupvolfile-server on an Ubuntu 14.04.2 LTS server:

    mygluster:/mydir /var/mydir glusterfs defaults,nobootwait,nofail,_netdev,backupvolfile-server=mygluster-bak 0 0

If the host mygluster is reachable at boot time, the mount succeeds.
However, if mygluster is offline (because of a DNS error, for example)
and mygluster-bak is online, the mount fails at boot time.

The bug only occurs at boot time. After boot, running 'mount
/var/mydir' works, using the mygluster-bak server as expected.

## How to reproduce

Put the following entry in your fstab:

    non-existent:/mydir /var/mydir glusterfs defaults,nobootwait,nofail,_netdev,backupvolfile-server=mygluster-bak 0 0

Mount the filesystem and check that the mount succeeds:

    $ mount /var/mydir; mount | grep mydir; umount /var/mydir
    non-existent:/mydir on /var/mydir type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

Now reboot your system a few times and observe that the mount
sometimes fails. When it does, run 'mount /var/mydir' and the
filesystem mounts successfully:

    $ mount | grep mydir
    $ mount /var/mydir; mount | grep mydir
    non-existent:/mydir on /var/mydir type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072)

### Logs

The boot.log, dmesg, mountall and the gluster logfile (var-lib-glance-
images.log in my specific case) will be attached (each one in a
separate comment because of
https://bugs.launchpad.net/launchpad/+bug/82652).

However, the only log that really helps is the gluster logfile, with
entries like:

    [glusterfsd.c:1910:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.2 (/usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent /var/mydir)
    [name.c:249:af_inet_client_get_remote_sockaddr] 0-glusterfs: DNS resolution failed on host non-existent
    [fuse-bridge.c:5260:fini] 0-fuse: Unmounting '/var/mydir'.
    [glusterfsd.c:1910:main] 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.4.2 (/usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=mygluster-bak /var/mydir)
    [fuse-bridge.c:5016:init] 0-fuse: Mountpoint /var/mydir seems to have a stale mount, run 'umount /var/mydir' and try again.
    [xlator.c:390:xlator_init] 0-fuse: Initialization of volume 'fuse' failed, review your volfile again

## Log analysis and debugging

The "Mountpoint /var/mydir seems to have a stale mount, run 'umount
/var/mydir' and try again" log helps a lot.

I've changed the /sbin/mount.glusterfs script to increase verbosity and
discovered some more useful info:

- At first, mount.glusterfs runs: /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent /var/mydir
- Then it runs 'stat -c %i /var/mydir' to test whether the inode is 1 (mount successful at this mount point) or some other number. In a normal mount attempt (running 'mount /var/mydir' after boot), this step returns a large number like 4198417. During boot, however, it returns no output and prints the following error to stderr: **stat: cannot stat ‘/var/mydir’: Transport endpoint is not connected**;
- Next, mount.glusterfs runs: /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=mygluster-bak /var/mydir;
- Again, it runs 'stat -c %i /var/mydir' and gets the same **Transport endpoint is not connected** error;
- Finally, mount.glusterfs prints "Mount failed. Please check the log file for more details.", runs "umount /var/mydir" and exits with status 1 (see the sketch after this list).
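
To make the flow concrete, here is a simplified sketch of the logic described above. This is not the literal 3.4.2 script text: $mount_point and $cmd_line1 appear in the patches below, while $cmd_line0, $server and $backup_server are placeholder names used only for this sketch.

    # Simplified sketch of the mount.glusterfs flow described above
    # (not the literal script; placeholder variables are assumed set).
    cmd_line0="/usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=$server";
    cmd_line1="/usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=$backup_server";

    # First attempt, against the primary volfile server.
    $cmd_line0 $mount_point;

    # Inode 1 means a live FUSE root, i.e. the mount succeeded.  When the
    # first mount fails cleanly, stat returns the underlying directory's
    # inode (the large number above); when the race hits, stat itself fails
    # with "Transport endpoint is not connected" and $inode stays empty.
    inode=$(stat -c %i $mount_point 2>/dev/null);

    if [ "$inode" != "1" ]; then
        err=1;
        if [ -n "$cmd_line1" ]; then
            # Second attempt, against the backupvolfile-server host.
            cmd_line1=$(echo "$cmd_line1 $mount_point");
            $cmd_line1;
            inode=$(stat -c %i $mount_point 2>/dev/null);
            [ "$inode" = "1" ] && err=0;
        fi
    fi

    if [ "$err" = "1" ]; then
        echo "Mount failed. Please check the log file for more details.";
        umount $mount_point > /dev/null 2>&1;
        exit 1;
    fi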

So I did some tests to get more information about the "Transport
endpoint is not connected" error and discovered that it occurs for a
very short time after a mount error. It is possible to hit this error
at any time, not only during boot. The following command will sometimes
reproduce it (it is sporadic):

    $ /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent /var/mydir; stat -c %i /var/mydir
    4198417
    $ /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent /var/mydir; stat -c %i /var/mydir
    stat: cannot stat ‘/var/mydir’: Transport endpoint is not connected
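
Since the error is sporadic, a small loop built from the same two commands makes it easier to catch. This loop is only a convenience for reproducing the race; it is not part of mount.glusterfs, and the iteration count is arbitrary.

    # Repeat the failing mount + stat sequence until the race is hit.
    for i in $(seq 1 50); do
        /usr/sbin/glusterfs --volfile-id=/mydir --volfile-server=non-existent /var/mydir;
        stat -c %i /var/mydir || echo "hit the race on iteration $i";
        # Defensive cleanup in case a stale FUSE mount is left behind.
        umount /var/mydir > /dev/null 2>&1;
        sleep 1;
    done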

To get even more debug info, I modified the mount.glusterfs script
again to run 'fuser -m /var/mydir' after the first 'stat' fails with
"Transport endpoint is not connected", to see which PIDs were using the
filesystem, and got:

    $ fuser -m /var/mydir
         1   371  1280  1758  2287  2503
    $ ps -ww -up 1 371 1280 1758 2287 2503
    USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
    root         1 16.8  0.0  34156  3532 ?        Ss   16:36   0:04 /sbin/init
    root       371  0.1  0.0  20132   976 ?        S    16:36   0:00 @sbin/plymouthd --mode=boot --attach-to-session --pid-file=/run/initramfs/plymouth.pid
    statd     1280  0.0  0.0  21540  1396 ?        Ss   16:36   0:00 rpc.statd -L
    syslog    1758  0.0  0.0 255840  1216 ?        Ssl  16:36   0:00 rsyslogd

Unfortunately, I could not get any output for PIDs 2287 and 2503.

I don't know whether the second mount error is related to these PIDs
returned by 'fuser' (or whether their presence is normal), nor whether
they are related to the "Transport endpoint is not connected" error,
but it may be a starting point.
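
For reference, the fuser debug change described above was essentially of this shape. The context lines are copied from the hunk shown in the patches below; the exact placement in my modified script may differ, so treat this as illustrative only.

--- mount.glusterfs.orig
+++ mount.glusterfs
@@ -226,6 +226,7 @@
     if [ $inode -ne 1 ]; then
+        fuser -m $mount_point;
         err=1;
         if [ -n "$cmd_line1" ]; then
             cmd_line1=$(echo "$cmd_line1 $mount_point");
             $cmd_line1;
             err=0;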

## Possible solutions and workaround

I looked into the "seems to have a stale mount, run 'umount ...' and
try again" message and found this commit in the upstream code:
https://github.com/gluster/glusterfs/commit/08041c.

The commit message says "Also, mount.glusterfs script unmounts
mount-point on mount failure to prevent hung mounts". This refers to
the umount line in mount.glusterfs:
https://github.com/gluster/glusterfs/commit/08041c#diff-7829823331339149cb845ff035efff54R165.

I do not know whether running umount (as implemented after the last
mount error and suggested in the gluster log) fixes any "Transport
endpoint is not connected" error or only some other specific mount
hang, but a possible solution is to add this line before the second
mount attempt (after the first failure):

--- mount.glusterfs.orig        2015-06-12 01:02:18.943119823 -0300
+++ mount.glusterfs     2015-06-12 01:24:52.824311071 -0300
@@ -226,6 +226,7 @@
     if [ $inode -ne 1 ]; then
         err=1;
         if [ -n "$cmd_line1" ]; then
+            umount $mount_point > /dev/null 2>&1;
             cmd_line1=$(echo "$cmd_line1 $mount_point");
             $cmd_line1;
             err=0;

After this "patch", the mount point using the backupvolfile-server (that
failed most at the time) worked at most times, however it still failing
sometimes. The solution the always solved the mount was:

--- mount.glusterfs.orig        2015-06-12 01:02:18.943119823 -0300
+++ mount.glusterfs     2015-06-12 01:28:07.610199716 -0300
@@ -226,6 +226,7 @@
     if [ $inode -ne 1 ]; then
         err=1;
         if [ -n "$cmd_line1" ]; then
+            sleep 0.1;
             cmd_line1=$(echo "$cmd_line1 $mount_point");
             $cmd_line1;
             err=0;

I tested the last patch across many reboots (more than 60) and the
mount worked in all of them.

### Why not backport a fix from upstream

Since commit https://github.com/gluster/glusterfs/commit/b610f1,
upstream no longer performs two mount attempts in mount.glusterfs (it
uses a different approach). So, although that commit solves the
problem, it makes no sense to backport the upstream solution: it
changes the GlusterFS client behavior and should not be shipped in a
bugfix update.

** Affects: glusterfs (Ubuntu)
     Importance: Undecided
         Status: New
