I happened to review the status of the volume clients and realized they were reporting a mix of op-versions: some 3.13 clients were still connecting to the downgraded 3.12 servers (likely a timing issue between downgrading the clients and remounting the volumes). Remounting the reported clients resulted in the correct op-version all around, and we've now had about a week free of these errors.
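
In case it's useful to anyone hitting the same mix, this is roughly how I checked (a sketch; VOL and /mnt/VOL stand in for our real names, and the per-client op-version output assumes a 3.10+ CLI):

% gluster volume get all cluster.op-version   # cluster-wide op-version
% gluster volume status VOL clients           # connected clients; recent releases include an op-version column
% umount /mnt/VOL && mount -t glusterfs SERVER1:/VOL /mnt/VOL   # remount a client stuck on the old op-version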

On 2018-03-01 12:38, Ryan Lee wrote:
Thanks for your response. Is there more that would be useful in addition to what I already attached?  We're logging at the default level on the brick side and at ERROR on the clients.  I could turn the levels up for a few days to try to catch this problem in action (it's happened several more times since I first wrote).
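
If it would help, this is the sketch I have in mind for turning it up (assuming the standard diagnostics options; VOL is the anonymized volume name):

% gluster volume set VOL diagnostics.client-log-level DEBUG
% gluster volume set VOL diagnostics.brick-log-level DEBUG
# ...and back to the defaults after a few days:
% gluster volume reset VOL diagnostics.client-log-level
% gluster volume reset VOL diagnostics.brick-log-level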

On 2018-02-28 18:38, Raghavendra Gowdappa wrote:
Is it possible to attach the logfiles of the problematic client and bricks?

On Thu, Mar 1, 2018 at 3:00 AM, Ryan Lee <ryan...@zepheira.com> wrote:

    We'd been on the Gluster 3.7 series for several years with things
    pretty stable.  Given that it has reached EOL, yesterday I upgraded
    to 3.13.2.  Every Gluster mount and server was taken down and brought
    back up after the upgrade, the op-version was changed to 31302, and
    then we tried it all out.
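
    For reference, the sequence was roughly the following (a sketch; the
    package steps vary by distribution, and /mnt/VOL is a placeholder):

    % umount /mnt/VOL                 # on every client
    % systemctl stop glusterd         # on each server; brick (glusterfsd)
                                      # processes may need stopping separately
    # ...upgrade the glusterfs packages to 3.13.2...
    % systemctl start glusterd        # on each server
    % gluster volume set all cluster.op-version 31302
    % mount -t glusterfs SERVER1:/VOL /mnt/VOL   # remount each client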

    It went poorly.  Every sizable read and write (hundreds of MB) led to
    'Transport endpoint is not connected' errors on the command line and
    immediate unavailability of the mount.  After unsuccessfully searching
    for similar problems with solutions, I ended up downgrading to 3.12.6
    and changing the op-version to 31202.  That brought us back to
    usability, with the majority of those operations succeeding enough to
    consider it online, but there are still occasional mount disconnects
    that we never saw with 3.7 - about 6 in the past 18 hours.  It seems a
    disconnected mount never comes back on its own, either; manually
    remounting reconnects it immediately.  Each disconnect affects only
    the one client, though some disconnects have occurred simultaneously
    on clients with simultaneous activity.  The lower-level log info seems
    to indicate a socket problem, potentially broken on the client side
    based on timing (but the timing is narrow, and I won't claim the
    clocks are that well synchronized across all our servers).  The client
    and one server claim a socket polling error with no data available,
    and the other server claims a writev error.  This seems to lead the
    client to the 'all subvolumes are down' state, even though all other
    clients remain connected.  Has anybody run into this?  Did I miss
    anything moving so many versions ahead?
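
    For concreteness, recovering an affected client is just a remount
    (lazy unmount in case the mount point is wedged; /mnt/VOL is a
    placeholder):

    % umount -l /mnt/VOL
    % mount -t glusterfs SERVER1:/VOL /mnt/VOL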

    I've included the output of volume info and some excerpts from the
    logs.  We have two servers running glusterd and two replica volumes
    with a brick on each server.  Both volumes experience disconnects;
    there are about 10 clients for each, with one client using both.  We
    use SSL over internal IPv4.  Names in all caps were replaced, as were
    IP addresses.

    Let me know if there's anything else I can provide.

    % gluster v info VOL
    Volume Name: VOL
    Type: Replicate
    Volume ID: 3207155f-02c6-447a-96c4-5897917345e0
    Status: Started
    Snapshot Count: 0
    Number of Bricks: 1 x 2 = 2
    Transport-type: tcp
    Bricks:
    Brick1: SERVER1:/glusterfs/VOL-brick1/data
    Brick2: SERVER2:/glusterfs/VOL-brick2/data
    Options Reconfigured:
    config.transport: tcp
    features.selinux: off
    transport.address-family: inet
    nfs.disable: on
    client.ssl: on
    performance.readdir-ahead: on
    auth.ssl-allow: [NAMES, including CLIENT]
    server.ssl: on
    ssl.certificate-depth: 3

    Log excerpts (there was nothing related in glusterd.log):

    CLIENT:/var/log/glusterfs/mnt-VOL.log
    [2018-02-28 19:35:58.378334] E [socket.c:2648:socket_poller]
    0-VOL-client-1: socket_poller SERVER2:49153 failed (No data available)
    [2018-02-28 19:35:58.477154] E [MSGID: 108006]
    [afr-common.c:5164:__afr_handle_child_down_event] 0-VOL-replicate-0:
    All subvolumes are down. Going offline until atleast one of them
    comes back up.
    [2018-02-28 19:35:58.486146] E [MSGID: 101046]
    [dht-common.c:1501:dht_lookup_dir_cbk] 0-VOL-dht: dict is null <67
    times>
    <lots of saved_frames_unwind messages>
    [2018-02-28 19:38:06.428607] E [socket.c:2648:socket_poller]
    0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)
    [2018-02-28 19:40:12.548650] E [socket.c:2648:socket_poller]
    0-VOL-client-1: socket_poller SERVER2:24007 failed (No data available)

    <manual umount / mount>


    SERVER2:/var/log/glusterfs/bricks/VOL-brick2.log
    [2018-02-28 19:35:58.379953] E [socket.c:2632:socket_poller]
    0-tcp.VOL-server: poll error on socket
    [2018-02-28 19:35:58.380530] I [MSGID: 115036]
    [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
    connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
    [2018-02-28 19:35:58.380932] I [socket.c:3672:socket_submit_reply]
    0-tcp.VOL-server: not connected (priv->connected = -1)
    [2018-02-28 19:35:58.380960] E [rpcsvc.c:1364:rpcsvc_submit_generic]
    0-rpc-service: failed to submit message (XID: 0xa4e, Program:
    GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
    (tcp.VOL-server)
    [2018-02-28 19:35:58.381124] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
    [0x7f97bd37ee6a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
    [0x7f97bcf1f4c8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
    [0x7f97bcf0abd5] ) 0-: Reply submission failed
    [2018-02-28 19:35:58.381196] I [MSGID: 101055]
    [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
    connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-1-0-0
    [2018-02-28 19:40:58.351350] I [addr.c:55:compare_addr_and_update]
    0-/glusterfs/VOL-brick2/data: allowed = "*", received addr =
    "CLIENT"
    [2018-02-28 19:40:58.351684] I [login.c:34:gf_auth] 0-auth/login:
    connecting user name: CLIENT

    SERVER1:/var/log/glusterfs/bricks/VOL-brick1.log
    [2018-02-28 19:35:58.509713] W [socket.c:593:__socket_rwv]
    0-tcp.VOL-server: writev on CLIENT:49150 failed (No data available)
    [2018-02-28 19:35:58.509839] E [socket.c:2632:socket_poller]
    0-tcp.VOL-server: poll error on socket
    [2018-02-28 19:35:58.509957] I [MSGID: 115036]
    [server.c:527:server_rpc_notify] 0-VOL-server: disconnecting
    connection from CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
    [2018-02-28 19:35:58.510258] I [socket.c:3672:socket_submit_reply]
    0-tcp.VOL-server: not connected (priv->connected = -1)
    [2018-02-28 19:35:58.510281] E [rpcsvc.c:1364:rpcsvc_submit_generic]
    0-rpc-service: failed to submit message (XID: 0x4b3f, Program:
    GlusterFS 3.3, ProgVers: 330, Proc: 25) to rpc-transport
    (tcp.VOL-server)
    [2018-02-28 19:35:58.510357] E [server.c:195:server_submit_reply]
(-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/debug/io-stats.so(+0x1ae6a)
    [0x7f85bb7a8e6a]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x1d4c8)
    [0x7f85bb3494c8]
-->/usr/lib/x86_64-linux-gnu/glusterfs/3.12.6/xlator/protocol/server.so(+0x8bd5)
    [0x7f85bb334bd5] ) 0-: Reply submission failed
    [2018-02-28 19:35:58.510409] I [MSGID: 101055]
    [client_t.c:443:gf_client_unref] 0-VOL-server: Shutting down
    connection CLIENT-30688-2018/02/28-03:11:39:784734-VOL-client-0-0-0
    [2018-02-28 19:40:58.364068] I [addr.c:55:compare_addr_and_update]
    0-/glusterfs/VOL-brick1/data: allowed = "*", received addr =
    "CLIENT"
    [2018-02-28 19:40:58.364137] I [login.c:34:gf_auth] 0-auth/login:
    connecting user name: CLIENT