Hi,

Just an update on this - we made our ACLs much, much stricter around the Gluster ports and, to my knowledge, we haven't seen a brick death since.
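For anyone wanting to do something similar, a minimal sketch of that kind of ACL (assuming iptables; 10.0.0.0/24 is a placeholder for your storage LAN, and the port range should be checked against your own "gluster volume status" output - 24007/24008 are the management ports and bricks default to 49152 and up):

# Accept Gluster management and brick traffic from trusted peers only
# (10.0.0.0/24 is an example subnet - substitute your own)
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 24007,24008,49152:49251 -j ACCEPT
# Drop the same ports from everywhere else (port scanners included)
iptables -A INPUT -p tcp -m multiport --dports 24007,24008,49152:49251 -j DROP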
Ben

On Wed, Dec 11, 2019 at 12:43 PM Ben Tasker <btas...@swiftserve.com> wrote:
> Hi Xavi,
>
> We don't, that I'm explicitly aware of, *but* I can't rule it out as a
> possibility, as some of our partners may do (some/most certainly
> have scans done as part of pentests fairly regularly).
>
> But that does at least give me an avenue to pursue in the meantime, thanks!
>
> Ben
>
> On Wed, Dec 11, 2019 at 12:16 PM Xavi Hernandez <jaher...@redhat.com> wrote:
>> Hi Ben,
>>
>> I've recently seen some issues that seem similar to yours (based on the
>> stack trace in the logs). Right now it seems that in these cases the
>> problem is caused by some port-scanning tool that triggers an unhandled
>> condition. We are still investigating what is causing this so we can fix
>> it as soon as possible.
>>
>> Do you have one of these tools on your network?
>>
>> Regards,
>>
>> Xavi
>>
>> On Tue, Dec 10, 2019 at 7:53 PM Ben Tasker <btas...@swiftserve.com> wrote:
>>> Hi,
>>>
>>> A little while ago we had an issue with Gluster 6. As it was urgent, we
>>> downgraded to Gluster 5.9 and it went away.
>>>
>>> Some boxes are now running 5.10 and the issue has come back.
>>>
>>> From the operator's point of view, the first you know about this is
>>> getting reports that the transport endpoint is not connected:
>>>
>>> OSError: [Errno 107] Transport endpoint is not connected: '/shared/lfd/benfusetestlfd'
>>>
>>> If we check, we can see that the brick process has died:
>>>
>>> # gluster volume status
>>> Status of volume: shared
>>> Gluster process                      TCP Port  RDMA Port  Online  Pid
>>> ------------------------------------------------------------------------------
>>> Brick fa01.gl:/data1/gluster         N/A       N/A        N       N/A
>>> Brick fa02.gl:/data1/gluster         N/A       N/A        N       N/A
>>> Brick fa01.gl:/data2/gluster         49153     0          Y       14136
>>> Brick fa02.gl:/data2/gluster         49153     0          Y       14154
>>> NFS Server on localhost              N/A       N/A        N       N/A
>>> Self-heal Daemon on localhost        N/A       N/A        Y       186193
>>> NFS Server on fa01.gl                N/A       N/A        N       N/A
>>> Self-heal Daemon on fa01.gl          N/A       N/A        Y       6723
>>>
>>> Looking in the brick logs, we can see that the process crashed, and we
>>> get a backtrace (of sorts):
>>>
>>> >gen=110, slot->fd=17
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-07-04 09:42:43
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 6.1
>>> /lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
>>> /lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
>>> /usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
>>> /lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
>>> /lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]
>>>
>>> Other than that, there's not a lot in the logs. In syslog we can see the
>>> client (Gluster's FS is mounted on the boxes) complaining that the brick's
>>> gone away.
>>>
>>> Software versions (for when this was happening with 6):
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-libs-6.1-1.el7.x86_64
>>> glusterfs-cli-6.1-1.el7.x86_64
>>> centos-release-gluster6-1.0-1.el7.centos.noarch
>>> glusterfs-6.1-1.el7.x86_64
>>> glusterfs-api-6.1-1.el7.x86_64
>>> glusterfs-server-6.1-1.el7.x86_64
>>> glusterfs-client-xlators-6.1-1.el7.x86_64
>>> glusterfs-fuse-6.1-1.el7.x86_64
>>>
>>> This was happening pretty regularly (uncomfortably so) on boxes running
>>> Gluster 6. Grepping through the brick logs, it's always a segfault or
>>> SIGABRT that leads to the brick death:
>>>
>>> # grep "signal received:" data*
>>> data1-gluster.log:signal received: 11
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 6
>>> data1-gluster.log:signal received: 11
>>> data2-gluster.log:signal received: 6
>>>
>>> There's no apparent correlation in times or usage levels that we could
>>> see. The issue was occurring on a wide array of hardware, spread across
>>> the globe (but always talking to local - i.e. LAN - peers). All the same,
>>> disks were checked, RAM checked, etc.
>>>
>>> Digging through the logs, we were able to find the lines just as the
>>> crash occurs:
>>>
>>> [2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2
>>> [2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1
>>> [2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]
>>> [2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1 (Input/output error)
>>>
>>> But it's not the first time that had occurred, so it may be completely
>>> unrelated.
>>>
>>> When this happens, restarting Gluster buys some time. It may just be
>>> coincidental, but our searches through the logs showed *only* the first
>>> brick process dying; processes for other bricks (some of the boxes have 4)
>>> don't appear to be affected by this.
>>>
>>> As we had lots and lots of Gluster machines failing across the network,
>>> at this point we stopped investigating and I came up with a downgrade
>>> procedure so that we could get production back into a usable state.
>>> Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue
>>> just went away. Unfortunately other demands came up, so no-one was able
>>> to follow up on it.
>>>
>>> Tonight though, there's been a brick process failure on a 5.10 machine
>>> with an all too familiar looking BT:
>>>
>>> [2019-12-10 17:20:01.708601] I [MSGID: 115029] [server-handshake.c:537:server_setvolume] 0-shared-server: accepted client from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0 (version: 5.10)
>>> [2019-12-10 17:20:01.745940] I [MSGID: 115036] [server.c:469:server_rpc_notify] 0-shared-server: disconnecting connection from CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> [2019-12-10 17:20:01.746090] I [MSGID: 101055] [client_t.c:435:gf_client_unref] 0-shared-server: Shutting down connection CTX_ID:84c0d874-4c60-4a49-80f8-b344f3b376ba-GRAPH_ID:0-PID:33972-HOST:fa02.vn10.swiftserve.com-PC_NAME:shared-client-4-RECON_NO:-0
>>> pending frames:
>>> patchset: git://git.gluster.org/glusterfs.git
>>> signal received: 11
>>> time of crash:
>>> 2019-12-10 17:21:36
>>> configuration details:
>>> argp 1
>>> backtrace 1
>>> dlfcn 1
>>> libpthread 1
>>> llistxattr 1
>>> setfsid 1
>>> spinlock 1
>>> epoll.h 1
>>> xattr.h 1
>>> st_atim.tv_nsec 1
>>> package-string: glusterfs 5.10
>>> /lib64/libglusterfs.so.0(+0x26650)[0x7f6a1c6f3650]
>>> /lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f6a1c6fdc04]
>>> /lib64/libc.so.6(+0x363b0)[0x7f6a1ad543b0]
>>> /usr/lib64/glusterfs/5.10/rpc-transport/socket.so(+0x9e3b)[0x7f6a112dae3b]
>>> /lib64/libglusterfs.so.0(+0x8aab9)[0x7f6a1c757ab9]
>>> /lib64/libpthread.so.0(+0x7e65)[0x7f6a1b556e65]
>>> /lib64/libc.so.6(clone+0x6d)[0x7f6a1ae1c88d]
>>> ---------
>>>
>>> Versions this time are:
>>>
>>> # rpm -qa | grep glus
>>> glusterfs-server-5.10-1.el7.x86_64
>>> centos-release-gluster5-1.0-1.el7.centos.noarch
>>> glusterfs-fuse-5.10-1.el7.x86_64
>>> glusterfs-libs-5.10-1.el7.x86_64
>>> glusterfs-client-xlators-5.10-1.el7.x86_64
>>> glusterfs-api-5.10-1.el7.x86_64
>>> glusterfs-5.10-1.el7.x86_64
>>> glusterfs-cli-5.10-1.el7.x86_64
>>>
>>> These boxes have been running 5.10 for less than 48 hours.
>>>
>>> Has anyone else run into this? Assuming the root cause is the same (it's
>>> a fairly limited BT, so hard to say for sure), was something from 6
>>> backported into 5.10?
>>>
>>> Thanks
>>>
>>> Ben
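Aside, for anyone who lands on this thread later: a quick way to check whether you're seeing the same thing is to look for offline bricks and for the crash signature in the brick logs. Something along these lines should work (assuming the default log location under /var/log/glusterfs/bricks/ and that the "Online" column sits second from last in the status output, as in the listings above):

# Bricks that gluster currently reports as offline ("N" in the Online column)
gluster volume status | awk '$1 == "Brick" && $(NF-1) == "N"'
# Count of segfaults/aborts recorded per brick log
grep -c "signal received:" /var/log/glusterfs/bricks/*.log

An offline brick paired with a "signal received: 11" (or 6) in its log matches the failure described above.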
________

Community Meeting Calendar:

APAC Schedule -
Every 2nd and 4th Tuesday at 11:30 AM IST
Bridge: https://bluejeans.com/441850968

NA/EMEA Schedule -
Every 1st and 3rd Tuesday at 01:00 PM EDT
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users