[Gluster-users] Gluster Periodic Brick Process Deaths

2019-12-10 Thread Ben Tasker
Hi,

A little while ago we had an issue with Gluster 6. As it was urgent we
downgraded to Gluster 5.9 and it went away.

Some boxes are now running 5.10 and the issue has come back.

From the operator's point of view, the first you know about this is getting
reports that the transport endpoint is not connected:

OSError: [Errno 107] Transport endpoint is not connected:
'/shared/lfd/benfusetestlfd'


If we check, we can see that the brick process has died

# gluster volume status
Status of volume: shared
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick fa01.gl:/data1/gluster                N/A       N/A        N       N/A
Brick fa02.gl:/data1/gluster                N/A       N/A        N       N/A
Brick fa01.gl:/data2/gluster                49153     0          Y       14136
Brick fa02.gl:/data2/gluster                49153     0          Y       14154
NFS Server on localhost                     N/A       N/A        N       N/A
Self-heal Daemon on localhost               N/A       N/A        Y       186193
NFS Server on fa01.gl                       N/A       N/A        N       N/A
Self-heal Daemon on fa01.gl                 N/A       N/A        Y       6723
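
As an aside, for anyone wanting to catch this before users report the ENOTCONN errors above, a minimal watchdog sketch along these lines (the volume name "shared" is taken from the output above, and the logger call is just a placeholder alert path) flags any brick whose Online column reads N:

#!/usr/bin/env bash
# Rough sketch, not a polished tool: parse `gluster volume status` and log
# any Brick line whose Online column (second-to-last field) is "N".
VOLUME=shared
down=$(gluster volume status "$VOLUME" | awk '/^Brick/ && $(NF-1) == "N" {print $2}')
if [ -n "$down" ]; then
    logger -t gluster-watch "offline bricks on volume $VOLUME: $down"
fi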


Looking in the brick logs, we can see that the process crashed, and we get
a backtrace (of sorts)

>gen=110, slot->fd=17
pending frames:
patchset: git://git.gluster.org/glusterfs.git
signal received: 11
time of crash:
2019-07-04 09:42:43
configuration details:
argp 1
backtrace 1
dlfcn 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 6.1
/lib64/libglusterfs.so.0(+0x26db0)[0x7f79984eadb0]
/lib64/libglusterfs.so.0(gf_print_trace+0x334)[0x7f79984f57b4]
/lib64/libc.so.6(+0x36280)[0x7f7996b2a280]
/usr/lib64/glusterfs/6.1/rpc-transport/socket.so(+0xa4cc)[0x7f798c8af4cc]
/lib64/libglusterfs.so.0(+0x8c286)[0x7f7998550286]
/lib64/libpthread.so.0(+0x7dd5)[0x7f799732add5]
/lib64/libc.so.6(clone+0x6d)[0x7f7996bf1ead]


Other than that, there's not a lot in the logs. In syslog we can see the
client (Gluster's FS is mounted on the boxes) complaining that the brick's
gone away.

Software versions (for when this was happening with 6):

# rpm -qa | grep glus
glusterfs-libs-6.1-1.el7.x86_64
glusterfs-cli-6.1-1.el7.x86_64
centos-release-gluster6-1.0-1.el7.centos.noarch
glusterfs-6.1-1.el7.x86_64
glusterfs-api-6.1-1.el7.x86_64
glusterfs-server-6.1-1.el7.x86_64
glusterfs-client-xlators-6.1-1.el7.x86_64
glusterfs-fuse-6.1-1.el7.x86_64


This was happening pretty regularly (uncomfortably so) on boxes running
Gluster 6. Grepping through the brick logs, it's always a segfault or
SIGABRT that leads to brick death:

# grep "signal received:" data*
data1-gluster.log:signal received: 11
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 6
data1-gluster.log:signal received: 11
data2-gluster.log:signal received: 6

There's no apparent correlation in times or usage levels that we could see.
The issue was occurring on a wide array of hardware, spread across the
globe (but always talking to local - i.e. LAN - peers). All the same, disks
were checked, RAM was checked, etc.
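
For anyone hunting for the same pattern, the crash timestamps can be pulled straight out of the brick logs for comparison across hosts. A rough sketch (the "time of crash:" marker comes from the backtrace format shown earlier; the log filenames will vary per brick):

# Print each brick log's crash timestamps (the line following "time of crash:").
grep -A1 'time of crash:' data*-gluster.log | grep -E '[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}'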

Digging through the logs, we were able to find the lines written just as the
crash occurs:

[2019-07-07 06:37:00.213490] I [MSGID: 108031] [afr-common.c:2547:afr_local_discovery_cbk] 0-shared-replicate-1: selecting local read_child shared-client-2
[2019-07-07 06:37:03.544248] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a9565e4b-9148-4969-91e8-ba816aea8f6a: split-brain observed. [Input/output error]
[2019-07-07 06:37:03.544312] W [MSGID: 0] [dht-inode-write.c:1156:dht_non_mds_setattr_cbk] 0-shared-dht: subvolume shared-replicate-1 returned -1
[2019-07-07 06:37:03.545317] E [MSGID: 108008] [afr-transaction.c:2877:afr_write_txn_refresh_done] 0-shared-replicate-1: Failing SETATTR on gfid a8dd2910-ff64-4ced-81ef-01852b7094ae: split-brain observed. [Input/output error]
[2019-07-07 06:37:03.545382] W [fuse-bridge.c:1583:fuse_setattr_cbk] 0-glusterfs-fuse: 2241437: SETATTR() /lfd/benfusetestlfd/_logs => -1 (Input/output error)

But it's not the first time that had occurred, so it may be completely
unrelated.
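
Whether or not it's related, the split-brain state those entries refer to can be inspected independently of the crashes via the heal commands. A quick sketch for the volume above; the path comes from the log lines, and the latest-mtime policy is only one of the available resolution options, so pick deliberately:

# List entries AFR currently considers to be in split-brain on this volume.
gluster volume heal shared info split-brain

# One possible resolution for a given entry, keeping the copy with the newest
# mtime (bigger-file and source-brick policies also exist).
gluster volume heal shared split-brain latest-mtime /lfd/benfusetestlfd/_logs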

When this happens, restarting Gluster buys some time. It may just be
coincidental, but our searches through the logs showed *only* the first
brick process dying; processes for other bricks (some of the boxes have 4)
don't appear to be affected by this.
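
If the aim is just to revive the dead brick process rather than bounce everything, a forced volume start is normally enough, since glusterd only respawns brick processes that are offline (volume name as above):

# Respawn only the offline brick processes for this volume; already-running
# bricks and connected clients are left alone.
gluster volume start shared force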

As we had lots and lots of Gluster machines failing across the network, at
this point we stopped investigating and I came up with a downgrade
procedure so that we could get production back into a usable state.
Machines running Gluster 6 were downgraded to Gluster 5.9 and the issue
just went away. Unfortunately other demands came up, so no-one was able to
follow up on it.

Tonight though, there's been a brick process fail on 

Re: [Gluster-users] Gluster Periodic Brick Process Deaths

2019-12-11 Thread Xavi Hernandez
Hi Ben,

I've recently seen some issues that seem similar to yours (based on the
stack trace in the logs). Right now it seems that in these cases the
problem is caused by some port scanning tool that triggers an unhandled
condition. We are still investigating the root cause so that we can fix it
as soon as possible.

Do you have one of these tools on your network?
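
(If it helps narrow this down: port scanners typically probe with TCP connect or SYN scans, so a basic connect scan of the management and brick ports on a non-production node is one way to test whether scanning alone reproduces the crash. The host and port range below are placeholders only.)

# Basic TCP connect scan of glusterd plus part of the default brick port range.
nmap -sT -p 24007,49152-49251 fa01.gl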

Regards,

Xavi

Re: [Gluster-users] Gluster Periodic Brick Process Deaths

2019-12-11 Thread Ben Tasker
Hi Xavi,

We don't, as far as I'm explicitly aware, *but* I can't rule it out, as it's
quite possible some of our partners do (some/most certainly have scans done
as part of pentests fairly regularly).

But, that does at least give me an avenue to pursue in the meantime, thanks!

Ben

Re: [Gluster-users] Gluster Periodic Brick Process Deaths

2020-01-13 Thread Ben Tasker
Hi,

Just an update on this - we made our ACLs much, much stricter around
gluster ports and to my knowledge haven't seen a brick death since.
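
For reference, the sort of tightening meant here is host-level filtering so that only known peers and clients can reach glusterd (24007) and the brick port range. A hedged iptables sketch, with placeholder addresses and the default 49152+ brick range (check `gluster volume status` for the ports actually in use):

# Allow the listed peer/client addresses, then drop everyone else, for the
# Gluster management and brick ports. The addresses and the 49152:49664
# range are placeholders for this sketch.
for src in 10.0.0.11 10.0.0.12; do
    iptables -A INPUT -p tcp -s "$src" -m multiport --dports 24007,24008,49152:49664 -j ACCEPT
done
iptables -A INPUT -p tcp -m multiport --dports 24007,24008,49152:49664 -j DROP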

Ben

Re: [Gluster-users] Gluster Periodic Brick Process Deaths

2020-01-13 Thread Xavi Hernandez
Hi Ben,

We have already identified the issue that caused crashes when Gluster ports
were scanned. The fix is present in 6.7 and 7.1, so if this was the reason
for your problem, those versions should help.
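
For the CentOS 7 boxes earlier in this thread (which already pull packages from the Storage SIG repo, per the centos-release-gluster6 package in the rpm listing), picking up the fixed build would look roughly like the following. Treat it as a sketch and check which versions the repo actually offers before rolling it out:

# Refresh repo metadata and update the glusterfs packages on each node in
# turn; brick processes only pick up the fix once they have been restarted.
yum clean metadata
yum update 'glusterfs*'
systemctl restart glusterd   # then restart or force-start volumes as appropriate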

Best regards,

Xavi

Re: [Gluster-users] Gluster Periodic Brick Process Deaths

2020-01-14 Thread Nico van Royen
Hi, 

I am a bit surprised by this response. Quite recently we created a Red Hat
support case about the same issue (brick process crashing when scanned), and
Red Hat's response was simply that the "solution" was not to scan the bricks
and that this issue will not be resolved (RH support case #02551577). This,
of course, is for Red Hat's commercial GlusterFS version, currently at
v6.0-21.el7rhgs.

Would this fix also be ported to the RHGS version?
Regards, 
Nico van Roijen - ING Bank. 

