Was the suggestion to increase the server.event-threads value tried? If yes, what were the results?
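If not yet, here is a minimal sketch of how the change could be applied and verified from one of the servers. The volume name "tier2" is only inferred from the client logs quoted below; adjust it to your setup:

    # show the current value (4, per the earlier mail in this thread)
    gluster volume get tier2 server.event-threads

    # raise it to 8 as suggested below
    gluster volume set tier2 server.event-threads 8

As far as I know the brick processes pick up the new value without a restart, so it should be enough to watch whether the ping-timer expiry messages become less frequent afterwards.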
On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote: > Dear All, > > do you have any suggestions about the right way to "debug" this issue? > In attachment, the updated logs of “s06" gluster server. > > I noticed a lot of intermittent warning and error messages. > > Thank you in advance, > Mauro > > > > On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <rgowd...@redhat.com> wrote: > > > +Gluster Devel <gluster-devel@gluster.org>, +Gluster-users > <gluster-us...@gluster.org> > > I would like to point out another issue. Even if what I suggested prevents > disconnects, part of the solution would be only symptomatic treatment and > doesn't address the root cause of the problem. In most of the > ping-timer-expiry issues, the root cause is the increased load on bricks > and the inability of bricks to be responsive under high load. So, the > actual solution would be doing any or both of the following: > * identify the source of increased load and if possible throttle it. > Internal heal processes like self-heal, rebalance, quota heal are known to > pump traffic into bricks without much throttling (io-threads _might_ do > some throttling, but my understanding is its not sufficient). > * identify the reason for bricks to become unresponsive during load. This > may be fixable issues like not enough event-threads to read from network or > difficult to fix issues like fsync on backend fs freezing the process or > semi fixable issues (in code) like lock contention. > > So any genuine effort to fix ping-timer-issues (to be honest most of the > times they are not issues related to rpc/network) would involve performance > characterization of various subsystems on bricks and clients. Various > subsystems can include (but not necessarily limited to), underlying > OS/filesystem, glusterfs processes, CPU consumption etc > > regards, > Raghavendra > > On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <mauro.trid...@cmcc.it> > wrote: > >> Thank you, let’s try! >> I will inform you about the effects of the change. >> >> Regards, >> Mauro >> >> On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <rgowd...@redhat.com> >> wrote: >> >> >> >> On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <mauro.trid...@cmcc.it> >> wrote: >> >>> Hi Raghavendra, >>> >>> thank you for your reply. >>> Yes, you are right. It is a problem that seems to happen randomly. >>> At this moment, server.event-threads value is 4. I will try to increase >>> this value to 8. Do you think that it could be a valid value ? >>> >> >> Yes. We can try with that. You should see at least frequency of >> ping-timer related disconnects reduce with this value (even if it doesn't >> eliminate the problem completely). >> >> >>> Regards, >>> Mauro >>> >>> >>> On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <rgowd...@redhat.com> >>> wrote: >>> >>> >>> >>> On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <nbala...@redhat.com> >>> wrote: >>> >>>> Hi Mauro, >>>> >>>> It looks like some problem on s06. Are all your other nodes ok? Can you >>>> send us the gluster logs from this node? >>>> >>>> @Raghavendra G <raghaven...@gluster.com> , do you have any idea as to >>>> how this can be debugged? Maybe running top ? Or debug brick logs? >>>> >>> >>> If we can reproduce the problem, collecting tcpdump on both ends of >>> connection will help. But, one common problem is these bugs are >>> inconsistently reproducible and hence we may not be able to capture tcpdump >>> at correct intervals. 
Other than that, we can try to collect some evidence >>> that poller threads were busy (waiting on locks). But, not sure what debug >>> data provides that information. >>> >>> From what I know, its difficult to collect evidence for this issue and >>> we could only reason about it. >>> >>> We can try a workaround though - try increasing server.event-threads and >>> see whether ping-timer expiry issues go away with an optimal value. If >>> that's the case, it kind of provides proof for our hypothesis. >>> >>> >>>> >>>> Regards, >>>> Nithya >>>> >>>> On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <mauro.trid...@cmcc.it> >>>> wrote: >>>> >>>>> Hi All, >>>>> >>>>> some minutes ago I received this message from NAGIOS server >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ****** Nagios *****Notification Type: PROBLEMService: Brick - >>>>> /gluster/mnt2/brickHost: s06Address: s06-stgState: CRITICALDate/Time: Mon >>>>> Mar 4 10:25:33 CET 2019Additional Info:CHECK_NRPE STATE CRITICAL: Socket >>>>> timeout after 10 seconds.* >>>>> >>>>> I checked the network, RAM and CPUs usage on s06 node and everything >>>>> seems to be ok. >>>>> No bricks are in error state. In /var/log/messages, I detected again a >>>>> crash of “check_vol_utili” that I think it is a module used by NRPE >>>>> executable (that is the NAGIOS client). >>>>> >>>>> Mar 4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general >>>>> protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in >>>>> libglusterfs.so.0.0.1[7facff9b7000+f7000] >>>>> Mar 4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user >>>>> 0 killed by SIGSEGV - dumping core >>>>> Mar 4 10:15:29 s06 abrt-server: Generating core_backtrace >>>>> Mar 4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': >>>>> No such file or directory >>>>> Mar 4 10:16:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 4 10:16:01 s06 systemd: Starting User Slice of root. >>>>> Mar 4 10:16:01 s06 systemd: Started Session 201010 of user root. >>>>> Mar 4 10:16:01 s06 systemd: Starting Session 201010 of user root. >>>>> Mar 4 10:16:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 4 10:16:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 4 10:16:24 s06 abrt-server: Duplicate: UUID >>>>> Mar 4 10:16:24 s06 abrt-server: DUP_OF_DIR: >>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041 >>>>> Mar 4 10:16:24 s06 abrt-server: Deleting problem directory >>>>> ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041) >>>>> Mar 4 10:16:24 s06 abrt-server: Generating core_backtrace >>>>> Mar 4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': >>>>> No such file or directory >>>>> Mar 4 10:16:24 s06 abrt-server: Cannot notify >>>>> '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event >>>>> 'report_uReport' exited with 1 >>>>> Mar 4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user >>>>> 0 killed by SIGABRT - dumping core >>>>> Mar 4 10:16:25 s06 abrt-server: Generating core_backtrace >>>>> Mar 4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': >>>>> No such file or directory >>>>> Mar 4 10:17:01 s06 systemd: Created slice User Slice of root. 
>>>>> >>>>> Also, I noticed the following errors that I think are very critical: >>>>> >>>>> Mar 4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: >>>>> server >>>>> 192.168.0.55:49158 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:22:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 4 10:22:01 s06 systemd: Starting User Slice of root. >>>>> Mar 4 10:22:01 s06 systemd: Started Session 201017 of user root. >>>>> Mar 4 10:22:01 s06 systemd: Starting Session 201017 of user root. >>>>> Mar 4 10:22:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 4 10:22:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: >>>>> server >>>>> 192.168.0.54:49165 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:23:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 4 10:23:01 s06 systemd: Starting User Slice of root. >>>>> Mar 4 10:23:01 s06 systemd: Started Session 201018 of user root. >>>>> Mar 4 10:23:01 s06 systemd: Starting Session 201018 of user root. >>>>> Mar 4 10:23:02 s06 systemd: Removed slice User Slice of root. >>>>> Mar 4 10:23:02 s06 systemd: Stopping User Slice of root. >>>>> Mar 4 10:24:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 4 10:24:01 s06 systemd: Starting User Slice of root. >>>>> Mar 4 10:24:01 s06 systemd: Started Session 201019 of user root. >>>>> Mar 4 10:24:01 s06 systemd: Starting Session 201019 of user root. >>>>> Mar 4 10:24:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 4 10:24:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: >>>>> server >>>>> 192.168.0.52:49158 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server >>>>> 192.168.0.51:49153 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: >>>>> server >>>>> 192.168.0.52:49156 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: >>>>> server >>>>> 192.168.0.51:49159 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: >>>>> server >>>>> 192.168.0.52:49161 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: >>>>> server >>>>> 192.168.0.54:49165 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: >>>>> server >>>>> 192.168.0.54:49155 has not responded in the last 42 seconds, >>>>> disconnecting. 
>>>>> Mar 4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: >>>>> server >>>>> 192.168.0.52:49161 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: >>>>> server >>>>> 192.168.0.53:49159 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:25:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 4 10:25:01 s06 systemd: Starting User Slice of root. >>>>> Mar 4 10:25:01 s06 systemd: Started Session 201020 of user root. >>>>> Mar 4 10:25:01 s06 systemd: Starting Session 201020 of user root. >>>>> Mar 4 10:25:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 4 10:25:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 4 10:25:57 s06 systemd: Created slice User Slice of root. >>>>> Mar 4 10:25:57 s06 systemd: Starting User Slice of root. >>>>> Mar 4 10:25:57 s06 systemd-logind: New session 201021 of user root. >>>>> Mar 4 10:25:57 s06 systemd: Started Session 201021 of user root. >>>>> Mar 4 10:25:57 s06 systemd: Starting Session 201021 of user root. >>>>> Mar 4 10:26:01 s06 systemd: Started Session 201022 of user root. >>>>> Mar 4 10:26:01 s06 systemd: Starting Session 201022 of user root. >>>>> Mar 4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL >>>>> handshake with 192.168.1.56: 5 >>>>> Mar 4 10:27:01 s06 systemd: Started Session 201023 of user root. >>>>> Mar 4 10:27:01 s06 systemd: Starting Session 201023 of user root. >>>>> Mar 4 10:28:01 s06 systemd: Started Session 201024 of user root. >>>>> Mar 4 10:28:01 s06 systemd: Starting Session 201024 of user root. >>>>> Mar 4 10:29:01 s06 systemd: Started Session 201025 of user root. >>>>> Mar 4 10:29:01 s06 systemd: Starting Session 201025 of user root. >>>>> >>>>> But, unfortunately, I don’t understand why it is happening. >>>>> Now, NAGIOS server shows that s06 status is ok: >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ****** Nagios *****Notification Type: RECOVERYService: Brick - >>>>> /gluster/mnt2/brickHost: s06Address: s06-stgState: OKDate/Time: Mon Mar 4 >>>>> 10:35:23 CET 2019Additional Info:OK: Brick /gluster/mnt2/brick is up* >>>>> >>>>> Nothing is changed from RAM, CPUs, and NETWORK point of view. >>>>> /var/log/message file has been updated: >>>>> >>>>> Mar 4 10:32:01 s06 systemd: Starting Session 201029 of user root. >>>>> Mar 4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: >>>>> server >>>>> 192.168.0.52:49156 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: >>>>> server >>>>> 192.168.0.54:49167 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 4 10:33:01 s06 systemd: Started Session 201030 of user root. >>>>> Mar 4 10:33:01 s06 systemd: Starting Session 201030 of user root. >>>>> Mar 4 10:34:01 s06 systemd: Started Session 201031 of user root. >>>>> Mar 4 10:34:01 s06 systemd: Starting Session 201031 of user root. >>>>> Mar 4 10:35:01 s06 nrpe[162562]: Could not read request from client >>>>> 192.168.1.56, bailing out... >>>>> Mar 4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown. 
>>>>> Mar 4 10:35:01 s06 systemd: Started Session 201032 of user root. >>>>> Mar 4 10:35:01 s06 systemd: Starting Session 201032 of user root. >>>>> >>>>> Could you please help me to understand what it’s happening ? >>>>> Thank you in advance. >>>>> >>>>> Rergards, >>>>> Mauro >>>>> >>>>> >>>>> On 1 Mar 2019, at 12:17, Mauro Tridici <mauro.trid...@cmcc.it> wrote: >>>>> >>>>> >>>>> Thank you, Milind. >>>>> I executed the instructions you suggested: >>>>> >>>>> - grep “blocked for” /var/log/messages on s06 returns no output (no >>>>> “blocked” word is detected in messages file); >>>>> - in /var/log/messages file I can see this kind of error repeated for >>>>> a lot of times: >>>>> >>>>> Mar 1 08:43:01 s06 systemd: Starting Session 196071 of user root. >>>>> Mar 1 08:43:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 1 08:43:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general >>>>> protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in >>>>> libglusterfs.so.0.0.1[7f88e769b000+f7000] >>>>> Mar 1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user >>>>> 0 killed by SIGSEGV - dumping core >>>>> Mar 1 08:43:02 s06 abrt-server: Generating core_backtrace >>>>> Mar 1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': >>>>> No such file or directory >>>>> Mar 1 08:43:58 s06 abrt-server: Duplicate: UUID >>>>> Mar 1 08:43:58 s06 abrt-server: DUP_OF_DIR: >>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041 >>>>> Mar 1 08:43:58 s06 abrt-server: Deleting problem directory >>>>> ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041) >>>>> Mar 1 08:43:58 s06 dbus[1872]: [system] Activating service >>>>> name='org.freedesktop.problems' (using servicehelper) >>>>> Mar 1 08:43:58 s06 dbus[1872]: [system] Successfully activated >>>>> service 'org.freedesktop.problems' >>>>> Mar 1 08:43:58 s06 abrt-server: Generating core_backtrace >>>>> Mar 1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': >>>>> No such file or directory >>>>> Mar 1 08:43:58 s06 abrt-server: Cannot notify >>>>> '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event >>>>> 'report_uReport' exited with 1 >>>>> Mar 1 08:44:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 1 08:44:01 s06 systemd: Starting User Slice of root. >>>>> Mar 1 08:44:01 s06 systemd: Started Session 196072 of user root. >>>>> Mar 1 08:44:01 s06 systemd: Starting Session 196072 of user root. >>>>> Mar 1 08:44:01 s06 systemd: Removed slice User Slice of root. >>>>> >>>>> - in /var/log/messages file I can see also 4 errors related to other >>>>> cluster servers: >>>>> >>>>> Mar 1 11:05:01 s06 systemd: Starting User Slice of root. >>>>> Mar 1 11:05:01 s06 systemd: Started Session 196230 of user root. >>>>> Mar 1 11:05:01 s06 systemd: Starting Session 196230 of user root. >>>>> Mar 1 11:05:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 1 11:05:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: >>>>> server >>>>> 192.168.0.51:49163 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 1 11:06:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 1 11:06:01 s06 systemd: Starting User Slice of root. >>>>> Mar 1 11:06:01 s06 systemd: Started Session 196231 of user root. >>>>> Mar 1 11:06:01 s06 systemd: Starting Session 196231 of user root. 
>>>>> Mar 1 11:06:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 1 11:06:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server >>>>> 192.168.0.52:49153 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server >>>>> 192.168.0.52:49155 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 1 11:07:01 s06 systemd: Created slice User Slice of root. >>>>> Mar 1 11:07:01 s06 systemd: Starting User Slice of root. >>>>> Mar 1 11:07:01 s06 systemd: Started Session 196232 of user root. >>>>> Mar 1 11:07:01 s06 systemd: Starting Session 196232 of user root. >>>>> Mar 1 11:07:01 s06 systemd: Removed slice User Slice of root. >>>>> Mar 1 11:07:01 s06 systemd: Stopping User Slice of root. >>>>> Mar 1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C >>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server >>>>> 192.168.0.51:49152 has not responded in the last 42 seconds, >>>>> disconnecting. >>>>> Mar 1 11:08:01 s06 systemd: Created slice User Slice of root. >>>>> >>>>> No “blocked” word is in /var/log/messages files on other cluster >>>>> servers. >>>>> In attachment, the /var/log/messages file from s06 server. >>>>> >>>>> Thank you in advance, >>>>> Mauro >>>>> >>>>> <messages.zip> >>>>> >>>>> >>>>> On 1 Mar 2019, at 11:47, Milind Changire <mchan...@redhat.com> wrote: >>>>> >>>>> The traces of very high disk activity on the servers are often found >>>>> in /var/log/messages >>>>> You might want to grep for "blocked for" in /var/log/messages on s06 >>>>> and correlate the timestamps to confirm the unresponsiveness as reported >>>>> in >>>>> gluster client logs. >>>>> In cases of high disk activity, although the operating system >>>>> continues to respond to ICMP pings, the processes writing to disks often >>>>> get blocked to a large flush to the disk which could span beyond 42 >>>>> seconds >>>>> and hence result in ping-timer-expiry logs. >>>>> >>>>> As a side note: >>>>> If you indeed find gluster processes being blocked in >>>>> /var/log/messages, you might want to tweak sysctl tunables called >>>>> vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value >>>>> than the existing. Please read up more on those tunables before touching >>>>> the settings. >>>>> >>>>> >>>>> On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <mauro.trid...@cmcc.it> >>>>> wrote: >>>>> >>>>>> >>>>>> Hi all, >>>>>> >>>>>> in attachment the client log captured after changing >>>>>> network.ping-timeout option. 
>>>>>> I noticed this error involving server 192.168.0.56 (s06) >>>>>> >>>>>> [2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] >>>>>> 0-tier2-client-71: changing ping timeout to 42 (from 0) >>>>>> [2019-03-01 09:23:36.078213] I >>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in >>>>>> volfile,continuing >>>>>> [2019-03-01 09:23:36.078432] I >>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in >>>>>> volfile,continuing >>>>>> [2019-03-01 09:23:36.092357] I >>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in >>>>>> volfile,continuing >>>>>> [2019-03-01 09:23:36.094146] I >>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in >>>>>> volfile,continuing >>>>>> [2019-03-01 10:06:24.708082] C >>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: >>>>>> server >>>>>> 192.168.0.56:49156 has not responded in the last 42 seconds, >>>>>> disconnecting. >>>>>> >>>>>> I don’t know why it happens, s06 server seems to be reachable. >>>>>> >>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156 >>>>>> Trying 192.168.0.56... >>>>>> Connected to 192.168.0.56. >>>>>> Escape character is '^]'. >>>>>> ^CConnection closed by foreign host. >>>>>> [athena_login2][/users/home/sysm02/]> ping 192.168.0.56 >>>>>> PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data. >>>>>> 64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms >>>>>> 64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms >>>>>> >>>>>> --- 192.168.0.56 ping statistics --- >>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 1528ms >>>>>> rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms >>>>>> >>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156 >>>>>> Trying 192.168.0.56... >>>>>> Connected to 192.168.0.56. >>>>>> Escape character is '^]'. >>>>>> >>>>>> Thank you for your help, >>>>>> Mauro >>>>>> >>>>>> >>>>>> >>>>>> On 1 Mar 2019, at 10:29, Mauro Tridici <mauro.trid...@cmcc.it> wrote: >>>>>> >>>>>> Hi all, >>>>>> >>>>>> thank you for the explanation. >>>>>> I just changed network.ping-timeout option to default value >>>>>> (network.ping-timeout=42). >>>>>> >>>>>> I will check the logs to see if the errors will appear again. >>>>>> >>>>>> Regards, >>>>>> Mauro >>>>>> >>>>>> On 1 Mar 2019, at 04:43, Milind Changire <mchan...@redhat.com> wrote: >>>>>> >>>>>> network.ping-timeout should not be set to zero for non-glusterd >>>>>> clients. >>>>>> glusterd is a special case for which ping-timeout is set to zero via >>>>>> /etc/glusterfs/glusterd.vol >>>>>> >>>>>> Setting network.ping-timeout to zero disables arming of the ping >>>>>> timer for connections. This disables testing the connection for >>>>>> responsiveness and hence avoids proactive fail-over. >>>>>> >>>>>> Please reset network.ping-timeout to a non-zero positive value, eg. 42 >>>>>> >>>>>> >>>>>> On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran < >>>>>> nbala...@redhat.com> wrote: >>>>>> >>>>>>> Adding Raghavendra and Milind to comment on this. >>>>>>> >>>>>>> What is the effect of setting network.ping-timeout to 0 and should >>>>>>> it be set back to 42? >>>>>>> Regards, >>>>>>> Nithya >>>>>>> >>>>>>> On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <mauro.trid...@cmcc.it> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Nithya, >>>>>>>> >>>>>>>> sorry for the late. >>>>>>>> network.ping-timeout has been set to 0 in order to try to solve >>>>>>>> some timeout problems, but it didn’t help. 
>>>>>>>> I can set it to the default value. >>>>>>>> >>>>>>>> Can I proceed with the change? >>>>>>>> >>>>>>>> Thank you, >>>>>>>> Mauro >>>>>>>> >>>>>>>> >>>>>>>> On 28 Feb 2019, at 04:41, Nithya Balachandran <nbala...@redhat.com> >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hi Mauro, >>>>>>>> >>>>>>>> Is network.ping-timeout still set to 0. The default value is 42. Is >>>>>>>> there a particular reason why this was changed? >>>>>>>> >>>>>>>> Regards, >>>>>>>> Nithya >>>>>>>> >>>>>>>> >>>>>>>> On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <mauro.trid...@cmcc.it> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> >>>>>>>>> Hi Xavi, >>>>>>>>> >>>>>>>>> thank you for the detailed explanation and suggestions. >>>>>>>>> Yes, transport.listen-backlog option is still set to 1024. >>>>>>>>> >>>>>>>>> I will check the network and connectivity status using “ping” and >>>>>>>>> “telnet” as soon as the errors will come back again. >>>>>>>>> >>>>>>>>> Regards, >>>>>>>>> Mauro >>>>>>>>> >>>>>>>>> Il giorno 27 feb 2019, alle ore 16:42, Xavi Hernandez < >>>>>>>>> jaher...@redhat.com> ha scritto: >>>>>>>>> >>>>>>>>> Hi Mauro, >>>>>>>>> >>>>>>>>> those errors say that the mount point is not connected to some of >>>>>>>>> the bricks while executing operations. I see references to 3rd and 6th >>>>>>>>> bricks of several disperse sets, which seem to map to server s06. For >>>>>>>>> some >>>>>>>>> reason, gluster is having troubles connecting from the client machine >>>>>>>>> to >>>>>>>>> that particular server. At the end of the log I see that after long >>>>>>>>> time a >>>>>>>>> reconnect is done to both of them. However little after, other bricks >>>>>>>>> from >>>>>>>>> the s05 get disconnected and a reconnect times out. >>>>>>>>> >>>>>>>>> That's really odd. It seems like if server/communication is cut to >>>>>>>>> s06 for some time, then restored, and then the same happens to the >>>>>>>>> next >>>>>>>>> server. >>>>>>>>> >>>>>>>>> If the servers are really online and it's only a communication >>>>>>>>> issue, it explains why server memory and network has increased: if the >>>>>>>>> problem only exists between the client and servers, any write made by >>>>>>>>> the >>>>>>>>> client will automatically mark the file as damaged, since some of the >>>>>>>>> servers have not been updated. Since self-heal runs from the server >>>>>>>>> nodes, >>>>>>>>> they will probably be correctly connected to all bricks, which allows >>>>>>>>> them >>>>>>>>> to heal the just damaged file, which increases memory and network >>>>>>>>> usage. >>>>>>>>> >>>>>>>>> I guess you still have transport.listen-backlog set to 1024, right >>>>>>>>> ? >>>>>>>>> >>>>>>>>> Just to try to identify if the problem really comes from network, >>>>>>>>> can you check if you lose some pings from the client to all of the >>>>>>>>> servers >>>>>>>>> while you are seeing those errors in the log file ? >>>>>>>>> >>>>>>>>> You can also check if during those errors, you can telnet to the >>>>>>>>> port of the brick from the client. >>>>>>>>> >>>>>>>>> Xavi >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici < >>>>>>>>> mauro.trid...@cmcc.it> wrote: >>>>>>>>> >>>>>>>>>> Hi Nithya, >>>>>>>>>> >>>>>>>>>> “df -h” operation is not still slow, but no users are using the >>>>>>>>>> volume, RAM and NETWORK usage is ok on the client node. 
>>>>>>>>>> >>>>>>>>>> I was worried about this kind of warnings/errors: >>>>>>>>>> >>>>>>>>>> [2019-02-25 10:59:00.664323] W [MSGID: 122035] >>>>>>>>>> [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing >>>>>>>>>> operation >>>>>>>>>> with some subvolumes unavailable (20) >>>>>>>>>> >>>>>>>>>> [2019-02-26 03:11:35.212603] E >>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (--> >>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] >>>>>>>>>> (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] >>>>>>>>>> (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] >>>>>>>>>> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] >>>>>>>>>> ))))) >>>>>>>>>> 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) >>>>>>>>>> op(WRITE(13)) >>>>>>>>>> called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5) >>>>>>>>>> >>>>>>>>>> [2019-02-26 03:13:03.313831] E >>>>>>>>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection >>>>>>>>>> to >>>>>>>>>> 192.168.0.56:49156 failed (Timeout della connessione); >>>>>>>>>> disconnecting socket >>>>>>>>>> >>>>>>>>>> It seems that some subvolumes are not available and 192.168.0.56 >>>>>>>>>> server (s06) is not reachable. >>>>>>>>>> But gluster servers are up&running and bricks are ok. >>>>>>>>>> >>>>>>>>>> In attachment the updated tier2.log file. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Thank you. >>>>>>>>>> Regards, >>>>>>>>>> Mauro >>>>>>>>>> >>>>>>>>>> Il giorno 26 feb 2019, alle ore 04:03, Nithya Balachandran < >>>>>>>>>> nbala...@redhat.com> ha scritto: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> >>>>>>>>>> I see a lot of EC messages in the log but they don't seem very >>>>>>>>>> serious. Xavi, can you take a look? 
>>>>>>>>>> >>>>>>>>>> The only errors I see are: >>>>>>>>>> [2019-02-25 10:58:45.519871] E >>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (--> >>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] >>>>>>>>>> (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] >>>>>>>>>> (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] >>>>>>>>>> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] >>>>>>>>>> ))))) >>>>>>>>>> 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) >>>>>>>>>> op(WRITE(13)) >>>>>>>>>> called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7) >>>>>>>>>> [2019-02-25 10:58:51.461493] E >>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (--> >>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] >>>>>>>>>> (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] >>>>>>>>>> (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> >>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] >>>>>>>>>> (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] >>>>>>>>>> ))))) >>>>>>>>>> 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) >>>>>>>>>> op(WRITE(13)) >>>>>>>>>> called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a) >>>>>>>>>> [2019-02-25 11:07:57.152874] E >>>>>>>>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection >>>>>>>>>> to >>>>>>>>>> 192.168.0.55:49163 failed (Timeout della connessione); >>>>>>>>>> disconnecting socket >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> Is the df -h operation still slow? If yes, can you take a tcpdump >>>>>>>>>> of the client while running df -h and send that across? >>>>>>>>>> >>>>>>>>>> Regards, >>>>>>>>>> Nithya >>>>>>>>>> >>>>>>>>>> On Mon, 25 Feb 2019 at 17:27, Mauro Tridici < >>>>>>>>>> mauro.trid...@cmcc.it> wrote: >>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Sorry, some minutes after my last mail message, I noticed that >>>>>>>>>>> “df -h” command hanged for a while before returns the prompt. >>>>>>>>>>> Yesterday, everything was ok in the gluster client log, but, >>>>>>>>>>> today, I see a lot of errors (please, take a look to the attached >>>>>>>>>>> file). >>>>>>>>>>> >>>>>>>>>>> On the client node, I detected an important RAM e NETWORK usage. >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Do you think that the errors have been caused by the client >>>>>>>>>>> resources usage? >>>>>>>>>>> >>>>>>>>>>> Thank you in advance, >>>>>>>>>>> Mauro >>>>>>>>>>> >>>>>>>>>>> >>>>> >> >> > >
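For the next occurrence, a rough evidence-gathering checklist distilled from the suggestions quoted above. The volume name "tier2", the host 192.168.0.56 and the brick port 49156 are taken from the logs in this thread and are only examples:

    # tasks blocked on disk I/O around the time of the expiry (Milind's suggestion)
    grep "blocked for" /var/log/messages

    # traffic capture on both ends of a suspect brick connection
    tcpdump -i any -s 0 -w /tmp/brick-49156.pcap host 192.168.0.56 and port 49156

    # confirm ping-timeout is back at the default of 42
    gluster volume get tier2 network.ping-timeout

    # current writeback tunables, in case you decide to lower them
    sysctl vm.dirty_background_ratio vm.dirty_background_bytes

None of this proves the poller-thread hypothesis by itself, but correlating timestamps across these sources should at least narrow down whether the stall is on the network, on the brick filesystem, or inside the brick process.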
_______________________________________________ Gluster-devel mailing list Gluster-devel@gluster.org https://lists.gluster.org/mailman/listinfo/gluster-devel