On Thu, Mar 21, 2019 at 4:10 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
> Hi Raghavendra,
>
> the number of errors reduced, but during the last days I received some error notifications from the Nagios server similar to the following one:
>
> ****** Nagios *****
> Notification Type: PROBLEM
> Service: Brick - /gluster/mnt5/brick
> Host: s04
> Address: s04-stg
> State: CRITICAL
> Date/Time: Mon Mar 18 19:56:36 CET 2019
> Additional Info: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
>
> The error was related only to the s04 gluster server.
>
> So, following your suggestions, I executed the top command on the s04 node.
> In attachment, you can find the related output.

The top output doesn't contain cmd/thread names. Was there anything wrong?

> Thank you very much for your help.
> Regards,
> Mauro
>
> On 14 Mar 2019, at 13:31, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>
> Thanks Mauro.
>
> On Thu, Mar 14, 2019 at 3:38 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>
>> Hi Raghavendra,
>>
>> I just changed the client option value to 8.
>> I will check the volume behaviour during the next hours.
>>
>> The GlusterFS version is 3.12.14.
>>
>> I will provide you the logs as soon as the activity load is high.
>> Thank you,
>> Mauro
>>
>> On 14 Mar 2019, at 04:57, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>
>> On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>
>>> Hi Raghavendra,
>>>
>>> Yes, server.event-thread has been changed from 4 to 8.
>>
>> Was the client.event-thread value also changed to 8? If not, I would like to know the results of including this tuning too. Also, if possible, can you get the output of the following command from the problematic clients and bricks (during the period when load tends to be high and ping-timer-expiry is seen)?
>>
>> # top -bHd 3
>>
>> This will help us know the CPU utilization of the event-threads.
>>
>> And I forgot to ask, what version of GlusterFS are you using?
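[Editorial note] When a batch `top` capture comes back without thread names, a one-shot per-thread `ps` listing provides the same information. This is a minimal sketch; the `glfs_epoll` thread-name pattern is an assumption based on common glusterfs thread naming and may need adjusting for a given build.

```shell
# One-shot per-thread CPU snapshot, as an alternative to a continuous
# `top -bHd 3` capture. -L lists threads; sorted with busiest threads first.
ps -eLo pid,tid,pcpu,comm --sort=-pcpu > /tmp/threads.txt

# Busiest threads first.
head -n 15 /tmp/threads.txt

# glusterfs event (poller) threads typically show names like "glfs_epoll000";
# this pattern is an assumption, adjust for your build. `|| true` keeps the
# snippet from failing on hosts without gluster.
grep 'glfs_epoll' /tmp/threads.txt || true
```

Repeating the snapshot every few seconds during a high-load window approximates the original `-d 3` sampling.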
>>> During the last days, I noticed that the error events are still here, although they have been considerably reduced.
>>>
>>> So, I used the grep command against the log files in order to give you a global view of the warning, error and critical events that appeared today at 06:xx (I hope it may be useful).
>>> I collected the info from the s06 gluster server, but the behaviour is almost the same on the other gluster servers.
>>>
>>> ERRORS:
>>> CWD: /var/log/glusterfs
>>> COMMAND: grep " E " *.log | grep "2019-03-13 06:"
>>>
>>> (I can see a lot of this kind of message in the same period, but I'm reporting only one record for each type of error)
>>>
>>> glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of /var/run/gluster/tier2_quota_list/
>>>
>>> glustershd.log:[2019-03-13 06:14:28.666562] E [rpc-clnt.c:350:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7f4a71ba3640] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50)
>>>
>>> glustershd.log:[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
>>> glustershd.log:[2019-03-13 06:19:58.931798] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
>>> glustershd.log:[2019-03-13 06:22:08.979829] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 192.168.0.55:49158 failed (Connection timed out); disconnecting socket
>>> glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
>>> glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
>>> glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote operation failed [Transport endpoint is not connected]
>>>
>>> WARNINGS:
>>> CWD: /var/log/glusterfs
>>> COMMAND: grep " W " *.log | grep "2019-03-13 06:"
>>>
>>> (I can see a lot of this kind of message in the same period, but I'm reporting only one record for each type of warning)
>>>
>>> glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote operation failed. Path: <gfid:b6b35d0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: (null) [Transport endpoint is not connected]
>>>
>>> glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation with some subvolumes unavailable (2)
>>>
>>> glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get index-dir on tier2-client-55 [Operation now in progress]
>>>
>>> quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is deprecated, preferred is 'transport.address-family', continuing with correction
>>> quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 'parallel-readdir' is not recognized
>>> quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) [0x7f340892be25] -->/usr/sbin/glusterfs(glusterfs_sigwaiter+0xe5) [0x55ef010164b5] -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received signum (15), shutting down
>>>
>>> CRITICALS:
>>> CWD: /var/log/glusterfs
>>> COMMAND: grep " C " *.log | grep "2019-03-13 06:"
>>>
>>> no critical errors at 06:xx; only one critical error during the day:
>>>
>>> [root@s06 glusterfs]# grep " C " *.log | grep "2019-03-13"
>>> glustershd.log:[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
>>>
>>> Thank you very much for your help.
>>> Regards,
>>> Mauro
>>>
>>> On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>
>>> Was the suggestion to increase server.event-thread values tried? If yes, what were the results?
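[Editorial note] The severity-grep pattern used above is easy to demonstrate on a synthetic file; the log content below is made up and only the format mimics glusterfs logs.

```shell
cd "$(mktemp -d)"

# Synthetic log entries (hypothetical content in glusterfs log format).
cat > glustershd.log <<'EOF'
[2019-03-13 06:17:48.883825] E [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection failed
[2019-03-13 06:14:31.421576] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: subvolumes unavailable
[2019-03-13 02:21:29.126279] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: ping timer expired
EOF

# Count events per severity inside the 06:xx window, as in the mail.
# The severity letter is surrounded by spaces in every log line, so
# " E ", " W " and " C " select by severity without false positives here.
for sev in E W C; do
  printf '%s: %d\n' "$sev" "$(grep " $sev " glustershd.log | grep -c '2019-03-13 06:')"
done
# → E: 1
#   W: 1
#   C: 0
```

The critical entry is counted as 0 because its timestamp (02:21) falls outside the 06:xx window, matching what was reported above.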
>>> On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>
>>>> Dear All,
>>>>
>>>> do you have any suggestions about the right way to "debug" this issue?
>>>> In attachment, the updated logs of the "s06" gluster server.
>>>>
>>>> I noticed a lot of intermittent warning and error messages.
>>>>
>>>> Thank you in advance,
>>>> Mauro
>>>>
>>>> On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>
>>>> +Gluster Devel <gluster-devel@gluster.org>, +Gluster-users <gluster-us...@gluster.org>
>>>>
>>>> I would like to point out another issue. Even if what I suggested prevents disconnects, that part of the solution would only be symptomatic treatment and doesn't address the root cause of the problem. In most of the ping-timer-expiry issues, the root cause is the increased load on bricks and the inability of bricks to be responsive under high load. So, the actual solution would be doing either or both of the following:
>>>> * identify the source of increased load and, if possible, throttle it. Internal heal processes like self-heal, rebalance and quota heal are known to pump traffic into bricks without much throttling (io-threads _might_ do some throttling, but my understanding is that it's not sufficient).
>>>> * identify the reason for bricks becoming unresponsive during load. This may be a fixable issue, like not enough event-threads to read from the network; a difficult-to-fix issue, like fsync on the backend filesystem freezing the process; or a semi-fixable issue (in code), like lock contention.
>>>>
>>>> So any genuine effort to fix ping-timer issues (to be honest, most of the time they are not issues related to rpc/network) would involve performance characterization of various subsystems on bricks and clients. Various subsystems can include (but are not necessarily limited to) the underlying OS/filesystem, glusterfs processes, CPU consumption, etc.
>>>>
>>>> regards,
>>>> Raghavendra
>>>>
>>>> On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>
>>>>> Thank you, let's try!
>>>>> I will inform you about the effects of the change.
>>>>>
>>>>> Regards,
>>>>> Mauro
>>>>>
>>>>> On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>
>>>>> On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>
>>>>>> Hi Raghavendra,
>>>>>>
>>>>>> thank you for your reply.
>>>>>> Yes, you are right. It is a problem that seems to happen randomly.
>>>>>> At this moment, the server.event-threads value is 4. I will try to increase this value to 8. Do you think that it could be a valid value?
>>>>>
>>>>> Yes, we can try with that. You should at least see the frequency of ping-timer related disconnects reduce with this value (even if it doesn't eliminate the problem completely).
>>>>>
>>>>>> Regards,
>>>>>> Mauro
>>>>>>
>>>>>> On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>>>>>>
>>>>>> On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <nbala...@redhat.com> wrote:
>>>>>>
>>>>>>> Hi Mauro,
>>>>>>>
>>>>>>> It looks like some problem on s06. Are all your other nodes ok? Can you send us the gluster logs from this node?
>>>>>>>
>>>>>>> @Raghavendra G <raghaven...@gluster.com>, do you have any idea as to how this can be debugged? Maybe running top? Or debug brick logs?
>>>>>>
>>>>>> If we can reproduce the problem, collecting a tcpdump on both ends of the connection will help. But one common problem is that these bugs are inconsistently reproducible, and hence we may not be able to capture the tcpdump at the correct intervals.
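[Editorial note] A sketch of the kind of two-sided capture being suggested. The brick port 49158 is taken from the logs earlier in this thread; the command would have to run as root on both the client and the brick host while the problem reproduces, so it is only printed here rather than executed.

```shell
# Build the capture command; printed instead of executed because tcpdump
# requires root privileges and a live problem window to be useful.
BRICK_PORT=49158
CAPTURE="tcpdump -i any -s 0 -w /var/tmp/gluster-${BRICK_PORT}.pcap port ${BRICK_PORT}"
echo "$CAPTURE"
# → tcpdump -i any -s 0 -w /var/tmp/gluster-49158.pcap port 49158
```

Running the same filter on both the client and the brick makes it possible to tell whether ping packets left one side and never arrived at the other, or arrived but were answered late.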
>>>>>> Other than that, we can try to collect some evidence that poller threads were busy (waiting on locks). But I'm not sure what debug data provides that information.
>>>>>>
>>>>>> From what I know, it's difficult to collect evidence for this issue, and we can only reason about it.
>>>>>>
>>>>>> We can try a workaround though: try increasing server.event-threads and see whether the ping-timer expiry issues go away with an optimal value. If that's the case, it kind of provides proof for our hypothesis.
>>>>>>
>>>>>>> Regards,
>>>>>>> Nithya
>>>>>>>
>>>>>>> On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> some minutes ago I received this message from the NAGIOS server:
>>>>>>>>
>>>>>>>> ****** Nagios *****
>>>>>>>> Notification Type: PROBLEM
>>>>>>>> Service: Brick - /gluster/mnt2/brick
>>>>>>>> Host: s06
>>>>>>>> Address: s06-stg
>>>>>>>> State: CRITICAL
>>>>>>>> Date/Time: Mon Mar 4 10:25:33 CET 2019
>>>>>>>> Additional Info: CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
>>>>>>>>
>>>>>>>> I checked the network, RAM and CPU usage on the s06 node and everything seems to be ok.
>>>>>>>> No bricks are in error state. In /var/log/messages, I detected again a crash of "check_vol_utili", which I think is a module used by the NRPE executable (that is, the NAGIOS client).
>>>>>>>> Mar 4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in libglusterfs.so.0.0.1[7facff9b7000+f7000]
>>>>>>>> Mar 4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 killed by SIGSEGV - dumping core
>>>>>>>> Mar 4 10:15:29 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar 4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
>>>>>>>> Mar 4 10:16:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 4 10:16:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 4 10:16:01 s06 systemd: Started Session 201010 of user root.
>>>>>>>> Mar 4 10:16:01 s06 systemd: Starting Session 201010 of user root.
>>>>>>>> Mar 4 10:16:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 4 10:16:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 4 10:16:24 s06 abrt-server: Duplicate: UUID
>>>>>>>> Mar 4 10:16:24 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>>>> Mar 4 10:16:24 s06 abrt-server: Deleting problem directory ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
>>>>>>>> Mar 4 10:16:24 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar 4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
>>>>>>>> Mar 4 10:16:24 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
>>>>>>>> Mar 4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 killed by SIGABRT - dumping core
>>>>>>>> Mar 4 10:16:25 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar 4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
>>>>>>>> Mar 4 10:17:01 s06 systemd: Created slice User Slice of root.
>>>>>>>>
>>>>>>>> Also, I noticed the following errors that I think are very critical:
>>>>>>>>
>>>>>>>> Mar 4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 192.168.0.55:49158 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:22:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 4 10:22:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 4 10:22:01 s06 systemd: Started Session 201017 of user root.
>>>>>>>> Mar 4 10:22:01 s06 systemd: Starting Session 201017 of user root.
>>>>>>>> Mar 4 10:22:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 4 10:22:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:23:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 4 10:23:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 4 10:23:01 s06 systemd: Started Session 201018 of user root.
>>>>>>>> Mar 4 10:23:01 s06 systemd: Starting Session 201018 of user root.
>>>>>>>> Mar 4 10:23:02 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 4 10:23:02 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 4 10:24:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 4 10:24:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 4 10:24:01 s06 systemd: Started Session 201019 of user root.
>>>>>>>> Mar 4 10:24:01 s06 systemd: Starting Session 201019 of user root.
>>>>>>>> Mar 4 10:24:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 4 10:24:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: server 192.168.0.52:49158 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: server 192.168.0.51:49153 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: server 192.168.0.51:49159 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: server 192.168.0.54:49165 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: server 192.168.0.54:49155 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: server 192.168.0.52:49161 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: server 192.168.0.53:49159 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:25:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 4 10:25:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 4 10:25:01 s06 systemd: Started Session 201020 of user root.
>>>>>>>> Mar 4 10:25:01 s06 systemd: Starting Session 201020 of user root.
>>>>>>>> Mar 4 10:25:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 4 10:25:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 4 10:25:57 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 4 10:25:57 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 4 10:25:57 s06 systemd-logind: New session 201021 of user root.
>>>>>>>> Mar 4 10:25:57 s06 systemd: Started Session 201021 of user root.
>>>>>>>> Mar 4 10:25:57 s06 systemd: Starting Session 201021 of user root.
>>>>>>>> Mar 4 10:26:01 s06 systemd: Started Session 201022 of user root.
>>>>>>>> Mar 4 10:26:01 s06 systemd: Starting Session 201022 of user root.
>>>>>>>> Mar 4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake with 192.168.1.56: 5
>>>>>>>> Mar 4 10:27:01 s06 systemd: Started Session 201023 of user root.
>>>>>>>> Mar 4 10:27:01 s06 systemd: Starting Session 201023 of user root.
>>>>>>>> Mar 4 10:28:01 s06 systemd: Started Session 201024 of user root.
>>>>>>>> Mar 4 10:28:01 s06 systemd: Starting Session 201024 of user root.
>>>>>>>> Mar 4 10:29:01 s06 systemd: Started Session 201025 of user root.
>>>>>>>> Mar 4 10:29:01 s06 systemd: Starting Session 201025 of user root.
>>>>>>>>
>>>>>>>> But, unfortunately, I don't understand why it is happening.
>>>>>>>> Now, the NAGIOS server shows that the s06 status is ok:
>>>>>>>>
>>>>>>>> ****** Nagios *****
>>>>>>>> Notification Type: RECOVERY
>>>>>>>> Service: Brick - /gluster/mnt2/brick
>>>>>>>> Host: s06
>>>>>>>> Address: s06-stg
>>>>>>>> State: OK
>>>>>>>> Date/Time: Mon Mar 4 10:35:23 CET 2019
>>>>>>>> Additional Info: OK: Brick /gluster/mnt2/brick is up
>>>>>>>>
>>>>>>>> Nothing has changed from the RAM, CPU and network point of view.
>>>>>>>> The /var/log/messages file has been updated:
>>>>>>>>
>>>>>>>> Mar 4 10:32:01 s06 systemd: Starting Session 201029 of user root.
>>>>>>>> Mar 4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: server 192.168.0.52:49156 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: server 192.168.0.54:49167 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 4 10:33:01 s06 systemd: Started Session 201030 of user root.
>>>>>>>> Mar 4 10:33:01 s06 systemd: Starting Session 201030 of user root.
>>>>>>>> Mar 4 10:34:01 s06 systemd: Started Session 201031 of user root.
>>>>>>>> Mar 4 10:34:01 s06 systemd: Starting Session 201031 of user root.
>>>>>>>> Mar 4 10:35:01 s06 nrpe[162562]: Could not read request from client 192.168.1.56, bailing out...
>>>>>>>> Mar 4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
>>>>>>>> Mar 4 10:35:01 s06 systemd: Started Session 201032 of user root.
>>>>>>>> Mar 4 10:35:01 s06 systemd: Starting Session 201032 of user root.
>>>>>>>> Could you please help me understand what is happening?
>>>>>>>> Thank you in advance.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Mauro
>>>>>>>>
>>>>>>>> On 1 Mar 2019, at 12:17, Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>
>>>>>>>> Thank you, Milind.
>>>>>>>> I executed the instructions you suggested:
>>>>>>>>
>>>>>>>> - grep "blocked for" /var/log/messages on s06 returns no output (no "blocked" word is detected in the messages file);
>>>>>>>> - in the /var/log/messages file I can see this kind of error repeated many times:
>>>>>>>>
>>>>>>>> Mar 1 08:43:01 s06 systemd: Starting Session 196071 of user root.
>>>>>>>> Mar 1 08:43:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 1 08:43:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in libglusterfs.so.0.0.1[7f88e769b000+f7000]
>>>>>>>> Mar 1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 killed by SIGSEGV - dumping core
>>>>>>>> Mar 1 08:43:02 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar 1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
>>>>>>>> Mar 1 08:43:58 s06 abrt-server: Duplicate: UUID
>>>>>>>> Mar 1 08:43:58 s06 abrt-server: DUP_OF_DIR: /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>>>> Mar 1 08:43:58 s06 abrt-server: Deleting problem directory ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
>>>>>>>> Mar 1 08:43:58 s06 dbus[1872]: [system] Activating service name='org.freedesktop.problems' (using servicehelper)
>>>>>>>> Mar 1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 'org.freedesktop.problems'
>>>>>>>> Mar 1 08:43:58 s06 abrt-server: Generating core_backtrace
>>>>>>>> Mar 1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No such file or directory
>>>>>>>> Mar 1 08:43:58 s06 abrt-server: Cannot notify '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 'report_uReport' exited with 1
>>>>>>>> Mar 1 08:44:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 1 08:44:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 1 08:44:01 s06 systemd: Started Session 196072 of user root.
>>>>>>>> Mar 1 08:44:01 s06 systemd: Starting Session 196072 of user root.
>>>>>>>> Mar 1 08:44:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>>
>>>>>>>> - in the /var/log/messages file I can also see 4 errors related to other cluster servers:
>>>>>>>>
>>>>>>>> Mar 1 11:05:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 1 11:05:01 s06 systemd: Started Session 196230 of user root.
>>>>>>>> Mar 1 11:05:01 s06 systemd: Starting Session 196230 of user root.
>>>>>>>> Mar 1 11:05:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 1 11:05:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: server 192.168.0.51:49163 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 1 11:06:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 1 11:06:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 1 11:06:01 s06 systemd: Started Session 196231 of user root.
>>>>>>>> Mar 1 11:06:01 s06 systemd: Starting Session 196231 of user root.
>>>>>>>> Mar 1 11:06:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 1 11:06:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: server 192.168.0.52:49153 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: server 192.168.0.52:49155 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 1 11:07:01 s06 systemd: Created slice User Slice of root.
>>>>>>>> Mar 1 11:07:01 s06 systemd: Starting User Slice of root.
>>>>>>>> Mar 1 11:07:01 s06 systemd: Started Session 196232 of user root.
>>>>>>>> Mar 1 11:07:01 s06 systemd: Starting Session 196232 of user root.
>>>>>>>> Mar 1 11:07:01 s06 systemd: Removed slice User Slice of root.
>>>>>>>> Mar 1 11:07:01 s06 systemd: Stopping User Slice of root.
>>>>>>>> Mar 1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: server 192.168.0.51:49152 has not responded in the last 42 seconds, disconnecting.
>>>>>>>> Mar 1 11:08:01 s06 systemd: Created slice User Slice of root.
>>>>>>>>
>>>>>>>> No "blocked" word is in the /var/log/messages files on the other cluster servers.
>>>>>>>> In attachment, the /var/log/messages file from the s06 server.
>>>>>>>>
>>>>>>>> Thank you in advance,
>>>>>>>> Mauro
>>>>>>>>
>>>>>>>> <messages.zip>
>>>>>>>>
>>>>>>>> On 1 Mar 2019, at 11:47, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>
>>>>>>>> The traces of very high disk activity on the servers are often found in /var/log/messages.
>>>>>>>> You might want to grep for "blocked for" in /var/log/messages on s06 and correlate the timestamps to confirm the unresponsiveness reported in the gluster client logs.
>>>>>>>> In cases of high disk activity, although the operating system continues to respond to ICMP pings, the processes writing to disks often get blocked by a large flush to the disk, which can span beyond 42 seconds and hence result in ping-timer-expiry logs.
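[Editorial note] The kernel message being grepped for looks roughly like the following. The sample file below is synthetic (the task name and PID are made up for illustration); the real check runs against /var/log/messages on s06.

```shell
# Synthetic /var/log/messages excerpt containing a hung-task report.
cat > /tmp/messages.sample <<'EOF'
Mar  4 10:15:29 s06 kernel: INFO: task glusterfsd:20411 blocked for more than 120 seconds.
Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
EOF

# Same grep as suggested above, run against the sample file.
grep "blocked for" /tmp/messages.sample
# → Mar  4 10:15:29 s06 kernel: INFO: task glusterfsd:20411 blocked for more than 120 seconds.
```

A hit like this, with a timestamp matching a ping-timer-expiry entry in the gluster client logs, would point at disk I/O rather than the network.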
>>>>>>>> As a side note:
>>>>>>>> If you do find gluster processes being blocked in /var/log/messages, you might want to tweak the sysctl tunables vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller value than the existing one. Please read up more on those tunables before touching the settings.
>>>>>>>>
>>>>>>>> On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> in attachment the client log captured after changing the network.ping-timeout option.
>>>>>>>>> I noticed this error involving server 192.168.0.56 (s06):
>>>>>>>>>
>>>>>>>>> [2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 0-tier2-client-71: changing ping timeout to 42 (from 0)
>>>>>>>>> [2019-03-01 09:23:36.078213] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>>>>>>> [2019-03-01 09:23:36.078432] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>>>>>>> [2019-03-01 09:23:36.092357] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>>>>>>> [2019-03-01 09:23:36.094146] I [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
>>>>>>>>> [2019-03-01 10:06:24.708082] C [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: server 192.168.0.56:49156 has not responded in the last 42 seconds, disconnecting.
>>>>>>>>>
>>>>>>>>> I don't know why it happens; the s06 server seems to be reachable.
>>>>>>>>>
>>>>>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
>>>>>>>>> Trying 192.168.0.56...
>>>>>>>>> Connected to 192.168.0.56.
>>>>>>>>> Escape character is '^]'.
>>>>>>>>> ^CConnection closed by foreign host.
>>>>>>>>> [athena_login2][/users/home/sysm02/]> ping 192.168.0.56
>>>>>>>>> PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
>>>>>>>>> 64 bytes from 192.168.0.56: icmp_seq=1 ttl=64 time=0.116 ms
>>>>>>>>> 64 bytes from 192.168.0.56: icmp_seq=2 ttl=64 time=0.101 ms
>>>>>>>>>
>>>>>>>>> --- 192.168.0.56 ping statistics ---
>>>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 1528ms
>>>>>>>>> rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms
>>>>>>>>>
>>>>>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
>>>>>>>>> Trying 192.168.0.56...
>>>>>>>>> Connected to 192.168.0.56.
>>>>>>>>> Escape character is '^]'.
>>>>>>>>>
>>>>>>>>> Thank you for your help,
>>>>>>>>> Mauro
>>>>>>>>>
>>>>>>>>> On 1 Mar 2019, at 10:29, Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> thank you for the explanation.
>>>>>>>>> I just changed the network.ping-timeout option back to the default value (network.ping-timeout=42).
>>>>>>>>>
>>>>>>>>> I will check the logs to see if the errors appear again.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mauro
>>>>>>>>>
>>>>>>>>> On 1 Mar 2019, at 04:43, Milind Changire <mchan...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>> network.ping-timeout should not be set to zero for non-glusterd clients.
>>>>>>>>> glusterd is a special case for which ping-timeout is set to zero via /etc/glusterfs/glusterd.vol.
>>>>>>>>>
>>>>>>>>> Setting network.ping-timeout to zero disables arming of the ping timer for connections. This disables testing the connection for responsiveness and hence avoids proactive fail-over.
>>>>>>>>>
>>>>>>>>> Please reset network.ping-timeout to a non-zero positive value, e.g. 42.
>>>>>>>>>
>>>>>>>>> On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran <nbala...@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> Adding Raghavendra and Milind to comment on this.
>>>>>>>>>>
>>>>>>>>>> What is the effect of setting network.ping-timeout to 0, and should it be set back to 42?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Nithya
>>>>>>>>>>
>>>>>>>>>> On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Nithya,
>>>>>>>>>>>
>>>>>>>>>>> sorry for the late reply.
>>>>>>>>>>> network.ping-timeout had been set to 0 in order to try to solve some timeout problems, but it didn't help.
>>>>>>>>>>> I can set it to the default value.
>>>>>>>>>>>
>>>>>>>>>>> Can I proceed with the change?
>>>>>>>>>>>
>>>>>>>>>>> Thank you,
>>>>>>>>>>> Mauro
>>>>>>>>>>>
>>>>>>>>>>> On 28 Feb 2019, at 04:41, Nithya Balachandran <nbala...@redhat.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Mauro,
>>>>>>>>>>>
>>>>>>>>>>> Is network.ping-timeout still set to 0? The default value is 42. Is there a particular reason why this was changed?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Nithya
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Xavi,
>>>>>>>>>>>>
>>>>>>>>>>>> thank you for the detailed explanation and suggestions.
>>>>>>>>>>>> Yes, the transport.listen-backlog option is still set to 1024.
>>>>>>>>>>>>
>>>>>>>>>>>> I will check the network and connectivity status using "ping" and "telnet" as soon as the errors come back.
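[Editorial note] For reference, putting the option back could look like this on a gluster node. This is a sketch, not verified against this cluster: the volume name `tier2` is taken from the logs in this thread, and `gluster volume get` may behave differently across versions.

```shell
# Restore the default ping timer; 0 disables the ping timer and, with it,
# the proactive fail-over described above.
gluster volume set tier2 network.ping-timeout 42

# Confirm the effective value.
gluster volume get tier2 network.ping-timeout
```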
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Mauro
>>>>>>>>>>>>
>>>>>>>>>>>> On 27 Feb 2019, at 16:42, Xavi Hernandez <jaher...@redhat.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Mauro,
>>>>>>>>>>>>
>>>>>>>>>>>> those errors say that the mount point is not connected to some of the bricks while executing operations. I see references to the 3rd and 6th bricks of several disperse sets, which seem to map to server s06. For some reason, gluster is having trouble connecting from the client machine to that particular server. At the end of the log I see that, after a long time, a reconnect is done to both of them. However, shortly after, other bricks from s05 get disconnected and a reconnect times out.
>>>>>>>>>>>>
>>>>>>>>>>>> That's really odd. It seems as if server/communication is cut to s06 for some time, then restored, and then the same happens to the next server.
>>>>>>>>>>>>
>>>>>>>>>>>> If the servers are really online and it's only a communication issue, it explains why server memory and network usage have increased: if the problem only exists between the client and the servers, any write made by the client will automatically mark the file as damaged, since some of the servers have not been updated. Since self-heal runs from the server nodes, they will probably be correctly connected to all bricks, which allows them to heal the just-damaged file, which increases memory and network usage.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess you still have transport.listen-backlog set to 1024, right?
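If Xavi's hypothesis is right, the heal backlog on the server nodes should grow whenever the client writes with bricks unreachable. A dry-run sketch of the commands one could run on a server to confirm this; the volume name "tier2" is inferred from the log prefixes (e.g. 0-tier2-client-50), and the function only prints the commands so they can be reviewed before executing:

```shell
#!/bin/sh
# Dry run: print the gluster commands that would show the self-heal backlog
# and the per-brick client connections. Volume name "tier2" is inferred from
# the log prefixes in this thread; pipe the output to sh to actually run it
# on a server node.
heal_check_cmds() {
    vol=$1
    echo "gluster volume heal $vol info"
    echo "gluster volume status $vol clients"
}

heal_check_cmds tier2
```

Comparing the heal-info entry count before and after one of the disconnect episodes would show whether the memory/network growth coincides with self-heal repairing freshly damaged files.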
>>>>>>>>>>>>
>>>>>>>>>>>> Just to try to identify if the problem really comes from the network, can you check if you lose some pings from the client to all of the servers while you are seeing those errors in the log file?
>>>>>>>>>>>>
>>>>>>>>>>>> You can also check if, during those errors, you can telnet to the port of the brick from the client.
>>>>>>>>>>>>
>>>>>>>>>>>> Xavi
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Nithya,
>>>>>>>>>>>>>
>>>>>>>>>>>>> the "df -h" operation is no longer slow, but no users are using the volume; RAM and network usage is OK on the client node.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I was worried about this kind of warnings/errors:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-25 10:59:00.664323] W [MSGID: 122035] [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing operation with some subvolumes unavailable (20)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-26 03:13:03.313831] E [socket.c:2376:socket_connect_finish] 0-tier2-client-50:
>>>>>>>>>>>>> connection to 192.168.0.56:49156 failed (Timeout della connessione); disconnecting socket
>>>>>>>>>>>>>
>>>>>>>>>>>>> It seems that some subvolumes are not available and the 192.168.0.56 server (s06) is not reachable.
>>>>>>>>>>>>> But the gluster servers are up and running and the bricks are OK.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The updated tier2.log file is attached.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you.
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Mauro
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 26 Feb 2019, at 04:03, Nithya Balachandran <nbala...@redhat.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see a lot of EC messages in the log but they don't seem very serious. Xavi, can you take a look?
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only errors I see are:
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-25 10:58:45.519871] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (--> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 (xid=0xd26fe7)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-25 10:58:51.461493] E [rpc-clnt.c:350:saved_frames_unwind] (--> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] (--> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] (-->
>>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45] (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 (xid=0xf47d6a)
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2019-02-25 11:07:57.152874] E [socket.c:2376:socket_connect_finish] 0-tier2-client-70: connection to 192.168.0.55:49163 failed (Timeout della connessione); disconnecting socket
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is the df -h operation still slow? If yes, can you take a tcpdump of the client while running df -h and send that across?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Nithya
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <mauro.trid...@cmcc.it> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Sorry, a few minutes after my last mail, I noticed that the "df -h" command hanged for a while before returning the prompt.
>>>>>>>>>>>>>> Yesterday everything was OK in the gluster client log, but today I see a lot of errors (please take a look at the attached file).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On the client node, I detected significant RAM and NETWORK usage.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Do you think that the errors have been caused by the client resource usage?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>>>> Mauro
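The socket_connect_finish errors quoted above can be summarised per brick to see whether disconnects cluster on one server (s06) or move around. A small sketch assuming the client log format shown in the excerpts; the default log path is an assumption based on the volume name:

```shell
#!/bin/sh
# Summarise socket_connect_finish errors in a glusterfs client log:
# count connection failures per client translator and brick endpoint.
count_disconnects() {
    grep -E 'E \[socket\.c:[0-9]+:socket_connect_finish\]' "$1" \
        | sed -E 's/.* (0-[^:]+): connection to ([0-9.:]+) failed.*/\1 \2/' \
        | sort | uniq -c | sort -rn
}

# Default path is hypothetical; pass the real client log as the first argument.
LOG="${1:-/var/log/glusterfs/tier2.log}"
if [ -r "$LOG" ]; then
    count_disconnects "$LOG"
fi
```

If the counts are heavily skewed toward the bricks of one server, that supports the network-path explanation rather than a problem with the bricks themselves.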
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel