Re: [Gluster-devel] [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Mauro Tridici Mon, 18 Mar 2019 13:21:40 -0700

Hi Raghavendra,

I just changed the client option value to 8.
I will check the volume behaviour during the next hours.


The GlusterFS version is 3.12.14.

I will provide you the logs as soon as the activity load will be high.
Thank you,
Mauro

> On 14 Mar 2019, at 04:57, Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
> 
> 
> 
> On Wed, Mar 13, 2019 at 3:55 PM Mauro Tridici <mauro.trid...@cmcc.it 
> <mailto:mauro.trid...@cmcc.it>> wrote:
> Hi Raghavendra,
> 
> Yes, server.event-thread has been changed from 4 to 8.
> 
> Was client.event-thread value too changed to 8? If not, I would like to know 
> the results of including this tuning too. Also, if possible, can you get the 
> output of following command from problematic clients and bricks (during the 
> duration when load tends to be high and ping-timer-expiry is seen)?
> 
> # top -bHd 3
>  
> This will help us to know  CPU utilization of event-threads.
> 
> And I forgot to ask, what version of Glusterfs are you using?
> 
> During last days, I noticed that the error events are still here although 
> they have been considerably reduced.
> 
> So, I used grep command against the log files in order to provide you a 
> global vision about the warning, error and critical events appeared today at 
> 06:xx (may be useful I hope).
> I collected the info from s06 gluster server, but the behaviour is the the 
> almost the same on the other gluster servers.
> 
> ERRORS:  
> CWD: /var/log/glusterfs 
> COMMAND: grep " E " *.log |grep "2019-03-13 06:"
> 
> (I can see a lot of this kind of message in the same period but I'm notifying 
> you only one record for each type of error)
> 
> glusterd.log:[2019-03-13 06:12:35.982863] E [MSGID: 101042] 
> [compat.c:569:gf_umount_lazy] 0-management: Lazy unmount of 
> /var/run/gluster/tier2_quota_list/
> 
> glustershd.log:[2019-03-13 06:14:28.666562] E 
> [rpc-clnt.c:350:saved_frames_unwind] (--> 
> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x13b)[0x7f4a71ddcebb] (--> 
> /lib64/libgfr
> pc.so.0(saved_frames_unwind+0x1de)[0x7f4a71ba1d9e] (--> 
> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7f4a71ba1ebe] (--> 
> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup
> +0x90)[0x7f4a71ba3640] (--> 
> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7f4a71ba4130] ))))) 
> 0-tier2-client-55: forced unwinding frame type(GlusterFS 3.3) op(INODELK(29)) 
> called at 2019-03-13 06:14:14.858441 (xid=0x17fddb50) 
> 
> glustershd.log:[2019-03-13 06:17:48.883825] E 
> [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 
> 192.168.0.55:49158 <http://192.168.0.55:49158/> failed (Connection timed 
> out); disco
> nnecting socket
> glustershd.log:[2019-03-13 06:19:58.931798] E 
> [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 
> 192.168.0.55:49158 <http://192.168.0.55:49158/> failed (Connection timed 
> out); disco
> nnecting socket
> glustershd.log:[2019-03-13 06:22:08.979829] E 
> [socket.c:2376:socket_connect_finish] 0-tier2-client-55: connection to 
> 192.168.0.55:49158 <http://192.168.0.55:49158/> failed (Connection timed 
> out); disco
> nnecting socket
> glustershd.log:[2019-03-13 06:22:36.226847] E [MSGID: 114031] 
> [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote 
> operation failed [Transport endpoint 
> is not connected]
> glustershd.log:[2019-03-13 06:22:36.306669] E [MSGID: 114031] 
> [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote 
> operation failed [Transport endpoint 
> is not connected]
> glustershd.log:[2019-03-13 06:22:36.385257] E [MSGID: 114031] 
> [client-rpc-fops.c:1508:client3_3_inodelk_cbk] 0-tier2-client-55: remote 
> operation failed [Transport endpoint 
> is not connected]
> 
> WARNINGS:
> CWD: /var/log/glusterfs 
> COMMAND: grep " W " *.log |grep "2019-03-13 06:"
> 
> (I can see a lot of this kind of message in the same period but I'm notifying 
> you only one record for each type of warnings)
> 
> glustershd.log:[2019-03-13 06:14:28.666772] W [MSGID: 114031] 
> [client-rpc-fops.c:1080:client3_3_getxattr_cbk] 0-tier2-client-55: remote 
> operation failed. Path: <gfid:b6b35d
> 0f-f34d-4c25-bbe8-74bde0248d7e> (b6b35d0f-f34d-4c25-bbe8-74bde0248d7e). Key: 
> (null) [Transport endpoint is not connected]
> 
> glustershd.log:[2019-03-13 06:14:31.421576] W [MSGID: 122035] 
> [ec-common.c:571:ec_child_select] 0-tier2-disperse-9: Executing operation 
> with some subvolumes unavailable (2)
> 
> glustershd.log:[2019-03-13 06:15:31.547417] W [MSGID: 122032] 
> [ec-heald.c:266:ec_shd_index_sweep] 0-tier2-disperse-9: unable to get 
> index-dir on tier2-client-55 [Operation 
> now in progress]
> 
> quota-mount-tier2.log:[2019-03-13 06:12:36.116277] W [MSGID: 101002] 
> [options.c:995:xl_opt_validate] 0-glusterfs: option 'address-family' is 
> deprecated, preferred is 'trans
> port.address-family', continuing with correction
> quota-mount-tier2.log:[2019-03-13 06:12:36.198430] W [MSGID: 101174] 
> [graph.c:363:_log_if_unknown_option] 0-tier2-readdir-ahead: option 
> 'parallel-readdir' is not recognized
> quota-mount-tier2.log:[2019-03-13 06:12:37.945007] W 
> [glusterfsd.c:1375:cleanup_and_exit] (-->/lib64/libpthread.so.0(+0x7e25) 
> [0x7f340892be25] -->/usr/sbin/glusterfs(gluste
> rfs_sigwaiter+0xe5) [0x55ef010164b5] 
> -->/usr/sbin/glusterfs(cleanup_and_exit+0x6b) [0x55ef0101632b] ) 0-: received 
> signum (15), shutting down
> 
> CRITICALS:
> CWD: /var/log/glusterfs 
> COMMAND: grep " C " *.log |grep "2019-03-13 06:"
> 
> no critical errors at 06:xx
> only one critical error during the day
> 
> [root@s06 glusterfs]# grep " C " *.log |grep "2019-03-13"
> glustershd.log:[2019-03-13 02:21:29.126279] C 
> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: server 
> 192.168.0.55:49158 <http://192.168.0.55:49158/> has not responded in the last 
> 42 seconds, disconnecting.
> 
> 
> Thank you very much for your help.
> Regards,
> Mauro
> 
>> On 12 Mar 2019, at 05:17, Raghavendra Gowdappa <rgowd...@redhat.com 
>> <mailto:rgowd...@redhat.com>> wrote:
>> 
>> Was the suggestion to increase server.event-thread values tried? If yes, 
>> what were the results?
>> 
>> On Mon, Mar 11, 2019 at 2:40 PM Mauro Tridici <mauro.trid...@cmcc.it 
>> <mailto:mauro.trid...@cmcc.it>> wrote:
>> Dear All,
>> 
>> do you have any suggestions about the right way to "debug" this issue?
>> In attachment, the updated logs of “s06" gluster server.
>> 
>> I noticed a lot of intermittent warning and error messages.
>> 
>> Thank you in advance,
>> Mauro
>> 
>> 
>> 
>>> On 4 Mar 2019, at 18:45, Raghavendra Gowdappa <rgowd...@redhat.com 
>>> <mailto:rgowd...@redhat.com>> wrote:
>>> 
>>> 
>>> +Gluster Devel <mailto:gluster-devel@gluster.org>, +Gluster-users 
>>> <mailto:gluster-us...@gluster.org> 
>>> 
>>> I would like to point out another issue. Even if what I suggested prevents 
>>> disconnects, part of the solution would be only symptomatic treatment and 
>>> doesn't address the root cause of the problem. In most of the 
>>> ping-timer-expiry issues, the root cause is the increased load on bricks 
>>> and the inability of bricks to be responsive under high load. So, the 
>>> actual solution would be doing any or both of the following:
>>> * identify the source of increased load and if possible throttle it. 
>>> Internal heal processes like self-heal, rebalance, quota heal are known to 
>>> pump traffic into bricks without much throttling (io-threads _might_ do 
>>> some throttling, but my understanding is its not sufficient).
>>> * identify the reason for bricks to become unresponsive during load. This 
>>> may be fixable issues like not enough event-threads to read from network or 
>>> difficult to fix issues like fsync on backend fs freezing the process or 
>>> semi fixable issues (in code) like lock contention.
>>> 
>>> So any genuine effort to fix ping-timer-issues (to be honest most of the 
>>> times they are not issues related to rpc/network) would involve performance 
>>> characterization of various subsystems on bricks and clients. Various 
>>> subsystems can include (but not necessarily limited to), underlying 
>>> OS/filesystem, glusterfs processes, CPU consumption etc
>>> 
>>> regards,
>>> Raghavendra
>>> 
>>> On Mon, Mar 4, 2019 at 9:31 PM Mauro Tridici <mauro.trid...@cmcc.it 
>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>> Thank you, let’s try!
>>> I will inform you about the effects of the change.
>>> 
>>> Regards,
>>> Mauro
>>> 
>>>> On 4 Mar 2019, at 16:55, Raghavendra Gowdappa <rgowd...@redhat.com 
>>>> <mailto:rgowd...@redhat.com>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On Mon, Mar 4, 2019 at 8:54 PM Mauro Tridici <mauro.trid...@cmcc.it 
>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>> Hi Raghavendra,
>>>> 
>>>> thank you for your reply.
>>>> Yes, you are right. It is a problem that seems to happen randomly.
>>>> At this moment, server.event-threads value is 4. I will try to increase 
>>>> this value to 8. Do you think that it could be a valid value ?
>>>> 
>>>> Yes. We can try with that. You should see at least frequency of ping-timer 
>>>> related disconnects  reduce with this value (even if it doesn't eliminate 
>>>> the problem completely).
>>>> 
>>>> 
>>>> Regards,
>>>> Mauro
>>>> 
>>>> 
>>>>> On 4 Mar 2019, at 15:36, Raghavendra Gowdappa <rgowd...@redhat.com 
>>>>> <mailto:rgowd...@redhat.com>> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Mon, Mar 4, 2019 at 8:01 PM Nithya Balachandran <nbala...@redhat.com 
>>>>> <mailto:nbala...@redhat.com>> wrote:
>>>>> Hi Mauro,
>>>>> 
>>>>> It looks like some problem on s06. Are all your other nodes ok? Can you 
>>>>> send us the gluster logs from this node?
>>>>> 
>>>>> @Raghavendra G <mailto:raghaven...@gluster.com> , do you have any idea as 
>>>>> to how this can be debugged? Maybe running top ? Or debug brick logs?
>>>>> 
>>>>> If we can reproduce the problem, collecting tcpdump on both ends of 
>>>>> connection will help. But, one common problem is these bugs are 
>>>>> inconsistently reproducible and hence we may not be able to capture 
>>>>> tcpdump at correct intervals. Other than that, we can try to collect some 
>>>>> evidence that poller threads were busy (waiting on locks). But, not sure 
>>>>> what debug data provides that information.
>>>>> 
>>>>> From what I know, its difficult to collect evidence for this issue and we 
>>>>> could only reason about it.
>>>>> 
>>>>> We can try a workaround though - try increasing server.event-threads and 
>>>>> see whether ping-timer expiry issues go away with an optimal value. If 
>>>>> that's the case, it kind of provides proof for our hypothesis.
>>>>> 
>>>>> 
>>>>> 
>>>>> Regards,
>>>>> Nithya
>>>>> 
>>>>> On Mon, 4 Mar 2019 at 15:25, Mauro Tridici <mauro.trid...@cmcc.it 
>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>> Hi All,
>>>>> 
>>>>> some minutes ago I received this message from NAGIOS server 
>>>>> 
>>>>> ***** Nagios *****
>>>>> 
>>>>> Notification Type: PROBLEM
>>>>> 
>>>>> Service: Brick - /gluster/mnt2/brick
>>>>> Host: s06
>>>>> Address: s06-stg
>>>>> State: CRITICAL
>>>>> 
>>>>> Date/Time: Mon Mar 4 10:25:33 CET 2019
>>>>> 
>>>>> Additional Info:
>>>>> CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
>>>>> 
>>>>> I checked the network, RAM and CPUs usage on s06 node and everything 
>>>>> seems to be ok.
>>>>> No bricks are in error state. In /var/log/messages, I detected again a 
>>>>> crash of “check_vol_utili” that I think it is a module used by NRPE 
>>>>> executable (that is the NAGIOS client).
>>>>> 
>>>>> Mar  4 10:15:29 s06 kernel: traps: check_vol_utili[161224] general 
>>>>> protection ip:7facffa0a66d sp:7ffe9f4e6fc0 error:0 in 
>>>>> libglusterfs.so.0.0.1[7facff9b7000+f7000]
>>>>> Mar  4 10:15:29 s06 abrt-hook-ccpp: Process 161224 (python2.7) of user 0 
>>>>> killed by SIGSEGV - dumping core
>>>>> Mar  4 10:15:29 s06 abrt-server: Generating core_backtrace
>>>>> Mar  4 10:15:29 s06 abrt-server: Error: Unable to open './coredump': No 
>>>>> such file or directory
>>>>> Mar  4 10:16:01 s06 systemd: Created slice User Slice of root.
>>>>> Mar  4 10:16:01 s06 systemd: Starting User Slice of root.
>>>>> Mar  4 10:16:01 s06 systemd: Started Session 201010 of user root.
>>>>> Mar  4 10:16:01 s06 systemd: Starting Session 201010 of user root.
>>>>> Mar  4 10:16:01 s06 systemd: Removed slice User Slice of root.
>>>>> Mar  4 10:16:01 s06 systemd: Stopping User Slice of root.
>>>>> Mar  4 10:16:24 s06 abrt-server: Duplicate: UUID
>>>>> Mar  4 10:16:24 s06 abrt-server: DUP_OF_DIR: 
>>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>> Mar  4 10:16:24 s06 abrt-server: Deleting problem directory 
>>>>> ccpp-2019-03-04-10:15:29-161224 (dup of ccpp-2018-09-25-12:27:42-13041)
>>>>> Mar  4 10:16:24 s06 abrt-server: Generating core_backtrace
>>>>> Mar  4 10:16:24 s06 abrt-server: Error: Unable to open './coredump': No 
>>>>> such file or directory
>>>>> Mar  4 10:16:24 s06 abrt-server: Cannot notify 
>>>>> '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 
>>>>> 'report_uReport' exited with 1
>>>>> Mar  4 10:16:24 s06 abrt-hook-ccpp: Process 161391 (python2.7) of user 0 
>>>>> killed by SIGABRT - dumping core
>>>>> Mar  4 10:16:25 s06 abrt-server: Generating core_backtrace
>>>>> Mar  4 10:16:25 s06 abrt-server: Error: Unable to open './coredump': No 
>>>>> such file or directory
>>>>> Mar  4 10:17:01 s06 systemd: Created slice User Slice of root.
>>>>> 
>>>>> Also, I noticed the following errors that I think are very critical:
>>>>> 
>>>>> Mar  4 10:21:12 s06 glustershd[20355]: [2019-03-04 09:21:12.954798] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-55: 
>>>>> server 192.168.0.55:49158 <http://192.168.0.55:49158/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:22:01 s06 systemd: Created slice User Slice of root.
>>>>> Mar  4 10:22:01 s06 systemd: Starting User Slice of root.
>>>>> Mar  4 10:22:01 s06 systemd: Started Session 201017 of user root.
>>>>> Mar  4 10:22:01 s06 systemd: Starting Session 201017 of user root.
>>>>> Mar  4 10:22:01 s06 systemd: Removed slice User Slice of root.
>>>>> Mar  4 10:22:01 s06 systemd: Stopping User Slice of root.
>>>>> Mar  4 10:22:03 s06 glustershd[20355]: [2019-03-04 09:22:03.964120] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: 
>>>>> server 192.168.0.54:49165 <http://192.168.0.54:49165/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:23:01 s06 systemd: Created slice User Slice of root.
>>>>> Mar  4 10:23:01 s06 systemd: Starting User Slice of root.
>>>>> Mar  4 10:23:01 s06 systemd: Started Session 201018 of user root.
>>>>> Mar  4 10:23:01 s06 systemd: Starting Session 201018 of user root.
>>>>> Mar  4 10:23:02 s06 systemd: Removed slice User Slice of root.
>>>>> Mar  4 10:23:02 s06 systemd: Stopping User Slice of root.
>>>>> Mar  4 10:24:01 s06 systemd: Created slice User Slice of root.
>>>>> Mar  4 10:24:01 s06 systemd: Starting User Slice of root.
>>>>> Mar  4 10:24:01 s06 systemd: Started Session 201019 of user root.
>>>>> Mar  4 10:24:01 s06 systemd: Starting Session 201019 of user root.
>>>>> Mar  4 10:24:01 s06 systemd: Removed slice User Slice of root.
>>>>> Mar  4 10:24:01 s06 systemd: Stopping User Slice of root.
>>>>> Mar  4 10:24:03 s06 glustershd[20355]: [2019-03-04 09:24:03.982502] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-16: 
>>>>> server 192.168.0.52:49158 <http://192.168.0.52:49158/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746109] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-3: 
>>>>> server 192.168.0.51:49153 <http://192.168.0.51:49153/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746215] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: 
>>>>> server 192.168.0.52:49156 <http://192.168.0.52:49156/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746260] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-21: 
>>>>> server 192.168.0.51:49159 <http://192.168.0.51:49159/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746296] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: 
>>>>> server 192.168.0.52:49161 <http://192.168.0.52:49161/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:05 s06 quotad[20374]: [2019-03-04 09:24:05.746413] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-60: 
>>>>> server 192.168.0.54:49165 <http://192.168.0.54:49165/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:07 s06 glustershd[20355]: [2019-03-04 09:24:07.982952] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-45: 
>>>>> server 192.168.0.54:49155 <http://192.168.0.54:49155/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:18 s06 glustershd[20355]: [2019-03-04 09:24:18.990929] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-25: 
>>>>> server 192.168.0.52:49161 <http://192.168.0.52:49161/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:24:31 s06 glustershd[20355]: [2019-03-04 09:24:31.995781] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-20: 
>>>>> server 192.168.0.53:49159 <http://192.168.0.53:49159/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:25:01 s06 systemd: Created slice User Slice of root.
>>>>> Mar  4 10:25:01 s06 systemd: Starting User Slice of root.
>>>>> Mar  4 10:25:01 s06 systemd: Started Session 201020 of user root.
>>>>> Mar  4 10:25:01 s06 systemd: Starting Session 201020 of user root.
>>>>> Mar  4 10:25:01 s06 systemd: Removed slice User Slice of root.
>>>>> Mar  4 10:25:01 s06 systemd: Stopping User Slice of root.
>>>>> Mar  4 10:25:57 s06 systemd: Created slice User Slice of root.
>>>>> Mar  4 10:25:57 s06 systemd: Starting User Slice of root.
>>>>> Mar  4 10:25:57 s06 systemd-logind: New session 201021 of user root.
>>>>> Mar  4 10:25:57 s06 systemd: Started Session 201021 of user root.
>>>>> Mar  4 10:25:57 s06 systemd: Starting Session 201021 of user root.
>>>>> Mar  4 10:26:01 s06 systemd: Started Session 201022 of user root.
>>>>> Mar  4 10:26:01 s06 systemd: Starting Session 201022 of user root.
>>>>> Mar  4 10:26:21 s06 nrpe[162388]: Error: Could not complete SSL handshake 
>>>>> with 192.168.1.56 <http://192.168.1.56/>: 5
>>>>> Mar  4 10:27:01 s06 systemd: Started Session 201023 of user root.
>>>>> Mar  4 10:27:01 s06 systemd: Starting Session 201023 of user root.
>>>>> Mar  4 10:28:01 s06 systemd: Started Session 201024 of user root.
>>>>> Mar  4 10:28:01 s06 systemd: Starting Session 201024 of user root.
>>>>> Mar  4 10:29:01 s06 systemd: Started Session 201025 of user root.
>>>>> Mar  4 10:29:01 s06 systemd: Starting Session 201025 of user root.
>>>>> 
>>>>> But, unfortunately, I don’t understand why it is happening.
>>>>> Now, NAGIOS server shows that s06 status is ok:
>>>>> 
>>>>> ***** Nagios *****
>>>>> 
>>>>> Notification Type: RECOVERY
>>>>> 
>>>>> Service: Brick - /gluster/mnt2/brick
>>>>> Host: s06
>>>>> Address: s06-stg
>>>>> State: OK
>>>>> 
>>>>> Date/Time: Mon Mar 4 10:35:23 CET 2019
>>>>> 
>>>>> Additional Info:
>>>>> OK: Brick /gluster/mnt2/brick is up
>>>>> 
>>>>> Nothing is changed from RAM, CPUs, and NETWORK point of view.
>>>>> /var/log/message file has been updated:
>>>>> 
>>>>> Mar  4 10:32:01 s06 systemd: Starting Session 201029 of user root.
>>>>> Mar  4 10:32:30 s06 glustershd[20355]: [2019-03-04 09:32:30.069082] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-10: 
>>>>> server 192.168.0.52:49156 <http://192.168.0.52:49156/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:32:55 s06 glustershd[20355]: [2019-03-04 09:32:55.074689] C 
>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-66: 
>>>>> server 192.168.0.54:49167 <http://192.168.0.54:49167/> has not responded 
>>>>> in the last 42 seconds, disconnecting.
>>>>> Mar  4 10:33:01 s06 systemd: Started Session 201030 of user root.
>>>>> Mar  4 10:33:01 s06 systemd: Starting Session 201030 of user root.
>>>>> Mar  4 10:34:01 s06 systemd: Started Session 201031 of user root.
>>>>> Mar  4 10:34:01 s06 systemd: Starting Session 201031 of user root.
>>>>> Mar  4 10:35:01 s06 nrpe[162562]: Could not read request from client 
>>>>> 192.168.1.56, bailing out...
>>>>> Mar  4 10:35:01 s06 nrpe[162562]: INFO: SSL Socket Shutdown.
>>>>> Mar  4 10:35:01 s06 systemd: Started Session 201032 of user root.
>>>>> Mar  4 10:35:01 s06 systemd: Starting Session 201032 of user root.
>>>>> 
>>>>> Could you please help me to understand what it’s happening ?
>>>>> Thank you in advance.
>>>>> 
>>>>> Rergards,
>>>>> Mauro
>>>>> 
>>>>> 
>>>>>> On 1 Mar 2019, at 12:17, Mauro Tridici <mauro.trid...@cmcc.it 
>>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>> 
>>>>>> 
>>>>>> Thank you, Milind.
>>>>>> I executed the instructions you suggested:
>>>>>> 
>>>>>> - grep “blocked for” /var/log/messages on s06 returns no output (no 
>>>>>> “blocked” word is detected in messages file);
>>>>>> - in /var/log/messages file I can see this kind of error repeated for a 
>>>>>> lot of times:
>>>>>> 
>>>>>> Mar  1 08:43:01 s06 systemd: Starting Session 196071 of user root.
>>>>>> Mar  1 08:43:01 s06 systemd: Removed slice User Slice of root.
>>>>>> Mar  1 08:43:01 s06 systemd: Stopping User Slice of root.
>>>>>> Mar  1 08:43:02 s06 kernel: traps: check_vol_utili[57091] general 
>>>>>> protection ip:7f88e76ee66d sp:7ffe5a5bcc30 error:0 in 
>>>>>> libglusterfs.so.0.0.1[7f88e769b000+f7000]
>>>>>> Mar  1 08:43:02 s06 abrt-hook-ccpp: Process 57091 (python2.7) of user 0 
>>>>>> killed by SIGSEGV - dumping core
>>>>>> Mar  1 08:43:02 s06 abrt-server: Generating core_backtrace
>>>>>> Mar  1 08:43:02 s06 abrt-server: Error: Unable to open './coredump': No 
>>>>>> such file or directory
>>>>>> Mar  1 08:43:58 s06 abrt-server: Duplicate: UUID
>>>>>> Mar  1 08:43:58 s06 abrt-server: DUP_OF_DIR: 
>>>>>> /var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041
>>>>>> Mar  1 08:43:58 s06 abrt-server: Deleting problem directory 
>>>>>> ccpp-2019-03-01-08:43:02-57091 (dup of ccpp-2018-09-25-12:27:42-13041)
>>>>>> Mar  1 08:43:58 s06 dbus[1872]: [system] Activating service 
>>>>>> name='org.freedesktop.problems' (using servicehelper)
>>>>>> Mar  1 08:43:58 s06 dbus[1872]: [system] Successfully activated service 
>>>>>> 'org.freedesktop.problems'
>>>>>> Mar  1 08:43:58 s06 abrt-server: Generating core_backtrace
>>>>>> Mar  1 08:43:58 s06 abrt-server: Error: Unable to open './coredump': No 
>>>>>> such file or directory
>>>>>> Mar  1 08:43:58 s06 abrt-server: Cannot notify 
>>>>>> '/var/tmp/abrt/ccpp-2018-09-25-12:27:42-13041' via uReport: Event 
>>>>>> 'report_uReport' exited with 1
>>>>>> Mar  1 08:44:01 s06 systemd: Created slice User Slice of root.
>>>>>> Mar  1 08:44:01 s06 systemd: Starting User Slice of root.
>>>>>> Mar  1 08:44:01 s06 systemd: Started Session 196072 of user root.
>>>>>> Mar  1 08:44:01 s06 systemd: Starting Session 196072 of user root.
>>>>>> Mar  1 08:44:01 s06 systemd: Removed slice User Slice of root.
>>>>>> 
>>>>>> - in /var/log/messages file I can see also 4 errors related to other 
>>>>>> cluster servers:
>>>>>> 
>>>>>> Mar  1 11:05:01 s06 systemd: Starting User Slice of root.
>>>>>> Mar  1 11:05:01 s06 systemd: Started Session 196230 of user root.
>>>>>> Mar  1 11:05:01 s06 systemd: Starting Session 196230 of user root.
>>>>>> Mar  1 11:05:01 s06 systemd: Removed slice User Slice of root.
>>>>>> Mar  1 11:05:01 s06 systemd: Stopping User Slice of root.
>>>>>> Mar  1 11:05:59 s06 glustershd[70117]: [2019-03-01 10:05:59.347094] C 
>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-33: 
>>>>>> server 192.168.0.51:49163 <http://192.168.0.51:49163/> has not responded 
>>>>>> in the last 42 seconds, disconnecting.
>>>>>> Mar  1 11:06:01 s06 systemd: Created slice User Slice of root.
>>>>>> Mar  1 11:06:01 s06 systemd: Starting User Slice of root.
>>>>>> Mar  1 11:06:01 s06 systemd: Started Session 196231 of user root.
>>>>>> Mar  1 11:06:01 s06 systemd: Starting Session 196231 of user root.
>>>>>> Mar  1 11:06:01 s06 systemd: Removed slice User Slice of root.
>>>>>> Mar  1 11:06:01 s06 systemd: Stopping User Slice of root.
>>>>>> Mar  1 11:06:12 s06 glustershd[70117]: [2019-03-01 10:06:12.351319] C 
>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-1: 
>>>>>> server 192.168.0.52:49153 <http://192.168.0.52:49153/> has not responded 
>>>>>> in the last 42 seconds, disconnecting.
>>>>>> Mar  1 11:06:38 s06 glustershd[70117]: [2019-03-01 10:06:38.356920] C 
>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-7: 
>>>>>> server 192.168.0.52:49155 <http://192.168.0.52:49155/> has not responded 
>>>>>> in the last 42 seconds, disconnecting.
>>>>>> Mar  1 11:07:01 s06 systemd: Created slice User Slice of root.
>>>>>> Mar  1 11:07:01 s06 systemd: Starting User Slice of root.
>>>>>> Mar  1 11:07:01 s06 systemd: Started Session 196232 of user root.
>>>>>> Mar  1 11:07:01 s06 systemd: Starting Session 196232 of user root.
>>>>>> Mar  1 11:07:01 s06 systemd: Removed slice User Slice of root.
>>>>>> Mar  1 11:07:01 s06 systemd: Stopping User Slice of root.
>>>>>> Mar  1 11:07:36 s06 glustershd[70117]: [2019-03-01 10:07:36.366259] C 
>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-0: 
>>>>>> server 192.168.0.51:49152 <http://192.168.0.51:49152/> has not responded 
>>>>>> in the last 42 seconds, disconnecting.
>>>>>> Mar  1 11:08:01 s06 systemd: Created slice User Slice of root.
>>>>>> 
>>>>>> No “blocked” word is in /var/log/messages files on other cluster servers.
>>>>>> In attachment, the /var/log/messages file from s06 server.
>>>>>> 
>>>>>> Thank you in advance,
>>>>>> Mauro
>>>>>> 
>>>>>> <messages.zip>
>>>>>> 
>>>>>> 
>>>>>>> On 1 Mar 2019, at 11:47, Milind Changire <mchan...@redhat.com 
>>>>>>> <mailto:mchan...@redhat.com>> wrote:
>>>>>>> 
>>>>>>> The traces of very high disk activity on the servers are often found in 
>>>>>>> /var/log/messages
>>>>>>> You might want to grep for "blocked for" in /var/log/messages on s06 
>>>>>>> and correlate the timestamps to confirm the unresponsiveness as 
>>>>>>> reported in gluster client logs.
>>>>>>> In cases of high disk activity, although the operating system continues 
>>>>>>> to respond to ICMP pings, the processes writing to disks often get 
>>>>>>> blocked to a large flush to the disk which could span beyond 42 seconds 
>>>>>>> and hence result in ping-timer-expiry logs.
>>>>>>> 
>>>>>>> As a side note:
>>>>>>> If you indeed find gluster processes being blocked in 
>>>>>>> /var/log/messages, you might want to tweak sysctl tunables called 
>>>>>>> vm.dirty_background_ratio or vm.dirty_background_bytes to a smaller 
>>>>>>> value than the existing. Please read up more on those tunables before 
>>>>>>> touching the settings.
>>>>>>> 
>>>>>>> 
>>>>>>> On Fri, Mar 1, 2019 at 4:06 PM Mauro Tridici <mauro.trid...@cmcc.it 
>>>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>>> 
>>>>>>> Hi all,
>>>>>>> 
>>>>>>> in attachment the client log captured after changing 
>>>>>>> network.ping-timeout option.
>>>>>>> I noticed this error involving server 192.168.0.56 (s06)
>>>>>>> 
>>>>>>> [2019-03-01 09:23:36.077287] I [rpc-clnt.c:1962:rpc_clnt_reconfig] 
>>>>>>> 0-tier2-client-71: changing ping timeout to 42 (from 0)
>>>>>>> [2019-03-01 09:23:36.078213] I 
>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in 
>>>>>>> volfile,continuing
>>>>>>> [2019-03-01 09:23:36.078432] I 
>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in 
>>>>>>> volfile,continuing
>>>>>>> [2019-03-01 09:23:36.092357] I 
>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in 
>>>>>>> volfile,continuing
>>>>>>> [2019-03-01 09:23:36.094146] I 
>>>>>>> [glusterfsd-mgmt.c:1894:mgmt_getspec_cbk] 0-glusterfs: No change in 
>>>>>>> volfile,continuing
>>>>>>> [2019-03-01 10:06:24.708082] C 
>>>>>>> [rpc-clnt-ping.c:166:rpc_clnt_ping_timer_expired] 0-tier2-client-50: 
>>>>>>> server 192.168.0.56:49156 <http://192.168.0.56:49156/> has not 
>>>>>>> responded in the last 42 seconds, disconnecting.
>>>>>>> 
>>>>>>> I don’t know why it happens, s06 server seems to be reachable.
>>>>>>> 
>>>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
>>>>>>> Trying 192.168.0.56...
>>>>>>> Connected to 192.168.0.56.
>>>>>>> Escape character is '^]'.
>>>>>>> ^CConnection closed by foreign host.
>>>>>>> [athena_login2][/users/home/sysm02/]> ping 192.168.0.56
>>>>>>> PING 192.168.0.56 (192.168.0.56) 56(84) bytes of data.
>>>>>>> 64 bytes from 192.168.0.56 <http://192.168.0.56/>: icmp_seq=1 ttl=64 
>>>>>>> time=0.116 ms
>>>>>>> 64 bytes from 192.168.0.56 <http://192.168.0.56/>: icmp_seq=2 ttl=64 
>>>>>>> time=0.101 ms
>>>>>>> 
>>>>>>> --- 192.168.0.56 ping statistics ---
>>>>>>> 2 packets transmitted, 2 received, 0% packet loss, time 1528ms
>>>>>>> rtt min/avg/max/mdev = 0.101/0.108/0.116/0.012 ms
>>>>>>> 
>>>>>>> [athena_login2][/users/home/sysm02/]> telnet 192.168.0.56 49156
>>>>>>> Trying 192.168.0.56...
>>>>>>> Connected to 192.168.0.56.
>>>>>>> Escape character is '^]'.
>>>>>>> 
>>>>>>> Thank you for your help,
>>>>>>> Mauro
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> On 1 Mar 2019, at 10:29, Mauro Tridici <mauro.trid...@cmcc.it 
>>>>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>>>> 
>>>>>>>> Hi all,
>>>>>>>> 
>>>>>>>> thank you for the explanation.
>>>>>>>> I just changed network.ping-timeout option to default value 
>>>>>>>> (network.ping-timeout=42).
>>>>>>>> 
>>>>>>>> I will check the logs to see if the errors will appear again.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> Mauro
>>>>>>>> 
>>>>>>>>> On 1 Mar 2019, at 04:43, Milind Changire <mchan...@redhat.com 
>>>>>>>>> <mailto:mchan...@redhat.com>> wrote:
>>>>>>>>> 
>>>>>>>>> network.ping-timeout should not be set to zero for non-glusterd 
>>>>>>>>> clients.
>>>>>>>>> glusterd is a special case for which ping-timeout is set to zero via 
>>>>>>>>> /etc/glusterfs/glusterd.vol
>>>>>>>>> 
>>>>>>>>> Setting network.ping-timeout to zero disables arming of the ping 
>>>>>>>>> timer for connections. This disables testing the connection for 
>>>>>>>>> responsiveness and hence avoids proactive fail-over.
>>>>>>>>> 
>>>>>>>>> Please reset network.ping-timeout to a non-zero positive value, eg. 42
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Feb 28, 2019 at 5:07 PM Nithya Balachandran 
>>>>>>>>> <nbala...@redhat.com <mailto:nbala...@redhat.com>> wrote:
>>>>>>>>> Adding Raghavendra and Milind to comment on this.
>>>>>>>>> 
>>>>>>>>> What is the effect of setting network.ping-timeout to 0 and should it 
>>>>>>>>> be set back to 42?
>>>>>>>>> Regards,
>>>>>>>>> Nithya
>>>>>>>>> 
>>>>>>>>> On Thu, 28 Feb 2019 at 16:01, Mauro Tridici <mauro.trid...@cmcc.it 
>>>>>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>>>>> Hi Nithya,
>>>>>>>>> 
>>>>>>>>> sorry for the late.
>>>>>>>>> network.ping-timeout has been set to 0 in order to try to solve some 
>>>>>>>>> timeout problems, but it didn’t help.
>>>>>>>>> I can set it to the default value. 
>>>>>>>>> 
>>>>>>>>> Can I proceed with the change?
>>>>>>>>> 
>>>>>>>>> Thank you,
>>>>>>>>> Mauro
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On 28 Feb 2019, at 04:41, Nithya Balachandran <nbala...@redhat.com 
>>>>>>>>>> <mailto:nbala...@redhat.com>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Mauro,
>>>>>>>>>> 
>>>>>>>>>> Is network.ping-timeout still set to 0. The default value is 42. Is 
>>>>>>>>>> there a particular reason why this was changed?
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Nithya
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Wed, 27 Feb 2019 at 21:32, Mauro Tridici <mauro.trid...@cmcc.it 
>>>>>>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Xavi,
>>>>>>>>>> 
>>>>>>>>>> thank you for the detailed explanation and suggestions.
>>>>>>>>>> Yes, transport.listen-backlog option is still set to 1024.
>>>>>>>>>> 
>>>>>>>>>> I will check the network and connectivity status using “ping” and 
>>>>>>>>>> “telnet” as soon as the errors will come back again.
>>>>>>>>>> 
>>>>>>>>>> Regards,
>>>>>>>>>> Mauro
>>>>>>>>>> 
>>>>>>>>>>> Il giorno 27 feb 2019, alle ore 16:42, Xavi Hernandez 
>>>>>>>>>>> <jaher...@redhat.com <mailto:jaher...@redhat.com>> ha scritto:
>>>>>>>>>>> 
>>>>>>>>>>> Hi Mauro,
>>>>>>>>>>> 
>>>>>>>>>>> those errors say that the mount point is not connected to some of 
>>>>>>>>>>> the bricks while executing operations. I see references to 3rd and 
>>>>>>>>>>> 6th bricks of several disperse sets, which seem to map to server 
>>>>>>>>>>> s06. For some reason, gluster is having troubles connecting from 
>>>>>>>>>>> the client machine to that particular server. At the end of the log 
>>>>>>>>>>> I see that after long time a reconnect is done to both of them. 
>>>>>>>>>>> However little after, other bricks from the s05 get disconnected 
>>>>>>>>>>> and a reconnect times out.
>>>>>>>>>>> 
>>>>>>>>>>> That's really odd. It seems like if server/communication is cut to 
>>>>>>>>>>> s06 for some time, then restored, and then the same happens to the 
>>>>>>>>>>> next server.
>>>>>>>>>>> 
>>>>>>>>>>> If the servers are really online and it's only a communication 
>>>>>>>>>>> issue, it explains why server memory and network has increased: if 
>>>>>>>>>>> the problem only exists between the client and servers, any write 
>>>>>>>>>>> made by the client will automatically mark the file as damaged, 
>>>>>>>>>>> since some of the servers have not been updated. Since self-heal 
>>>>>>>>>>> runs from the server nodes, they will probably be correctly 
>>>>>>>>>>> connected to all bricks, which allows them to heal the just damaged 
>>>>>>>>>>> file, which increases memory and network usage.
>>>>>>>>>>> 
>>>>>>>>>>> I guess you still have transport.listen-backlog set to 1024, right ?
>>>>>>>>>>> 
>>>>>>>>>>> Just to try to identify if the problem really comes from network, 
>>>>>>>>>>> can you check if you lose some pings from the client to all of the 
>>>>>>>>>>> servers while you are seeing those errors in the log file ?
>>>>>>>>>>> 
>>>>>>>>>>> You can also check if during those errors, you can telnet to the 
>>>>>>>>>>> port of the brick from the client.
>>>>>>>>>>> 
>>>>>>>>>>> Xavi
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Feb 26, 2019 at 10:17 AM Mauro Tridici 
>>>>>>>>>>> <mauro.trid...@cmcc.it <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>>>>>>> Hi Nithya,
>>>>>>>>>>> 
>>>>>>>>>>> “df -h” operation is not still slow, but no users are using the 
>>>>>>>>>>> volume, RAM and NETWORK usage is ok on the client node.
>>>>>>>>>>> 
>>>>>>>>>>> I was worried about this kind of warnings/errors:
>>>>>>>>>>> 
>>>>>>>>>>> [2019-02-25 10:59:00.664323] W [MSGID: 122035] 
>>>>>>>>>>> [ec-common.c:571:ec_child_select] 0-tier2-disperse-6: Executing 
>>>>>>>>>>> operation with some subvolumes unavailable (20)
>>>>>>>>>>> 
>>>>>>>>>>> [2019-02-26 03:11:35.212603] E [rpc-clnt.c:350:saved_frames_unwind] 
>>>>>>>>>>> (--> 
>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3] 
>>>>>>>>>>> (--> 
>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] 
>>>>>>>>>>> (--> 
>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] 
>>>>>>>>>>> (--> 
>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>  (--> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] 
>>>>>>>>>>> ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 3.3) 
>>>>>>>>>>> op(WRITE(13)) called at 2019-02-26 03:10:56.549903 (xid=0x106f1c5)
>>>>>>>>>>> 
>>>>>>>>>>> [2019-02-26 03:13:03.313831] E 
>>>>>>>>>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-50: connection 
>>>>>>>>>>> to 192.168.0.56:49156 <http://192.168.0.56:49156/> failed (Timeout 
>>>>>>>>>>> della connessione); disconnecting socket
>>>>>>>>>>> 
>>>>>>>>>>> It seems that some subvolumes are not available and 192.168.0.56 
>>>>>>>>>>> server (s06) is not reachable.
>>>>>>>>>>> But gluster servers are up&running and bricks are ok.
>>>>>>>>>>> 
>>>>>>>>>>> In attachment the updated tier2.log file.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Thank you.
>>>>>>>>>>> Regards,
>>>>>>>>>>> Mauro
>>>>>>>>>>> 
>>>>>>>>>>>> Il giorno 26 feb 2019, alle ore 04:03, Nithya Balachandran 
>>>>>>>>>>>> <nbala...@redhat.com <mailto:nbala...@redhat.com>> ha scritto:
>>>>>>>>>>>> 
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> 
>>>>>>>>>>>> I see a lot of EC messages in the log but they don't seem very 
>>>>>>>>>>>> serious. Xavi, can you take a look?
>>>>>>>>>>>> 
>>>>>>>>>>>> The only errors I see are:
>>>>>>>>>>>> [2019-02-25 10:58:45.519871] E 
>>>>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (--> 
>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3]
>>>>>>>>>>>>  (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] 
>>>>>>>>>>>> (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] 
>>>>>>>>>>>> (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>>  (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] 
>>>>>>>>>>>> ))))) 0-tier2-client-50: forced unwinding frame type(GlusterFS 
>>>>>>>>>>>> 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.429969 
>>>>>>>>>>>> (xid=0xd26fe7)
>>>>>>>>>>>> [2019-02-25 10:58:51.461493] E 
>>>>>>>>>>>> [rpc-clnt.c:350:saved_frames_unwind] (--> 
>>>>>>>>>>>> /usr/lib64/libglusterfs.so.0(_gf_log_callingfn+0x153)[0x3d0cc2f2e3]
>>>>>>>>>>>>  (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_unwind+0x1e5)[0x3d0d410935] 
>>>>>>>>>>>> (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x3d0d410a7e] 
>>>>>>>>>>>> (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xa5)[0x3d0d410b45]
>>>>>>>>>>>>  (--> 
>>>>>>>>>>>> /usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x278)[0x3d0d410e68] 
>>>>>>>>>>>> ))))) 0-tier2-client-41: forced unwinding frame type(GlusterFS 
>>>>>>>>>>>> 3.3) op(WRITE(13)) called at 2019-02-25 10:57:47.499174 
>>>>>>>>>>>> (xid=0xf47d6a)
>>>>>>>>>>>> [2019-02-25 11:07:57.152874] E 
>>>>>>>>>>>> [socket.c:2376:socket_connect_finish] 0-tier2-client-70: 
>>>>>>>>>>>> connection to 192.168.0.55:49163 <http://192.168.0.55:49163/> 
>>>>>>>>>>>> failed (Timeout della connessione); disconnecting socket
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Is the df -h operation still slow? If yes, can you take a tcpdump 
>>>>>>>>>>>> of the client while running df -h and send that across?
>>>>>>>>>>>> 
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Nithya
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, 25 Feb 2019 at 17:27, Mauro Tridici <mauro.trid...@cmcc.it 
>>>>>>>>>>>> <mailto:mauro.trid...@cmcc.it>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>> Sorry, some minutes after my last mail message, I noticed that “df 
>>>>>>>>>>>> -h” command hanged for a while before returns the prompt.
>>>>>>>>>>>> Yesterday, everything was ok in the gluster client log, but, 
>>>>>>>>>>>> today, I see a lot of errors (please, take a look to the attached 
>>>>>>>>>>>> file).
>>>>>>>>>>>> 
>>>>>>>>>>>> On the client node, I detected an important RAM e NETWORK usage.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> Do you think that the errors have been caused by the client 
>>>>>>>>>>>> resources usage?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thank you in advance,
>>>>>>>>>>>> Mauro
>>>>>>>>>>>> 
>>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
>

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-devel

Re: [Gluster-devel] [Gluster-users] "rpc_clnt_ping_timer_expired" errors

Reply via email to