Just to add maybe a helpful observation: either "cib" or "pengine" process goes to ~100% CPU when this remote nodes errors happen.
On Tue, Sep 27, 2016 at 2:36 PM, Radoslaw Garbacz < radoslaw.garb...@xtremedatainc.com> wrote: > Hi, > > I encountered the same problem with pacemaker built from github at around > August 22. > > Remote nodes go offline occasionally and stay so, their logs show same > errors. The cluster is on AWS ec2 instances, the network works and is an > unlikely reason. > > Have there be any commits on github recently (after August 22) addressing > this issue? > > > Logs: > [...] > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_abort: crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_remote_header: Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_abort: crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_remote_header: Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_abort: crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_remote_header: Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher. > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > ipc_proxy_remove_provider: ipc proxy connection for client > ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster > node disconnected. > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > cancel_recurring_action: Cancelling ocf operation > monitor_all_monitor_191000 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_send_tls: Connection terminated rc = -53 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_send_tls: Connection terminated rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > crm_remote_send: Failed to send remote msg, rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: error: > lrmd_tls_send_msg: Failed to send remote lrmd tls msg, rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: warning: > send_client_notify: Notification of client > remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22 > failed > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: notice: > lrmd_remote_client_destroy: LRMD client disconnecting remote client > - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296- > 725735d17e22 > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted: error: > ipc_proxy_accept: No ipc providers available for uid 0 gid 0 > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted: error: > handle_new_connection: Error in connection setup (19626-21815-14): > Remote I/O error (121) > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted: error: > ipc_proxy_accept: No ipc providers available for uid 0 gid 0 > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted: error: > handle_new_connection: Error in connection setup (19626-21815-14): > Remote I/O error (121) > [...] > > > > > On Thu, Jun 9, 2016 at 12:24 AM, Narayanamoorthy Srinivasan < > narayanamoort...@gmail.com> wrote: > >> Don't see any issues in network traffic. >> >> Some more logs where the XML tags are incomplete: >> >> 2016-06-09T03:06:03.096449+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="fs-postgresql_last_0" operation_key="fs-postgresql_stop_0" >> operation="stop" crm-debug-origin="do_update_resource" >> crm_feature_set="3.0.10" transition-key="225:116:0:8fbf >> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;225:116: >> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd" >> call-id="489" rc-code="0" op-status="0" interval="0" last-run="1459491026" >> last-rc-change="1459491026" exec-time="158" queue-time="0" >> op-digest="dfb0c861 >> 2016-06-09T03:06:03.097136+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="fs-postgresql_last_failure_0" >> operation_key="fs-postgresql_monitor_0" operation="monitor" >> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" >> transition-key="41:4:7:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" >> transition-magic="0:0;41:4:7:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" >> on_node="d00-50-56-94-24-dd" call-id="5" rc-code="0" op-status="0" >> interval="0" last-run="1459429072" last-rc-change="1459429072" >> exec-time="315" queue-time="0" op-digest="df >> 2016-06-09T03:06:03.097361+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="fs-postgresql_monitor_10000" >> operation_key="fs-postgresql_monitor_10000" operation="monitor" >> crm-debug-origin="do_update_resource" crm_feature_set="3.0.10" >> transition-key="224:107:0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" >> transition-magic="0:0;224:107:0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" >> on_node="d00-50-56-94-24-dd" call-id="365" rc-code="0" op-status="0" >> interval="10000" last-rc-change="1459490849" exec-time="185" queue-time="0" >> op-digest="cd8d3642c >> 2016-06-09T03:06:03.097582+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> </lrm_resource> >> 2016-06-09T03:06:03.097690+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_resource id="vip-admin-database-default-proposal-controller" >> type="IPaddr2" class="ocf" provider="heartbeat"> >> 2016-06-09T03:06:03.097797+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="vip-admin-database-default-proposal-controller_last_0" >> operation_key="vip-admin-database-default-proposal-controller_stop_0" >> operation="stop" crm-debug-origin="do_update_resource" >> crm_feature_set="3.0.10" transition-key="228:116:0:8fbf >> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;228:116: >> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd" >> call-id="487" rc-code="0" op-status="0" interval="0" last-run="1459491026" >> last-rc-chan >> 2016-06-09T03:06:03.098013+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="vip-admin-database-default-proposal-controller_monitor_10000" >> operation_key="vip-admin-database-default-proposal-controller_monitor_10000" >> operation="monitor" crm-debug-origin="do_update_resource" >> crm_feature_set="3.0.10" transition-key="227:107:0:8fbf >> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;227:107: >> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd" >> call-id="369" rc-code="0" op-status="0" interval="10000" last-rc-chang >> 2016-06-09T03:06:03.098230+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> </lrm_resource> >> 2016-06-09T03:06:03.098337+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_resource id="postgresql" type="pgsql" class="ocf" provider="heartbeat"> >> 2016-06-09T03:06:03.098468+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="postgresql_last_0" operation_key="postgresql_stop_0" >> operation="stop" crm-debug-origin="do_update_resource" >> crm_feature_set="3.0.10" transition-key="231:116:0:8fbf >> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;231:116: >> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd" >> call-id="481" rc-code="0" op-status="0" interval="0" last-run="1459491025" >> last-rc-change="1459491025" exec-time="1334" queue-time="0" >> op-digest="f2317cad3d54c >> 2016-06-09T03:06:03.099061+05:30 d18-fb-7b-18-f1-8e >> pacemaker_remoted[6153]: error: Partial >> <lrm_rsc_op id="postgresql_monitor_10000" >> operation_key="postgresql_monitor_10000" >> operation="monitor" crm-debug-origin="do_update_resource" >> crm_feature_set="3.0.10" transition-key="230:107:0:8fbf >> 83fd-241b-4623-8bbe-31d92e4dfce1" transition-magic="0:0;230:107: >> 0:8fbf83fd-241b-4623-8bbe-31d92e4dfce1" on_node="d00-50-56-94-24-dd" >> call-id="372" rc-code="0" op-status="0" interval="10000" >> last-rc-change="1459490852" exec-time="424" queue-time="0" >> op-digest="873ed4f07792aa8 >> >> >> On Wed, Jun 8, 2016 at 10:28 PM, Narayanamoorthy Srinivasan < >> narayanamoort...@gmail.com> wrote: >> >>> No recent network changes. Will check for abnormal traffic using >>> wireshark. >>> >>> I also notice that the XML lines are partial (no ending '>', closing " >>> and sometimes partial words) in logs. Any lines > 472 characters are >>> truncated to 472 characters. Wondering is it due to anyother limitations. >>> >>> I can post some line tomorrow when i am back to work. >>> >>> >>> On Wed, Jun 8, 2016 at 8:00 PM, Ken Gaillot <kgail...@redhat.com> wrote: >>> >>>> On 06/08/2016 06:14 AM, Narayanamoorthy Srinivasan wrote: >>>> > I have a pacemaker cluster with two pacemaker remote nodes. Recently >>>> the >>>> > remote nodes started throwing below errors and SDB started >>>> self-fencing. >>>> > Appreciate if someone throws light on what could be the issue and the >>>> fix. >>>> > >>>> > OS - SLES 12 SP1 >>>> > Pacemaker Remote version - pacemaker-remote-1.1.13-14.7.x86_64 >>>> > >>>> > 2016-06-08T14:11:46.009073+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: Entity: line 1: parser >>>> > error : AttValue: ' expected >>>> > 2016-06-08T14:11:46.009314+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>>> > crm-debug-origin="do_update_ >>>> > 2016-06-08T14:11:46.009443+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > ^ >>>> > 2016-06-08T14:11:46.009567+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: Entity: line 1: parser >>>> > error : attributes construct error >>>> > 2016-06-08T14:11:46.009697+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>>> > crm-debug-origin="do_update_ >>>> > 2016-06-08T14:11:46.009824+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > ^ >>>> > 2016-06-08T14:11:46.009948+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: Entity: line 1: parser >>>> > error : Couldn't find end of Start Tag lrm_rsc_op line 1 >>>> > 2016-06-08T14:11:46.010070+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>>> > crm-debug-origin="do_update_ >>>> > 2016-06-08T14:11:46.010191+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > ^ >>>> > 2016-06-08T14:11:46.010460+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: Entity: line 1: parser >>>> > error : Premature end of data in tag lrm_resource line 1 >>>> > 2016-06-08T14:11:46.010718+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > key="neutron-ha-tool_monitor_0" operation="monitor" >>>> > crm-debug-origin="do_update_ >>>> > 2016-06-08T14:11:46.010977+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: >>>> > ^ >>>> > 2016-06-08T14:11:46.011234+05:30 d18-fb-7b-18-f1-8e >>>> > pacemaker_remoted[6190]: error: XML Error: Entity: line 1: parser >>>> > error : Premature end of data in tag lrm_resources line 1 >>>> > >>>> > >>>> > -- >>>> > Thanks & Regards >>>> > Moorthy >>>> >>>> This sounds like the network traffic between the cluster nodes and the >>>> remote nodes is being corrupted. Have there been any network changes >>>> lately? Switch/firewall/etc. equipment/settings? MTU? >>>> >>>> You could try using a packet sniffer such as wireshark to see if the >>>> traffic looks abnormal in some way. The payload is XML so it should be >>>> more or less readable. >>>> >>>> >>>> _______________________________________________ >>>> Users mailing list: Users@clusterlabs.org >>>> http://clusterlabs.org/mailman/listinfo/users >>>> >>>> Project Home: http://www.clusterlabs.org >>>> Getting started: http://www.clusterlabs.org/doc >>>> /Cluster_from_Scratch.pdf >>>> Bugs: http://bugs.clusterlabs.org >>>> >>> >>> >>> >>> -- >>> Thanks & Regards >>> Moorthy >>> >> >> >> >> -- >> Thanks & Regards >> Moorthy >> >> _______________________________________________ >> Users mailing list: Users@clusterlabs.org >> http://clusterlabs.org/mailman/listinfo/users >> >> Project Home: http://www.clusterlabs.org >> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf >> Bugs: http://bugs.clusterlabs.org >> >> > > > -- > Best Regards, > > Radoslaw Garbacz > XtremeData Incorporation > -- Best Regards, Radoslaw Garbacz XtremeData Incorporation
_______________________________________________ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org