Re: [ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch
Hi, I am sorry for the long delay, just I moved on with my tool and had to narrow down the circumstances of this error. So apparently this error is related to many remote nodes querying CIB (I had an OCF monitor agent running on all nodes), once turned off it works with ~100 nodes. Parts of the logs from one of the remote nodes, which went offline: [...] Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:569 ) trace: crm_remote_recv_once:non-blocking, exiting read: Resource temporarily unavailable (-11) Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:588 ) trace: crm_remote_recv_once:Read less than the advertised length: 622632 < 639978 bytes Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:644 ) debug: crm_remote_recv:recv() failed: Resource temporarily unavailable (-11) Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:625 ) trace: crm_remote_recv:waiting to receive remote msg, starting timeout 6, remaining_timeout 1000 Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:566 ) trace: crm_remote_recv_once:Received 16384 more bytes, 639016 total Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:588 ) trace: crm_remote_recv_once:Read less than the advertised length: 639016 < 639978 bytes Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:644 ) debug: crm_remote_recv:recv() failed: Resource temporarily unavailable (-11) Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( utils.c:1026 ) error: crm_abort:crm_remote_parse_buffer: Forked child 15923 to record non-fatal assert at remote.c:426 : remote->buffer[sizeof(struct crm_remote_header_v0) + header->payload_uncompressed - 1] == 0 Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2668 ) error: crm_xml_err:XML Error: Entity: line 1: parser error : AttValue: ' expected Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2668 ) error: crm_xml_err:XML Error: Entity: line 1: parser error : attributes construct error Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2668 ) error: crm_xml_err:XML Error: Entity: line 1: parser error : Couldn't find end of Start Tag lrm_rsc_op line 1 Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2668 ) error: crm_xml_err:XML Error: Entity: line 1: parser error : Premature end of data in tag lrm_resource line 1 [...] Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2731 ) warning: string2xml:Parsing failed (domain=1, level=3, code=77): Premature end of data in tag lrmd_ipc_proxy line 1 Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2738 ) error: string2xml:Couldn't fully parse 638976 chars: Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial [...] Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( xml.c:2740 ) error: string2xml:Partial Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( strings.c:348 )info: crm_compress_string:Compressed 638868 bytes into 20773 (ratio 30:1) in 112ms Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:361 ) trace: crm_remote_send:Sending len[0]=40, start=6d726c3c Oct 18 16:02:03 [10504] ip-10-203-186-119 pacemaker_remoted: ( remote.c:237 ) trace: crm_send_tls:Mes
Re: [ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch
On 30/09/16 11:28 -0500, Radoslaw Garbacz wrote: > I have posted a question about this error attached to another thread, but > because it was old and there is no answer I thought it could have been > missed, so I am sorry for repeating it. > > Regarding the problem. > I have a cluster, and when the cluster gets bigger (around 40 remote nodes) > some remote nodes go offline after a while and their logs report some > message errors, there is no indication about anything wrong in the other > logs. I believe I would have a plausible explanation provided it may happen (not sure now, perhaps the ipc proxy setup would allow it) that two messages via the same connection are transmitted, with the second one being read as part of the first one. Could you please try running pacemaker_remoted with "PCMK_trace_files=remote.c" in the respective "sysconfig" file? > Details: > - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote > - maybe irrelevant, but either "cib" or "pengine" process goes to ~100% CPU > - it does not happen immediately > - smaller cluster (~20 remote nodes) does not show any problems > - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64 > - corosync: 2.4.1-1.2.0da1.el6.x86_64 > - libqb-1.0.0-1.28.4dff.el6.x86_64 > - CentOS 6 > > Logs: > > [...] > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_header:Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_header:Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_abort:crm_remote_header: Triggered assert at remote.c:119 : > endian == ENDIAN_LOCAL > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_header:Invalid message detected, endian mismatch: > badadbbd is neither 63646330 nor the swab'd 30636463 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher. > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > ipc_proxy_remove_provider:ipc proxy connection for client > ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster > node disconnected. > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: > cancel_recurring_action: Cancelling ocf operation > monitor_all_monitor_191000 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_send_tls: Connection terminated rc = -53 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_send_tls: Connection terminated rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > crm_remote_send: Failed to send remote msg, rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: > lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10 > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: warning: > send_client_notify: Notification of client > remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22 > failed > Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: notice: > lrmd_remote_client_destroy: LRMD client disconnecting remote client - > name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296- > 725735d17e22 > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: > ipc_proxy_accept: No ipc providers available for uid 0 gid 0 > Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: > handle_new_connection:Error in connection setup (19626-21815-14): > Remote I/O error (121) > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: > ipc_proxy_accept: No ipc providers available for uid 0 gid 0 > Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: > handle_new_connection:Error in connection setup (19626-21815-14): > Remote I/O error (121) > [...] -- Jan (Poki) pgp4MIpK8j5hh.pgp Description: PGP signature ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org
[ClusterLabs] Pacemaker remote - invalid message detected, endian mismatch
Hi, I have posted a question about this error attached to another thread, but because it was old and there is no answer I thought it could have been missed, so I am sorry for repeating it. Regarding the problem. I have a cluster, and when the cluster gets bigger (around 40 remote nodes) some remote nodes go offline after a while and their logs report some message errors, there is no indication about anything wrong in the other logs. Details: - 40 ec2 m3.xlarge nodes, 1 corosync ring member, 39 remote - maybe irrelevant, but either "cib" or "pengine" process goes to ~100% CPU - it does not happen immediately - smaller cluster (~20 remote nodes) does not show any problems - pacemaker: 1.1.15-1.1f8e642.git.el6.x86_64 - corosync: 2.4.1-1.2.0da1.el6.x86_64 - libqb-1.0.0-1.28.4dff.el6.x86_64 - CentOS 6 Logs: [...] Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_abort:crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_header:Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_abort:crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_header:Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_abort:crm_remote_header: Triggered assert at remote.c:119 : endian == ENDIAN_LOCAL Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_header:Invalid message detected, endian mismatch: badadbbd is neither 63646330 nor the swab'd 30636463 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: lrmd_remote_client_msg: Client disconnect detected in tls msg dispatcher. Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: ipc_proxy_remove_provider:ipc proxy connection for client ca8df213-6da7-4c42-8cb3-b8bc0887f2ce pid 21815 destroyed because cluster node disconnected. Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: info: cancel_recurring_action: Cancelling ocf operation monitor_all_monitor_191000 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_send_tls: Connection terminated rc = -53 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_send_tls: Connection terminated rc = -10 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: crm_remote_send: Failed to send remote msg, rc = -10 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted:error: lrmd_tls_send_msg:Failed to send remote lrmd tls msg, rc = -10 Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: warning: send_client_notify: Notification of client remote-lrmd-ip-10-237-223-67:3121/b6034d3a-e296-492f-b296-725735d17e22 failed Sep 27 17:18:31 [19626] ip-10-237-223-67 pacemaker_remoted: notice: lrmd_remote_client_destroy: LRMD client disconnecting remote client - name: remote-lrmd-ip-10-237-223-67:3121 id: b6034d3a-e296-492f-b296- 725735d17e22 Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0 Sep 27 17:19:35 [19626] ip-10-237-223-67 pacemaker_remoted:error: handle_new_connection:Error in connection setup (19626-21815-14): Remote I/O error (121) Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: ipc_proxy_accept: No ipc providers available for uid 0 gid 0 Sep 27 17:19:50 [19626] ip-10-237-223-67 pacemaker_remoted:error: handle_new_connection:Error in connection setup (19626-21815-14): Remote I/O error (121) [...] -- Best Regards, Radoslaw Garbacz XtremeData Incorporation ___ Users mailing list: Users@clusterlabs.org http://clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org