Re: [Gluster-infra] [Gluster-devel] Update on georep failure
On Tuesday, 2 February 2021 at 21:06 +0200, Yaniv Kaul wrote:
> On Tue, Feb 2, 2021 at 8:14 PM Michael Scherer wrote:
>
> > Hi,
> >
> > so we finally found the cause of the georep failure, after several
> > days of work from Deepshika and I.
> >
> > Short story:
> >
> > side effect of adding libtirpc-devel on EL 7:
> > https://github.com/gluster/project-infrastructure/issues/115
>
> Looking at
> https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/191
> - we weren't supposed to use it?
> From
> https://github.com/gluster/glusterfs/blob/d1d7a6f35c816822fab51c820e25023863c239c1/glusterfs.spec.in#L61
> :
> # Do not use libtirpc on EL6, it does not have xdr_uint64_t() and xdr_uint32_t
> # Do not use libtirpc on EL7, it does not have xdr_sizeof()
> %if ( 0%{?rhel} && 0%{?rhel} <= 7 )
> %global _without_libtirpc --without-libtirpc
> %endif
>
> CentOS 7 has an ancient version, CentOS 8 has a newer version, so
> perhaps just one CentOS 8 slaves?

Fine for me for C8, but if libtirpc on EL7 is missing a function (or
more), how come the code compiles without trouble and then fails at run
time in a rather non-obvious way?

Looking at
https://build.gluster.org/job/gh_centos7-regression/773/consoleFull
it says:

13:53:23 Use TIRPC: yes

--
Michael Scherer / He/Il/Er/Él
Sysadmin, Community Infrastructure

___
Gluster-infra mailing list
Gluster-infra@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-infra
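For anyone who wants to check other regression runs for the same configure result: a minimal sketch, assuming the Jenkins console text has already been saved locally and that the summary line looks like the "Use TIRPC: yes" excerpt quoted in this message. The function name tirpc_enabled is made up for illustration and is not part of any gluster tooling.

```python
import re

# Matches lines such as "13:53:23 Use TIRPC: yes".
# The leading HH:MM:SS timestamp (added by Jenkins) is treated as optional.
_TIRPC_LINE = re.compile(r"^(?:\d{2}:\d{2}:\d{2}\s+)?Use TIRPC\s*:\s*(\S+)\s*$")


def tirpc_enabled(console_text):
    """Return True/False from the 'Use TIRPC' summary line, or None if absent."""
    for line in console_text.splitlines():
        match = _TIRPC_LINE.match(line.strip())
        if match:
            return match.group(1).lower() == "yes"
    return None
```

Returning None when the line is missing keeps "configure did not report anything" distinct from "configure reported no".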
Re: [Gluster-infra] [Gluster-devel] Update on georep failure
On Tue, Feb 2, 2021 at 8:14 PM Michael Scherer wrote:

> Hi,
>
> so we finally found the cause of the georep failure, after several days
> of work from Deepshika and I.
>
> Short story:
>
> side effect of adding libtirpc-devel on EL 7:
> https://github.com/gluster/project-infrastructure/issues/115

Looking at
https://github.com/gluster/glusterfs-patch-acceptance-tests/pull/191
- we weren't supposed to use it?
From
https://github.com/gluster/glusterfs/blob/d1d7a6f35c816822fab51c820e25023863c239c1/glusterfs.spec.in#L61
:

# Do not use libtirpc on EL6, it does not have xdr_uint64_t() and xdr_uint32_t
# Do not use libtirpc on EL7, it does not have xdr_sizeof()
%if ( 0%{?rhel} && 0%{?rhel} <= 7 )
%global _without_libtirpc --without-libtirpc
%endif

CentOS 7 has an ancient version, CentOS 8 has a newer version, so
perhaps just one CentOS 8 slave?

Y.
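The spec file comment says the EL7 libtirpc lacks xdr_sizeof(). One way to check that directly on a builder is to ask the dynamic loader whether the installed library exports the symbol. This is only a sketch: the helper name library_exports is hypothetical, it assumes the library can be located by ctypes.util.find_library under the short name "tirpc", and what it reports on EL7 has not been verified here.

```python
import ctypes
import ctypes.util


def library_exports(library, symbol):
    """Report whether a shared library exports a symbol.

    library: short name as accepted by ctypes.util.find_library (e.g. "tirpc"),
             or None to search the symbols already loaded in this process.
    Returns True or False, or None if the library cannot be found or loaded.
    """
    path = None
    if library is not None:
        path = ctypes.util.find_library(library)
        if path is None:
            return None
    try:
        handle = ctypes.CDLL(path)
    except OSError:
        return None
    # Attribute lookup on a CDLL object performs a symbol lookup and raises
    # AttributeError when the symbol is missing, so hasattr() is sufficient.
    return hasattr(handle, symbol)
```

On a builder this would be called as library_exports("tirpc", "xdr_sizeof"). A None result means libtirpc was not found at all, which is a different situation from the symbol being absent.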
[Gluster-infra] Update on georep failure
Hi,

so we finally found the cause of the georep failure, after several days
of work from Deepshika and me.

Short story:

side effect of adding libtirpc-devel on EL 7:
https://github.com/gluster/project-infrastructure/issues/115

Long story:
===========

So we were first puzzled over why it was failing on just some builders
and not others, especially since it was working fine on softserve VMs.

We tried to look for the usual suspects, rebooted, reinstalled, searched
for anything weird (too many ssh keys, not enough inodes, some hardware
issue), but found nothing obvious.

After trying to find my way through the log files and a few weird leads
(like, why was gsyncd running gcc? (answer: ctypes)), I was left with a
rather cryptic message:

[2021-02-02 15:19:00.040817 +] I [socket.c:929:__socket_server_bind] 0-socket.gfchangelog: closing (AF_UNIX) reuse check socket 18
[2021-02-02 15:19:02.041641 +] W [xdr-rpcclnt.c:68:rpc_request_to_xdr] 0-rpc: failed to encode call msg
[2021-02-02 15:19:02.041673 +] E [rpc-clnt.c:1507:rpc_clnt_record_build_record] 0-gfchangelog: Failed to build record header
[2021-02-02 15:19:02.041683 +] W [rpc-clnt.c:1664:rpc_clnt_submit] 0-gfchangelog: cannot build rpc-record
[2021-02-02 15:19:02.041692 +] E [MSGID: 132023] [gf-changelog.c:285:gf_changelog_setup_rpc] 0-gfchangelog: Could not initiate probe RPC, bailing out!!!
[2021-02-02 15:19:02.041809 +] E [MSGID: 132022] [gf-changelog.c:583:gf_changelog_register_generic] 0-gfchangelog: Error registering with changelog xlator

Given that all of gluster is built around RPC, it seemed unlikely that
RPC itself was broken, but those were the only messages we had.

We also found that the only builder that was working was builder 210.
Upon looking, we found that 210 had failed to be updated with ansible,
due to some debugging we forgot to revert, which made this task fail:

https://github.com/gluster/gluster.org_ansible_configuration/blob/master/roles/gluster_qa_scripts/tasks/main.yml#L7

But it wasn't clear how that would change anything, since the only diff
was a "set -e" that wasn't removed.

Then Deepshika started to test more than georep, and she noticed that a
lot of other tests were failing, with the exact same message about rpc.

And she started to wonder if anything had changed recently. And indeed:

# rpm -qa --last | head -n 15
yum-plugin-auto-update-debug-info-1.1.31-54.el7_8.noarch  mar. 02 févr. 2021 14:04:59 UTC
python3-debuginfo-3.6.8-18.el7.x86_64                     mar. 02 févr. 2021 14:04:58 UTC
glibc-debuginfo-2.17-317.el7.x86_64                       mar. 02 févr. 2021 14:04:57 UTC
glibc-debuginfo-common-2.17-317.el7.x86_64                mar. 02 févr. 2021 14:04:53 UTC
gpg-pubkey-b6792c39-53c4fbdd                              mar. 02 févr. 2021 14:04:34 UTC
tzdata-java-2021a-1.el7.noarch                            mer. 27 janv. 2021 09:09:27 UTC
tzdata-2021a-1.el7.noarch                                 mer. 27 janv. 2021 09:09:26 UTC
sudo-1.8.23-10.el7_9.1.x86_64                             mer. 27 janv. 2021 09:09:26 UTC
libtirpc-devel-0.2.4-0.16.el7.x86_64                      mar. 26 janv. 2021 12:53:45 UTC
java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64           mar. 26 janv. 2021 05:06:44 UTC

We added libtirpc-devel on the 26/01.

libtirpc-devel would, as the name implies, change something around the
rpc subsystem.

That was last week, around the time we started to notice the problem.

It was not applied to 210, because 210 failed before it got to that
point (since ansible stops as soon as the git update fails, and the
jenkins builder role comes after the gluster-qa-script update).

It was not applied to the softserve-provided VMs either, so tests were
working fine there.

And indeed, once the package got removed, the tests were working again.

Follow up
=========

So, I would like to know exactly what should be tested. Is gluster not
compatible with libtirpc on C7 (as it works on C8), or is there some
weird issue? (Because from what I remember, the RPC format is supposed
to be compatible and covered by a specification.)

Should we test on C8 only?

--
Michael Scherer / He/Il/Er/Él
Sysadmin, Community Infrastructure
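The step that finally exposed the culprit was comparing a working builder (210) against the failing ones. A minimal sketch of that comparison as a set difference over "rpm -qa" output, assuming the output of each builder has been captured into a string beforehand; the function name package_diff is hypothetical and not part of the existing ansible or QA scripts.

```python
def package_diff(working_rpm_qa, failing_rpm_qa):
    """Compare two 'rpm -qa' listings, one package NVRA per line.

    Returns (only_on_working, only_on_failing) as sorted lists.
    """
    working = {line.strip() for line in working_rpm_qa.splitlines() if line.strip()}
    failing = {line.strip() for line in failing_rpm_qa.splitlines() if line.strip()}
    return sorted(working - failing), sorted(failing - working)
```

Applied to builder 210 and any failing builder, the second list is where a package such as libtirpc-devel would show up, since it was installed everywhere except on 210 and the softserve VMs.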