Hi Vinícius,

I don't have masses to contribute on this, other than that we generally disable NetworkManager (why on earth would I want my server dynamically reconfiguring its network?), but I am curious about your bonding set-up.

Am I right in thinking that you are setting up a bond for the primary nic (the one xCAT talks to)? It looks like the values for 'ip' and 'nicips.bond0' are the same. I never got that to work with the xCAT-supplied postscripts and had to write my own to do it, plus specifying some additional install-time kernel params and the relevant switch config. So does your way actually generate the correct ifcfg files?

To me, your symptoms are consistent with an interface that is still dhcp-ing but receiving an empty dns config, or an ifcfg file with an empty "DNS=" param. And even with NetworkManager disabled I routinely add "PEERDNS=no" (and "DEFROUTE=no") to the nicextraparams.* setting for all secondary nics, though it is the route that usually bites me there.

Could you share `ip -o a` and the ifcfg files?

Jon
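For illustration, a secondary-nic ifcfg file carrying the two extra parameters mentioned above might look roughly like the sketch below. The device name eth1 and the use of DHCP are assumptions made for the example, not taken from the thread. ifcfg files use plain shell variable syntax, which is why the fragment is shown as shell.

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth1  (hypothetical secondary nic)
DEVICE=eth1
BOOTPROTO=dhcp
ONBOOT=yes
# Do not let this interface's DHCP lease rewrite /etc/resolv.conf
PEERDNS=no
# Do not install a default route via this interface
DEFROUTE=no
```

With both set to "no", a lease on the secondary nic that carries an empty DNS list or its own gateway should leave the name servers and default route of the primary interface alone.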
--
Dr. Jonathan Diprose <[email protected]>
Tel: 01865 287873
Research Computing Manager
Henry Wellcome Building for Genomic Medicine
Roosevelt Drive, Headington, Oxford OX3 7BN

________________________________
From: Vinícius Ferrão via xCAT-user [[email protected]]
Sent: 16 June 2021 04:15
To: xCAT Users Mailing list
Cc: Vinícius Ferrão
Subject: Re: [xcat-user] /etc/resolv.conf missing nameserver on install nodes

I was able to at least stop /etc/resolv.conf from being overwritten at every reboot with the following file:

# cat /etc/NetworkManager/conf.d/90-dns-none.conf
[main]
dns=none

I added this to the synclists, and we are good as far as the /etc/resolv.conf issue goes. The conclusion is that NetworkManager was doing something wrong with /etc/resolv.conf.

Although that was fixed with a hack, there are consequences: the hostname of the machine is set to localhost.localdomain, and I don't know how to fix it. Is there any option in the node table to set the default hostname, so that confignetwork can do its job?

# lsdef login
Object name: login
    arch=x86_64
    bmc=172.25.255.253
    bmcpassword=calvin
    bmcusername=root
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=install ol8.4.0-x86_64-compute
    groups=login,all
    ip=172.26.255.253
    mac=2c:ea:7f:92:aa:d9
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0=ens1f0np0|ens1f1np1
    nicdevices.bond0.1010=bond0
    nichostnamesuffixes.bond0.1010=-ceph
    nicips.ib0=172.27.255.253
    nicips.eno1=XXX.XXX.XXX.XXX
    nicips.bond0=172.26.255.253
    nicips.bond0.1010=10.0.255.253
    nicnetworks.ib0=application
    nicnetworks.eno1=site
    nicnetworks.bond0=management
    nicnetworks.bond0.1010=ceph
    nictypes.ens1f1np1=ethernet
    nictypes.bond0=bond
    nictypes.eno1=ethernet
    nictypes.ib0=Infiniband
    nictypes.bond0.1010=vlan
    nictypes.ens1f0np0=ethernet
    os=ol8.4.0
    postbootscripts=otherpkgs,versatushpc/openpbs-login,versatushpc/fix-ohpc-login
    postscripts=syslog,remoteshell,syncfiles,confignetwork,versatushpc/postinstall-login
    profile=compute
    provmethod=ol8.4.0-x86_64-install-login
    serialport=0
    serialspeed=115200
    status=powering-on
    statustime=06-15-2021 16:29:52
    updatestatus=failed
    updatestatustime=06-15-2021 16:27:27

Thanks,
Vinícius.

On 14 Jun 2021, at 13:48, Vinícius Ferrão via xCAT-user <[email protected]> wrote:

Hi Thomas,

There's a pattern that I've found. When the compute node is simple enough, it works, probably because the data for resolv.conf is fetched directly from DHCP, which should be configured correctly.

The issue is with the nodes that have custom network schemes, like bonds and VLANs; something goes wrong during the confignetwork postscript. It is probably due to a configuration mistake that I've made, but I don't know which one.

Regarding your questions:

1) It does not exist

[root@ceph01-ib0 ~]# systemctl status systemd-networkd
Unit systemd-networkd.service could not be found.

2) It's running

[root@ceph01-ib0 ~]# systemctl status NetworkManager
● NetworkManager.service - Network Manager
   Loaded: loaded (/usr/lib/systemd/system/NetworkManager.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-06-14 13:37:20 -03; 8min ago
     Docs: man:NetworkManager(8)
 Main PID: 2028 (NetworkManager)
    Tasks: 3 (limit: 2464038)
   Memory: 11.4M
   CGroup: /system.slice/NetworkManager.service
           └─2028 /usr/sbin/NetworkManager --no-daemon

3) It does not exist:

[root@ceph01-ib0 ~]# ls -l /etc/resolv.conf
-rw-r--r-- 1 root root 65 Jun 14 13:37 /etc/resolv.conf
[root@ceph01-ib0 ~]# ls -l /run/systemd/resolv/resolv.conf
ls: cannot access '/run/systemd/resolv/resolv.conf': No such file or directory

Cannot find anything related to rc-manager, is this a systemd thing?

4) No it's not.

[root@ceph01-ib0 ~]# ls -l /etc/resolv.conf
-rw-r--r-- 1 root root 65 Jun 14 13:37 /etc/resolv.conf

5) Seems default to me

[root@ceph01-ib0 ~]# grep host /etc/nsswitch.conf
# Valid databases are: aliases, ethers, group, gshadow, hosts,
# myhostname Use systemd host names
hosts: files dns myhostname

That's it. It's probably something messy with the confignetwork script, but I'm not sure what.

Thanks,

On 14 Jun 2021, at 07:57, Thomas HUMMEL <[email protected]> wrote:

On 14/06/2021 07:41, Vinícius Ferrão via xCAT-user wrote:

Hello,

For unknown reasons, nodes that I've installed with rinstall (using the stateful method) didn't get the nameserver section in resolv.conf, basically leaving the node without any name resolution.

Hello,

assuming it is not an xCAT bug, I would look at

1) if systemd-networkd is enabled
2) if NetworkManager is enabled
3) if 2), whether it handles /etc/resolv.conf, by looking at its conf:
   a) is dns= stated ?
   b) is /etc/resolv.conf a symlink to /run/systemd/resolv/resolv.conf ?
   c) is rc-manager stated ?
4) is /etc/resolv.conf a symlink to ../run/resolvconf/resolv.conf ?
5) the hosts line of /etc/nsswitch.conf

to figure out who manages /etc/resolv.conf.

Hope it helps.

--
Thomas HUMMEL

As specified in the documentation at https://xcat-docs.readthedocs.io/en/stable/advanced/domain_name_resolution/domain_name_resolution.html, it should be generated if nameservers and domain are provided in the site table:

    The resolv.conf files for the compute nodes will be created automatically using the domain and nameservers values set in the xCAT network or site definition.

Both are defined, but it still didn't generate it correctly.

[root@headnode ~]# lsdef -t site clustersite | egrep "nameserver|forward|domain"
    domain=cluster.domain.tld
    forwarders=1.1.1.1
    nameservers=172.26.255.254

I even tried adding the nameservers to the network definition, but it was a no go:

[root@headnode ~]# lsdef -t network management
Object name: management
    gateway=<xcatmaster>
    mask=255.255.0.0
    mgtifname=bond0
    mtu=1500
    nameservers=172.26.255.254
    net=172.26.0.0
    tftpserver=<xcatmaster>

Is there anything that I can do to debug this?

Thanks,
Vinícius.

PS: Here's full data from a given node and the networks.

[root@headnode ~]# lsdef ceph01
Object name: ceph01
    arch=x86_64
    bmc=172.25.254.1
    bmcpassword=calvin
    bmcusername=root
    cons=ipmi
    consoleenabled=1
    currchain=boot
    currstate=install ol8.4.0-x86_64-compute
    groups=ceph,all
    ip=172.26.254.1
    mac=bc:97:e1:ea:08:b0
    mgt=ipmi
    netboot=xnba
    nicdevices.bond0.123=bond0
    nicdevices.bond0.1010=bond0
    nicdevices.bond0=ens1f0np0|ens1f1np1
    nichostnamesuffixes.bond0.1010=-ceph
    nichostnamesuffixes.bond0.123=-cephsync
    nicips.ib0=172.27.254.1
    nicips.bond0=172.26.254.1
    nicips.bond0.1010=10.0.10.21
    nicips.bond0.123=192.168.168.21
    nicnetworks.bond0.123=ceph-sync
    nicnetworks.ib0=application
    nicnetworks.bond0.1010=ceph
    nicnetworks.bond0=management
    nictypes.ib0=Infiniband
    nictypes.ens1f0np0=ethernet
    nictypes.bond0.1010=vlan
    nictypes.bond0=bond
    nictypes.ens1f1np1=ethernet
    nictypes.bond0.123=vlan
    os=ol8.4.0
    postbootscripts=otherpkgs,confignics
    postscripts=syslog,remoteshell,syncfiles,confignetwork,versatushpc/postinstall-ceph
    profile=compute
    provmethod=ol8.4.0-x86_64-install-ceph
    serialport=0
    serialspeed=115200
    status=booted
    statustime=06-14-2021 02:37:04
    updatestatus=synced
    updatestatustime=06-14-2021 02:01:55

[root@headnode ~]# lsdef -t network
application (network)
ceph (network)
ceph-sync (network)
libvirt (network)
management (network)
service (network)
site (network)

_______________________________________________
xCAT-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/xcat-user
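
The checks Thomas lists in the thread (is /etc/resolv.conf a symlink, does it contain any nameserver lines, are dns= or rc-manager= set for NetworkManager, what does the hosts line in nsswitch.conf say) can be gathered into one small script to run on an affected node. The following is only a sketch: the function names are hypothetical, and taking the last matching key= line across the listed files is a simplification of how NetworkManager actually merges its configuration.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical, not part of xCAT or NetworkManager.

# Print how many "nameserver" lines a resolv.conf-style file contains.
count_nameservers() {
    if [ -r "$1" ]; then
        # grep -c prints 0 but exits non-zero when nothing matches.
        grep -c '^[[:space:]]*nameserver[[:space:]]' "$1" || true
    else
        echo 0
    fi
}

# Print "symlink:<target>", "regular" or "missing" for a path.
resolv_conf_kind() {
    if [ -L "$1" ]; then
        echo "symlink:$(readlink "$1")"
    elif [ -e "$1" ]; then
        echo "regular"
    else
        echo "missing"
    fi
}

# Print the last "<key>=" value found in the given NetworkManager-style files.
# Usage: nm_setting KEY FILE...
nm_setting() {
    key=$1
    shift
    [ "$#" -gt 0 ] || return 0
    cat "$@" 2>/dev/null |
        sed -n "s/^[[:space:]]*${key}[[:space:]]*=[[:space:]]*//p" |
        tail -n 1
}

# Print a short summary for the given resolv.conf (default: /etc/resolv.conf).
report() {
    resolv=${1:-/etc/resolv.conf}
    echo "resolv.conf:         $(resolv_conf_kind "$resolv")"
    echo "nameserver entries:  $(count_nameservers "$resolv")"
    for k in dns rc-manager; do
        v=$(nm_setting "$k" /etc/NetworkManager/NetworkManager.conf /etc/NetworkManager/conf.d/*.conf)
        echo "NetworkManager $k=: ${v:-(not set)}"
    done
    echo "nsswitch hosts line: $(grep '^hosts:' /etc/nsswitch.conf 2>/dev/null)"
}

report "$@"
```

A regular file with zero nameserver entries, with neither dns= nor rc-manager= set, would match what is described in the thread: NetworkManager is running with its defaults and rewriting the file from whatever DNS data the active connections provide.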
