On the second DHCP server, can you run `makedhcp -d 00:25:90:5a:eb:8a`? We want to remove this MAC address from the DHCP lease file on the server 130.246.32.86.
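The reason for this request is that two DHCP servers answer the discover for this MAC address, as the detect_dhcpd output quoted later in this thread shows. As a rough sketch only (this is not part of xCAT), that output could be screened for offers from unexpected servers as below. The function names and the regular expression are assumptions based solely on the two "[INFO] Server:..." lines quoted in the thread; the sample text is copied from them, truncation included.

```python
import re

# Assumed line shape, taken from the thread: "[INFO] Server:<ip> assign IP [<ip>]. ..."
OFFER_RE = re.compile(r"Server:(\S+) assign IP \[([^\]]+)\]")


def parse_offers(output):
    """Return (server_ip, offered_ip) tuples found in the probe output."""
    return OFFER_RE.findall(output)


def unexpected_servers(output, expected_server):
    """Return the offers that came from any server other than expected_server."""
    return [(srv, ip) for srv, ip in parse_offers(output) if srv != expected_server]


# Sample copied from the detect_dhcpd output in this thread (lines are truncated there).
sample = """\
[INFO] There are 2 servers replied to dhcp discover.
[INFO] Server:130.246.32.140 assign IP [130.246.32.141]. The next server i...
[INFO] Server:130.246.32.100 assign IP [130.246.32.86]. The next server is...
"""

for server, offered in unexpected_servers(sample, "130.246.32.140"):
    print(f"unexpected DHCP offer: server {server} offered {offered}")
```

With the master (130.246.32.140) given as the expected server, the only offer reported is the one from 130.246.32.100.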
Can you also check /var/log/console/proc01.log and search for what "Next server" is?

Thanks,
Casandra Qiu
...................................................................
Casandra Hong Qiu
Phone: (845) 433-9291, t/l 293-9291
Office: Building 8, 3-B-04
cxh...@us.ibm.com

From: "Chiu, Peter (STFC,RAL,RALSP)" <peter.c...@stfc.ac.uk>
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Date: 08/05/2020 04:04 PM
Subject: [EXTERNAL] Re: [xcat-user] xCAT 2.16 Centos 7 clients PXE Boot Aborted after 3.10.0-1127.18.2 upgrade

Hello Casandra,

Thanks for your reply. Below is the output from your suggestions.

We do have a second DHCP server, but due to the small number of compute nodes, we have restricted our master server to serve only a few nodes: 130.246.32.141 – 130.246.32.155. I have attached a copy of the dhcpd.conf here.

I did enter the chdef command and restarted the client. It still fails with "PXE Boot aborted". I don’t think it gets far enough to allow me to ssh in.

Any further thoughts? Thanks.

Regards,
Peter

1. xcatprobe xcatmn -i bond0

[root@main ~]# netstat -nr # to determine interface
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         130.246.32.254  0.0.0.0         UG        0 0          0 bond0
130.246.32.0    0.0.0.0         255.255.252.0   U         0 0          0 bond0
172.17.0.0      0.0.0.0         255.255.0.0     U         0 0          0 docker0
[root@main ~]# xcatprobe xcatmn -i bond0
[mn]: Checking all xCAT daemons are running... [ OK ]
[mn]: Checking xcatd can receive command request... [ OK ]
[mn]: Checking 'site' table is configured... [ OK ]
[mn]: Checking provision network is configured... [ OK ]
[mn]: Checking 'passwd' table is configured... [ OK ]
[mn]: Checking important directories(installdir,tftpdir) are configured...[ OK ]
[mn]: Checking SELinux is disabled... [ OK ]
[mn]: Checking HTTP service is configured... [ OK ]
[mn]: Checking TFTP service is configured... [ OK ]
[mn]: Checking DNS service is configured... [ OK ]
[mn]: Checking DHCP service is configured...
[ OK ]
[mn]: Checking NTP service is configured... [ OK ]
[mn]: Checking rsyslog service is configured... [ OK ]
[mn]: Checking firewall is disabled... [ OK ]
[mn]: Checking minimum disk space for xCAT ['/var' needs 1GB;'/install'...[ OK ]
[mn]: Checking Linux ulimits configuration... [ OK ]
[mn]: Checking network kernel parameter configuration... [ OK ]
[mn]: Checking xCAT daemon attributes configuration... [ OK ]
[mn]: Checking xCAT log is stored in /var/log/xcat/cluster.log... [ OK ]
[mn]: Checking xCAT management node IP: <130.246.32.140> is configured ...[ OK ]
[mn]: Checking dhcpd.leases file is less than 100M... [ OK ]
=================================== SUMMARY ===========================...
[MN]: Checking on MN... [ OK ]
[root@main ~]# df -hl /var
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc7       207G   80G  117G  41% /

2. xcatprobe detect_dhcpd -i bond0 -m 00:25:90:5a:eb:8a

[root@main ~]# xcatprobe detect_dhcpd -i bond0 -m 00:25:90:5a:eb:8a
Start to detect DHCP, please wait 10 seconds
[INFO] ++++++++++++++++++++++++++++++++++
[INFO] There are 2 servers replied to dhcp discover.
[INFO] Server:130.246.32.140 assign IP [130.246.32.141]. The next server i...
[INFO] Server:130.246.32.100 assign IP [130.246.32.86]. The next server is...
[INFO] ++++++++++++++++++++++++++++++++++
[INFO]
[root@main ~]#

3. The client host is registered in DNS:

[root@main ~]# nslookup proc01
Server:         130.246.188.240
Address:        130.246.188.240#53

Name:   proc01.bnsc.rl.ac.uk
Address: 130.246.32.141

4. I did try chdef -t site clustersite xcatdebugmode=2:

[root@main ~]# chdef -t site clustersite xcatdebugmode=2
1 object definitions have been created or modified.

I restarted the client proc01, but it still fails with "PXE Boot aborted". The client does not respond to ping.
[root@main ~]# ping 130.246.32.141
--- 130.246.32.141 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2000ms

From: Casandra H Qiu <cxh...@us.ibm.com>
Sent: 05 August 2020 20:15
To: xCAT Users Mailing list <xcat-user@lists.sourceforge.net>
Subject: Re: [xcat-user] xCAT 2.16 Centos 7 clients PXE Boot Aborted after 3.10.0-1127.18.2 upgrade

Maybe try some verification commands:

xcatprobe xcatmn -i <provision interface>
xcatprobe detect_dhcpd -i <provision interface> -m <CN's mac address>   <<<<------ this is important; make sure no other server services this MAC address
nslookup <name of CN>

You can also turn on debug mode; you may be able to ssh to the CN in anaconda mode to do more debugging:

chdef -t site clustersite xcatdebugmode=2

Thanks,
Casandra Qiu

From: "Chiu, Peter (STFC,RAL,RALSP)" <peter.c...@stfc.ac.uk>
To: "xcat-user@lists.sourceforge.net" <xcat-user@lists.sourceforge.net>
Date: 08/05/2020 05:12 AM
Subject: [EXTERNAL] [xcat-user] xCAT 2.16 Centos 7 clients PXE Boot Aborted after 3.10.0-1127.18.2 upgrade

Hello all,

Having enjoyed smooth running of xCAT 2.16 on a small cluster of 4 compute nodes for about 18 months, I have hit a problem with the last Centos 7 system update, from 3.10.0-1127.13.1 to 3.10.0-1127.18.2, which resulted in none of the clients booting. They fail with this error:

CLIENT MAC ADDR: 00 25 90 5A EB BA  GUID: 00000000 0025905AEBBA
CLIENT IP: 130.246.32.141  MASK: 255.25.252.0  DHCP IP: 130.246.32.140
GATEWAY IP: 130.246.32.254
PXE Boot aborted. Booting to next device...
PXE-M0F: Exiting Intel Boot Agent

There was no such problem with the previous Centos 7 updates. The last successful update, to 3.10.0-1127.13.1 on 24 June, went through okay.

I have attempted a number of checks and recoveries, but no joy:

a. confirm master can accept DHCP requests, with DHCPNAK records in messages.
b. confirm master can accept tftp downloads from a different system.
c. confirm master can accept http downloads of the kernel and ramdisk files.
d. power-cycled master
e. power-cycled clients
f. manually ran lsdef -t osimage, chdef -t osimage, genimage, packimage, nodeset

Unfortunately the above fault persists on all four compute nodes, which are still down.

I think I have run out of ideas. Before giving up and making a fresh xCAT installation, I wonder if anyone can offer some clues for troubleshooting "PXE Boot aborted" errors.

Many thanks.

Peter Chiu
STFC RAL Space, UK

==============================================================================

Here are some details on the systems:

Master node: main.bnsc.rl.ac.uk 130.246.32.140/22 gateway 130.246.32.254
Compute node1: proc01.bnsc.rl.ac.uk 130.246.32.141/22 00:25:90:5a:eb:8a
Operating system: CentOS Linux release 7.8.2003 (Core)
xCAT:
# rpm -qf /opt/xcat/sbin/xcatd
xCAT-server-2.16-snap202006161607.noarch

Checks:

a. DHCP records in the master's /var/log/messages: no errors. The master server has picked up the DHCP requests and offered the address, but there is no further communication afterwards.

Aug 4 15:00:27 main dhcpd: DHCPDISCOVER from 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:27 main dhcpd: DHCPOFFER on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:29 main dhcpd: Dynamic and static leases present for 130.246.32.141.
Aug 4 15:00:29 main dhcpd: Remove host declaration proc01 or remove 130.246.32.141
Aug 4 15:00:29 main dhcpd: from the dynamic address pool for bond0
Aug 4 15:00:29 main dhcpd: DHCPREQUEST for 130.246.32.141 (130.246.32.140) from 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:29 main dhcpd: DHCPACK on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0

b. /var/log/xcat/cluster.log
No errors, just a record of a new image produced.

Aug 4 14:24:54 main xcat[28101]: INFO xCAT: Allowing lsdef -t site -o clustersite -i installdir for root from localhost
Aug 4 14:24:54 main xcat[28103]: INFO xCAT: Allowing genimage -i eth0 -n dca,ixgbe,igb,e1000e,e1000,tg3 -o centos7.6 -p compute --tempfile /tmp/xcat_genimage.28086 for root from localhost
Aug 4 14:27:29 main xcat[25483]: INFO xCAT: Allowing packimage centos7.6-x86_64-netboot-compute for root from localhost
Aug 4 14:27:30 main xcat[25499]: INFO xCAT: Allowing ilitefile centos7.6-x86_64-statelite-compute for root from localhost
Aug 4 14:30:07 main xcat[26073]: INFO xCAT: Allowing nodeset to compute osimage=centos7.6-x86_64-netboot-compute for root from localhost
Aug 4 14:34:33 main xcat[26958]: INFO xCAT: Allowing rpower to compute reset for root from localhost
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc03: changing status=powering-on
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc04: changing status=powering-on
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc01: changing status=powering-on
Aug 4 14:34:33 main xcat[26959]: INFO xcat.updatestatus - proc02: changing status=powering-on

c.
Check the DHCP lease file for the files to be downloaded:

less /var/lib/dhcpd/dhcpd.leases

host proc01.bnsc.rl.ac.uk {
  deleted;
}
host proc04.bnsc.rl.ac.uk {
  deleted;
}
host proc01 {
  dynamic;
  hardware ethernet 00:25:90:5a:eb:8a;
  uid 00:25:90:5a:eb:8a;
  fixed-address 130.246.32.141;
  supersede server.ddns-hostname = "proc01";
  supersede host-name = "proc01";
  if option user-class-identifier = "xNBA" and option client-architecture = 00:00 {
    supersede server.always-broadcast = 01;
    supersede server.filename = "http://${next-server}:80/tftpboot/xcat/xnba/nodes/proc01";
  } elsif option user-class-identifier = "xNBA" and option client-architecture = 00:09 {
    supersede server.filename = "http://${next-server}:80/tftpboot/xcat/xnba/nodes/proc01.uefi";
  } elsif option client-architecture = 00:07 {
    supersede server.filename = "xcat/xnba.efi";
  } elsif option client-architecture = 00:00 {
    supersede server.filename = "xcat/xnba.kpxe";
  } else {
    supersede server.filename = "";
  }
}

Follow through this list to download the files on a separate Centos server.

d. tftp 130.246.32.140

[root@cds1 xcat]# tftp 130.246.32.140
tftp> get xcat/xnba.kpxe
tftp> get xcat/xnba.efi
tftp> get yaboot
tftp> get xcat/xnba/nets/130.246.32.0_22
tftp> get xcat/xnba/nets/130.246.32.0_22.uefi
tftp> quit
[root@cds1 xcat]# ls
130.246.32.0_22  130.246.32.0_22.uefi  elilo.efi  xnba.efi  xnba.kpxe  yaboot
[root@cds1 xcat]# ls -ls
total 536
  4 -rw-r--r-- 1 root root    252 Aug 4 09:46 130.246.32.0_22
  4 -rw-r--r-- 1 root root    116 Aug 4 09:46 130.246.32.0_22.uefi
  0 -rw-r--r-- 1 root root      0 Aug 4 09:45 elilo.efi
140 -rw-r--r-- 1 root root 139169 Aug 4 09:45 xnba.efi
 80 -rw-r--r-- 1 root root  74786 Aug 4 09:45 xnba.kpxe
308 -rw-r--r-- 1 root root 310187 Aug 4 09:46 yaboot

e.
Use wget to download the node startup file: wget http://130.246.32.140:80/tftpboot/xcat/xnba/nodes/proc01

[root@cds1 xcat]# wget http://130.246.32.140:80/tftpboot/xcat/xnba/nodes/proc01
--2020-08-04 11:57:18-- http://130.246.32.140/tftpboot/xcat/xnba/nodes/proc01
Connecting to 130.246.32.140:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 528
Saving to: `proc01'
100%[======================================>] 528 --.-K/s in 0s
2020-08-04 11:57:18 (85.2 MB/s) - `proc01' saved [528/528]

f. This file in turn contains the instructions to download the kernel and ramdisk:

[root@cds1 xcat]# less proc01
#!gpxe
#netboot centos7.6-x86_64-compute
imgfetch -n kernel http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/kernel
imgload kernel
imgargs kernel imgurl=http://130.246.32.140:80//install/netboot/centos7.6/x86_64/compute/rootimg.cpio.gz XCAT=130.246.32.140:3001 NODE=proc01 FC=yes XCATHTTPPORT=80 netdev=eth0 selinux=0 biosdevname=0 net.ifnames=0 BOOTIF=01-${netX/machyp}
imgfetch http://${next-server}:80/tftpboot/xcat/osimage/centos7.6-x86_64-netboot-compute/initrd-stateless.gz
imgexec kernel

Both the kernel and ramdisk can also be downloaded using the wget command.

This email and any attachments are intended solely for the use of the named recipients. If you are not the intended recipient you must not use, disclose, copy or distribute this email or any of its attachments and should notify the sender immediately and delete this email from your system. UK Research and Innovation (UKRI) has taken every reasonable precaution to minimise risk of this email or any attachments containing viruses or malware but the recipient should carry out its own virus and malware checks before opening the attachments. UKRI does not accept any liability for any losses or damages which the recipient may sustain due to presence of any viruses.
Opinions, conclusions or other information in this message and attachments that are not related directly to UKRI business are solely those of the author and do not represent the views of UKRI.

_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user

[attachment "dhcpd.conf" deleted by Casandra H Qiu/Poughkeepsie/IBM]
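For the dhcpd records quoted under check (a) in the first message, a small script can pull out the message sequence for one MAC address and any static/dynamic lease warnings. This is only a sketch: the function names and regular expressions are assumptions derived from the log lines quoted in this thread, not from any dhcpd specification, and the sample is copied from those lines.

```python
import re

# Assumed shapes, taken from the /var/log/messages lines quoted in this thread.
MSG_RE = re.compile(r"dhcpd: (DHCP[A-Z]+) .*?\b([0-9a-f]{2}(?::[0-9a-f]{2}){5})\b")
CONFLICT_RE = re.compile(r"dhcpd: Dynamic and static leases present for (\S+?)\.?$")


def handshake_for(lines, mac):
    """Return the DHCP message types logged for the given MAC, in log order."""
    mac = mac.lower()
    steps = []
    for line in lines:
        match = MSG_RE.search(line)
        if match and match.group(2) == mac:
            steps.append(match.group(1))
    return steps


def lease_conflicts(lines):
    """Return addresses that dhcpd reports as both a static and a dynamic lease."""
    found = []
    for line in lines:
        match = CONFLICT_RE.search(line.strip())
        if match:
            found.append(match.group(1))
    return found


# Sample copied from the first message in this thread.
log = """\
Aug 4 15:00:27 main dhcpd: DHCPDISCOVER from 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:27 main dhcpd: DHCPOFFER on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:29 main dhcpd: Dynamic and static leases present for 130.246.32.141.
Aug 4 15:00:29 main dhcpd: Remove host declaration proc01 or remove 130.246.32.141
Aug 4 15:00:29 main dhcpd: from the dynamic address pool for bond0
Aug 4 15:00:29 main dhcpd: DHCPREQUEST for 130.246.32.141 (130.246.32.140) from 00:25:90:5a:eb:8a via bond0
Aug 4 15:00:29 main dhcpd: DHCPACK on 130.246.32.141 to 00:25:90:5a:eb:8a via bond0
""".splitlines()

print("handshake:", handshake_for(log, "00:25:90:5a:eb:8a"))
print("lease conflicts:", lease_conflicts(log))
```

On this sample the sequence on the master runs from DHCPDISCOVER through DHCPACK, and the one warning concerns 130.246.32.141. The script only summarises what the master logged; it says nothing about what the second DHCP server sent to the client.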
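The addresses in this thread can be sanity-checked with Python's standard ipaddress module. The sketch below assumes the values stated in the messages: the /22 provisioning network and the 130.246.32.141 to 130.246.32.155 range that the master is restricted to. The constant and function names are made up for illustration. It also prints the netmask that corresponds to /22, which is worth comparing with the "MASK: 255.25.252.0" in the PXE screen transcription in the first message; that value is not a valid netmask and may simply be a typing slip.

```python
import ipaddress

# Values taken from this thread; the names are hypothetical.
NETWORK = ipaddress.ip_network("130.246.32.0/22")
MASTER_FIRST = ipaddress.ip_address("130.246.32.141")
MASTER_LAST = ipaddress.ip_address("130.246.32.155")


def classify(addr):
    """Describe where an offered address falls relative to the master's range."""
    ip = ipaddress.ip_address(addr)
    if ip not in NETWORK:
        return "outside the provisioning network"
    if MASTER_FIRST <= ip <= MASTER_LAST:
        return "inside the master's range"
    return "on the provisioning network but outside the master's range"


print("netmask for /22:", NETWORK.netmask)
for offered in ("130.246.32.141", "130.246.32.86"):
    print(offered, "is", classify(offered))
```

The address offered by the second server, 130.246.32.86, sits on the same /22 network but outside the range the master serves, which is consistent with two servers holding different leases for the same MAC address.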