Re: [ovirt-devel] [vdsm] strange network test failure on FC23
On Fri, Nov 27, 2015 at 07:09:30PM +0100, David Caro wrote:
>
> I see though that is leaving a bunch of test interfaces in the slave:
>
> 2753: vdsmtest-gNhf3: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 86:73:13:4c:e2:63 brd ff:ff:ff:ff:ff:ff
> 2767: vdsmtest-aX5So: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 9e:fa:75:3e:a3:e6 brd ff:ff:ff:ff:ff:ff
> 2768: vdsmtest-crso1: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 22:ce:cb:5c:42:3b brd ff:ff:ff:ff:ff:ff
> 2772: vdsmtest-JDc5P: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether ae:79:dc:e9:22:9a brd ff:ff:ff:ff:ff:ff
>
> Can we do a cleanup in the tests and remove those? That might collide with
> other tests and create failures.

These bridges are no longer created on master (thanks to Nir's
http://gerrit.ovirt.org/44111).

They should have been removed by the run_tests that created them, but this
may not happen if it is killed (or dies) beforehand.

___
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra
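A minimal sketch of how leftover test interfaces like the ones listed above could be found and queued for removal. It is not what vdsm's tests do: the function names, the `vdsmtest-` default prefix and the idea of parsing `ip link` text output are assumptions made for illustration. The second helper only builds the `ip link del dev NAME` argument list; actually running it needs root.

```python
import re

# Matches the first line of each `ip link` entry, e.g.
# "2753: vdsmtest-gNhf3: mtu 1500 ..." (also "name@parent:" forms).
_LINK_RE = re.compile(r"^\d+:\s+([^:@\s]+)[:@]")


def leftover_test_links(ip_link_output, prefix="vdsmtest-"):
    """Return interface names in `ip link` output that start with prefix.

    Hypothetical helper: the prefix default is taken from the listing in
    this thread, not from the vdsm source.
    """
    names = []
    for line in ip_link_output.splitlines():
        match = _LINK_RE.match(line.strip())
        if match and match.group(1).startswith(prefix):
            names.append(match.group(1))
    return names


def delete_link_cmd(name):
    """Build (but do not run) the command that removes one interface."""
    return ["/sbin/ip", "link", "del", "dev", name]
```

A cleanup step could call `leftover_test_links()` before and after a test run and delete whatever is left, which would also cover the case where run_tests was killed before its own teardown.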
Re: [ovirt-devel] [vdsm] strange network test failure on FC23
> On 29 Nov 2015, at 17:34, Nir Soffer wrote:
>
> On Sun, Nov 29, 2015 at 6:01 PM, Yaniv Kaul wrote:
> > On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer wrote:
> >>
> >> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul wrote:
> >> >
> >> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani wrote:
> >> >>
> >> >> Using taskset, the ip command now takes a little longer to complete.

I fail to find the original reference for this.
Why does it take longer? Is it purely the additional taskset executable
invocation? On a busy system we have these issues all the time, with lvm,
etc., so I don't think it's significant.

> >> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> >> > just use 0x3, as the man suggests) might be a tiny of a fraction faster to
> >> > execute taskset with, instead of the need to translate the numeric CPU list.
> >>
> >> Creating the string "0-" is one line in vdsm. The code handling this in
> >> taskset is written in C, so the parsing time is practically zero. Even
> >> if it was non-zero, this code run once when we run a child process, so
> >> the cost is insignificant.
> >
> > I think it's easier to just to have it as a mask in a config item somewhere,
> > without need to create it or parse it anywhere.
> > For us and for the user.
>
> We have this option in /etc/vdsm/vdsm.conf:
>
> # Comma separated whitelist of CPU cores on which VDSM is allowed to
> # run. The default is "", meaning VDSM can be scheduled by the OS to
> # run on any core. Valid examples: "1", "0,1", "0,2,3"
> # cpu_affinity = 1
>
> I think this is the easiest option for users.

+1

> >> > However, the real concern is making sure CPUs 0 & 1 are not really too busy
> >> > with stuff (including interrupt handling, etc.)
> >>
> >> This code is used when we run a child process, to allow the child
> >> process to run on all cpus (in this case, cpu 0 and cpu 1). So I think
> >> there is no concern here.
> >>
> >> Vdsm itself is running by default on cpu 1, which should be less busy
> >> then cpu 0.
> >
> > I assume those are cores, which probably in a multi-socket will be in the
> > first socket only.
> > There's a good chance that the FC and or network/cards will also bind their
> > interrupts to core0 & core 1 (check /proc/interrupts) on the same socket.
> > From my poor laptop (1s, 4c):
> >  42:    1487104       9329       4042       3598  IR-PCI-MSI 512000-edge   :00:1f.2
> > (my SATA controller)
> >
> >  43:   14664923         34         18         13  IR-PCI-MSI 327680-edge   xhci_hcd
> > (my dock station connector)
> >
> >  45:    6754579       4437       2501       2419  IR-PCI-MSI 32768-edge    i915
> > (GPU)
> >
> >  47:     187409      11627       1235       1259  IR-PCI-MSI 2097152-edge  iwlwifi
> > (NIC, wifi)
>
> Interesting, here an example from a 8 cores machine running my vms:
>
> [nsoffer@jumbo ~]$ cat /proc/interrupts
>            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
>   0:         31          0          0          0          0          0          0          0  IR-IO-APIC-edge timer
>   1:          2          0          0          1          0          0          0          0  IR-IO-APIC-edge i8042
>   8:          0          0          0          0          0          0          0          1  IR-IO-APIC-edge rtc0
>   9:          0          0          0          0          0          0          0          0  IR-IO-APIC-fasteoi acpi
>  12:          3          0          0          0          0          0          1          0  IR-IO-APIC-edge i8042
>  16:          4          4          9          0          9          1          1          3  IR-IO-APIC 16-fasteoi ehci_hcd:usb3
>  23:         13          1          5          0         12          1          1          0  IR-IO-APIC 23-fasteoi ehci_hcd:usb4
>  24:          0          0          0          0          0          0          0          0  DMAR_MSI-edge dmar0
>  25:          0          0          0          0          0          0          0          0  DMAR_MSI-edge dmar1
>  26: 36703542159062370491124 169 54  IR-PCI-MSI-edge :00:1f.2
>  27:          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge xhci_hcd
>  28:  166285414          0          3          0          4          0          0          0  IR-PCI-MSI-edge em1
>  29:         18          0          0          0          4          3          0          0  IR-PCI-MSI-edge mei_me
>  30: 1151 17 0 3169 26 94
Re: [ovirt-devel] [vdsm] strange network test failure on FC23
- Original Message -
> From: "Michal Skrivanek" <mskri...@redhat.com>
> To: "Nir Soffer" <nsof...@redhat.com>, "Francesco Romani" <from...@redhat.com>
> Cc: "Yaniv Kaul" <yk...@redhat.com>, "infra" <infra@ovirt.org>, "devel" <de...@ovirt.org>
> Sent: Monday, November 30, 2015 9:52:59 AM
> Subject: Re: [ovirt-devel] [vdsm] strange network test failure on FC23
>
> > On 29 Nov 2015, at 17:34, Nir Soffer <nsof...@redhat.com> wrote:
> >
> > On Sun, Nov 29, 2015 at 6:01 PM, Yaniv Kaul <yk...@redhat.com> wrote:
> > > On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer <nsof...@redhat.com> wrote:
> > >>
> > >> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul <yk...@redhat.com> wrote:
> > >> >
> > >> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani <from...@redhat.com> wrote:
> > >> >>
> > >> >> Using taskset, the ip command now takes a little longer to complete.
>
> I fail to find the original reference for this.
> Why does it take longer? is it purely the additional taskset executable
> invocation? On busy system we do have these issues all the time, with lvm,
> etc…so I don’t think it’s significant

Yep, that's only the overhead of the taskset executable.

> > >> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> > >> > just use 0x3, as the man suggests) might be a tiny of a fraction faster to
> > >> > execute taskset with, instead of the need to translate the numeric CPU list.
> > >>
> > >> Creating the string "0-" is one line in vdsm. The code handling this in
> > >> taskset is written in C, so the parsing time is practically zero. Even
> > >> if it was non-zero, this code run once when we run a child process, so
> > >> the cost is insignificant.
> > >
> > > I think it's easier to just to have it as a mask in a config item somewhere,
> > > without need to create it or parse it anywhere.
> > > For us and for the user.
> >
> > We have this option in /etc/vdsm/vdsm.conf:
> >
> > # Comma separated whitelist of CPU cores on which VDSM is allowed to
> > # run. The default is "", meaning VDSM can be scheduled by the OS to
> > # run on any core. Valid examples: "1", "0,1", "0,2,3"
> > # cpu_affinity = 1
> >
> > I think this is the easiest option for users.
>
> +1

+1, modulo the changes we need to fix
https://bugzilla.redhat.com/show_bug.cgi?id=1286462
(patch is coming)

> > > I assume those are cores, which probably in a multi-socket will be in the
> > > first socket only.
> > > There's a good chance that the FC and or network/cards will also bind their
> > > interrupts to core0 & core 1 (check /proc/interrupts) on the same socket.
> > > From my poor laptop (1s, 4c):

Yes, especially core0 (since 0 is a nice default). This was the rationale
behind the choice of cpu #1 in the first place.

> > It seems that our default (CPU1) is fine.
>
> I think it’s safe enough.
> Numbers above (and I checked the same on ppc with similar pattern) are for a
> reasonably empty system. We can get a different picture when vdsm is busy. In
> general I think it’s indeed best to use the second online CPU for vdsm and
> all CPUs for child processes

Agreed, except for cases like bz1286462, but let's discuss this on gerrit/bz.

> regarding exposing to users in UI - I think that’s way too low level.
> vdsm.conf is good enough

Agreed. This is one thing that "just works".

Bests,

--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani
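For readers unfamiliar with the `cpu_affinity` option quoted above, here is a rough sketch of how a value in that comma-separated format could be parsed. The function name and the choice of `frozenset` are assumptions for illustration; this is not the parser vdsm ships, and it does not address bz1286462.

```python
def parse_cpu_affinity(value):
    """Parse a comma-separated CPU whitelist such as "0,2,3".

    Hypothetical helper, not vdsm code. An empty string means "no
    pinning" and yields an empty frozenset, mirroring the documented
    default of "".
    """
    value = value.strip()
    if not value:
        return frozenset()
    cpus = set()
    for item in value.split(","):
        cpu = int(item.strip())  # raises ValueError on non-numeric input
        if cpu < 0:
            raise ValueError("negative CPU index: %d" % cpu)
        cpus.add(cpu)
    return frozenset(cpus)
```

With the documented examples, `parse_cpu_affinity("1")` gives a one-element set and `parse_cpu_affinity("0,2,3")` gives the three listed cores.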
Re: [ovirt-devel] [vdsm] strange network test failure on FC23
Adding Dan, Ido

On Fri, Nov 27, 2015 at 8:09 PM, David Caro wrote:
>
> I see though that is leaving a bunch of test interfaces in the slave:
>
> 2753: vdsmtest-gNhf3: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 86:73:13:4c:e2:63 brd ff:ff:ff:ff:ff:ff
> 2767: vdsmtest-aX5So: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 9e:fa:75:3e:a3:e6 brd ff:ff:ff:ff:ff:ff
> 2768: vdsmtest-crso1: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether 22:ce:cb:5c:42:3b brd ff:ff:ff:ff:ff:ff
> 2772: vdsmtest-JDc5P: mtu 1500 qdisc noqueue state UNKNOWN group default
>     link/ether ae:79:dc:e9:22:9a brd ff:ff:ff:ff:ff:ff
>
> Can we do a cleanup in the tests and remove those? That might collide with
> other tests and create failures.
>
> On 11/27 19:07, David Caro wrote:
>>
>> I'm retriggering on another slave to see if it fails there too, might be env
>> related. But that would not discard the race issue (as it might be just a
>> coincidence or the slave a bit faster)
>>
>> On 11/27 11:55, Francesco Romani wrote:
>> > Hi,
>> >
>> > Jenkins doesn't like my (trivial) https://gerrit.ovirt.org/#/c/49271/
>> >
>> > which is about moving one log line (!).
>> >
>> > The failure is
>> >
>> > 00:08:54.680 ==
>> > 00:08:54.680 FAIL: testEnablePromisc (ipwrapperTests.TestDrvinfo)
>> > 00:08:54.680 --
>> > 00:08:54.680 Traceback (most recent call last):
>> > 00:08:54.680   File "/home/jenkins/workspace/vdsm_master_check-patch-fc23-x86_64/vdsm/tests/ipwrapperTests.py", line 130, in testEnablePromisc
>> > 00:08:54.680     "Could not enable promiscuous mode.")
>> > 00:08:54.680 AssertionError: Could not enable promiscuous mode.
>> > 00:08:54.680 >> begin captured logging <<
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /usr/sbin/brctl show (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS: = ''; = 0
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /sbin/ip link add name vdsm-HIRjJp type bridge (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS: = ''; = 0
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /sbin/ip link set dev vdsm-HIRjJp up (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS: = ''; = 0
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /sbin/ip link set dev vdsm-HIRjJp promisc on (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS: = ''; = 0
>> > 00:08:54.680 - >> end captured logging << -
>> >
>> > Here in fullest:
>> > http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc23-x86_64/638/console
>> >
>> > The command like looks OK, and can't think any reason it could fail,
>> > except startup race.
>> > Using taskset, the ip command now takes a little longer to complete.
>> >
>> > Maybe -just wild guessing- the code isn't properly waiting for the command
>> > to complete?
>> > Otherwise not the slightest clue :)
>> >
>> > Bests,
>> >
>> > --
>> > Francesco Romani
>> > RedHat Engineering Virtualization R & D
>> > Phone: 8261328
>> > IRC: fromani
>> > ___
>> > Infra mailing list
>> > Infra@ovirt.org
>> > http://lists.ovirt.org/mailman/listinfo/infra
>>
>> --
>> David Caro
>>
>> Red Hat S.L.
>> Continuous Integration Engineer - EMEA ENG Virtualization R
>>
>> Tel.: +420 532 294 605
>> Email: dc...@redhat.com
>> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> Web: www.redhat.com
>> RHT Global #: 82-62605
>
> --
> David Caro
>
> Red Hat S.L.
> Continuous Integration Engineer - EMEA ENG Virtualization R
>
> Tel.: +420 532 294 605
> Email: dc...@redhat.com
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> Web: www.redhat.com
> RHT Global #: 82-62605
>
> ___
> Devel mailing list
> de...@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel
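If the "startup race" guess in the quoted report were right, one common way to make such a test tolerant is to poll for the expected state with a deadline instead of asserting immediately. The helper below is only a sketch of that idea under that assumption: `wait_for` is a hypothetical name, it is written for Python 3 (`time.monotonic`), and nothing in this thread confirms that the failure is a race or that vdsm's tests use anything like it.

```python
import time


def wait_for(predicate, timeout=2.0, interval=0.05):
    """Poll predicate() until it returns a true value or timeout expires.

    Returns True if the predicate became true in time, False otherwise.
    The predicate is always evaluated at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A test would then pass a callable that checks the interface's promiscuous flag and assert on the return value of `wait_for`, so a slow command only delays the test rather than failing it.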
Re: [ovirt-devel] [vdsm] strange network test failure on FC23
On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul wrote:
>
> On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani wrote:
>>
>> Using taskset, the ip command now takes a little longer to complete.
>
> Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> just use 0x3, as the man suggests) might be a tiny of a fraction faster to
> execute taskset with, instead of the need to translate the numeric CPU list.

Creating the string "0-" is one line in vdsm. The code handling this in
taskset is written in C, so the parsing time is practically zero. Even if
it were non-zero, this code runs once when we run a child process, so the
cost is insignificant.

> However, the real concern is making sure CPUs 0 & 1 are not really too busy
> with stuff (including interrupt handling, etc.)

This code is used when we run a child process, to allow the child process
to run on all cpus (in this case, cpu 0 and cpu 1). So I think there is no
concern here.

Vdsm itself runs by default on cpu 1, which should be less busy than cpu 0.

The user can modify this configuration on the host. I guess we need to
expose this on the engine side (cluster setting?).

Also, if vdsm is pinned to a certain cpu, should the user get a warning when
trying to pin a vm to this cpu?

Michal, what do you think?

Nir
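To make the two taskset forms under discussion concrete, here is a small sketch. It assumes the CPU-list form matches what the Jenkins log earlier in the thread shows (`/usr/bin/taskset --cpu-list 0-1 ...`); both function names are hypothetical and this is not vdsm's actual code. The first helper computes the equivalent bitmask (CPUs 0 and 1 give 0x3), the second builds the wrapped command line.

```python
def cpu_list_to_mask(cpus):
    """Return the hex affinity mask for an iterable of CPU indexes."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("negative CPU index: %d" % cpu)
        mask |= 1 << cpu  # bit N set means CPU N is allowed
    return hex(mask)


def taskset_cmd(cmd, online_cpus):
    """Prefix cmd so the child may run on CPUs 0..online_cpus-1.

    Hypothetical helper; the taskset path and --cpu-list usage are
    copied from the log in this thread.
    """
    if online_cpus < 1:
        raise ValueError("need at least one online CPU")
    cpu_list = "0-%d" % (online_cpus - 1)
    return ["/usr/bin/taskset", "--cpu-list", cpu_list] + list(cmd)
```

Either form ends up as the same affinity set once taskset has parsed it, which is why the choice between list and mask is about readability of the config rather than run time.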
Re: [ovirt-devel] [vdsm] strange network test failure on FC23
On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer wrote:
> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul wrote:
> >
> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani wrote:
> >>
> >> Using taskset, the ip command now takes a little longer to complete.
> >
> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> > just use 0x3, as the man suggests) might be a tiny of a fraction faster to
> > execute taskset with, instead of the need to translate the numeric CPU list.
>
> Creating the string "0-" is one line in vdsm. The code handling this in
> taskset is written in C, so the parsing time is practically zero. Even
> if it was non-zero, this code run once when we run a child process, so
> the cost is insignificant.

I think it's easier just to have it as a mask in a config item somewhere,
without the need to create it or parse it anywhere.
For us and for the user.

> > However, the real concern is making sure CPUs 0 & 1 are not really too busy
> > with stuff (including interrupt handling, etc.)
>
> This code is used when we run a child process, to allow the child
> process to run on all cpus (in this case, cpu 0 and cpu 1). So I think
> there is no concern here.
>
> Vdsm itself is running by default on cpu 1, which should be less busy
> then cpu 0.

I assume those are cores, which in a multi-socket machine will probably be
in the first socket only.
There's a good chance that the FC and/or network cards will also bind their
interrupts to core 0 and core 1 (check /proc/interrupts) on the same socket.
From my poor laptop (1s, 4c):

 42:    1487104       9329       4042       3598  IR-PCI-MSI 512000-edge   :00:1f.2
(my SATA controller)

 43:   14664923         34         18         13  IR-PCI-MSI 327680-edge   xhci_hcd
(my dock station connector)

 45:    6754579       4437       2501       2419  IR-PCI-MSI 32768-edge    i915
(GPU)

 47:     187409      11627       1235       1259  IR-PCI-MSI 2097152-edge  iwlwifi
(NIC, wifi)

Y.

> The user can modify this configuration on the host, I guess we need to
> expose this on the engine side (cluster setting?).
>
> Also if vdsm is pinned to certain cpu, should user get a warning
> trying to pin a vm to this cpu?
>
> Michal, what do you think?
>
> Nir
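To compare cores the way the /proc/interrupts excerpt above invites, one can sum each CPU column. The sketch below does that for text in the usual /proc/interrupts layout (a header row of CPU names, then `label: count count ... description`). The function name is made up for illustration, and rows without a full set of numeric per-CPU counts are skipped.

```python
def per_cpu_interrupt_totals(text):
    """Sum interrupt counts per CPU from /proc/interrupts-style text.

    Returns a list with one total per CPU column in the header.
    """
    lines = text.splitlines()
    if not lines:
        return []
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    totals = [0] * ncpus
    for line in lines[1:]:
        _label, sep, rest = line.partition(":")
        if not sep:
            continue
        counts = rest.split()[:ncpus]
        # Some rows (e.g. ERR, MIS) carry a single total, not per-CPU counts.
        if len(counts) < ncpus or not all(c.isdigit() for c in counts):
            continue
        for index, count in enumerate(counts):
            totals[index] += int(count)
    return totals
```

On a live host this could be fed with the contents of /proc/interrupts; a first column that dwarfs the others is the pattern that motivates keeping vdsm off CPU 0.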
Re: [ovirt-devel] [vdsm] strange network test failure on FC23
On Sun, Nov 29, 2015 at 6:01 PM, Yaniv Kaul wrote:
> On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer wrote:
>>
>> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul wrote:
>> >
>> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani wrote:
>> >>
>> >> Using taskset, the ip command now takes a little longer to complete.
>> >
>> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
>> > just use 0x3, as the man suggests) might be a tiny of a fraction faster to
>> > execute taskset with, instead of the need to translate the numeric CPU list.
>>
>> Creating the string "0-" is one line in vdsm. The code handling this in
>> taskset is written in C, so the parsing time is practically zero. Even
>> if it was non-zero, this code run once when we run a child process, so
>> the cost is insignificant.
>
> I think it's easier to just to have it as a mask in a config item somewhere,
> without need to create it or parse it anywhere.
> For us and for the user.

We have this option in /etc/vdsm/vdsm.conf:

# Comma separated whitelist of CPU cores on which VDSM is allowed to
# run. The default is "", meaning VDSM can be scheduled by the OS to
# run on any core. Valid examples: "1", "0,1", "0,2,3"
# cpu_affinity = 1

I think this is the easiest option for users.

>> > However, the real concern is making sure CPUs 0 & 1 are not really too busy
>> > with stuff (including interrupt handling, etc.)
>>
>> This code is used when we run a child process, to allow the child
>> process to run on all cpus (in this case, cpu 0 and cpu 1). So I think
>> there is no concern here.
>>
>> Vdsm itself is running by default on cpu 1, which should be less busy
>> then cpu 0.
>
> I assume those are cores, which probably in a multi-socket will be in the
> first socket only.
> There's a good chance that the FC and or network/cards will also bind their
> interrupts to core0 & core 1 (check /proc/interrupts) on the same socket.
> From my poor laptop (1s, 4c):
>  42:    1487104       9329       4042       3598  IR-PCI-MSI 512000-edge   :00:1f.2
> (my SATA controller)
>
>  43:   14664923         34         18         13  IR-PCI-MSI 327680-edge   xhci_hcd
> (my dock station connector)
>
>  45:    6754579       4437       2501       2419  IR-PCI-MSI 32768-edge    i915
> (GPU)
>
>  47:     187409      11627       1235       1259  IR-PCI-MSI 2097152-edge  iwlwifi
> (NIC, wifi)

Interesting. Here is an example from an 8-core machine running my vms:

[nsoffer@jumbo ~]$ cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
  0:         31          0          0          0          0          0          0          0  IR-IO-APIC-edge timer
  1:          2          0          0          1          0          0          0          0  IR-IO-APIC-edge i8042
  8:          0          0          0          0          0          0          0          1  IR-IO-APIC-edge rtc0
  9:          0          0          0          0          0          0          0          0  IR-IO-APIC-fasteoi acpi
 12:          3          0          0          0          0          0          1          0  IR-IO-APIC-edge i8042
 16:          4          4          9          0          9          1          1          3  IR-IO-APIC 16-fasteoi ehci_hcd:usb3
 23:         13          1          5          0         12          1          1          0  IR-IO-APIC 23-fasteoi ehci_hcd:usb4
 24:          0          0          0          0          0          0          0          0  DMAR_MSI-edge dmar0
 25:          0          0          0          0          0          0          0          0  DMAR_MSI-edge dmar1
 26: 36703542159062370491124 169 54  IR-PCI-MSI-edge :00:1f.2
 27:          0          0          0          0          0          0          0          0  IR-PCI-MSI-edge xhci_hcd
 28:  166285414          0          3          0          4          0          0          0  IR-PCI-MSI-edge em1
 29:         18          0          0          0          4          3          0          0  IR-PCI-MSI-edge mei_me
 30: 1151 17 0 3169 26 94  IR-PCI-MSI-edge snd_hda_intel
NMI: 2508 2296 2317 2356867918 912903  Non-maskable interrupts
LOC:  302996116  312923350  312295375  312089303   86282447   94046427   90847792   91761277  Local timer interrupts
SPU:          0          0          0          0          0          0          0          0  Spurious interrupts
PMI: 2508 2296 2317 2356867918 912903  Performance monitoring interrupts
IWI:          1          0          0          5          0          0          0          0  IRQ work interrupts
RTR: 0