Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-12-03 Thread Dan Kenigsberg
On Fri, Nov 27, 2015 at 07:09:30PM +0100, David Caro wrote:
> 
> I see, though, that it is leaving a bunch of test interfaces on the slave:
> 
> 
> 2753: vdsmtest-gNhf3:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default 
> link/ether 86:73:13:4c:e2:63 brd ff:ff:ff:ff:ff:ff
> 2767: vdsmtest-aX5So:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default 
> link/ether 9e:fa:75:3e:a3:e6 brd ff:ff:ff:ff:ff:ff
> 2768: vdsmtest-crso1:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default 
> link/ether 22:ce:cb:5c:42:3b brd ff:ff:ff:ff:ff:ff
> 2772: vdsmtest-JDc5P:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default 
> link/ether ae:79:dc:e9:22:9a brd ff:ff:ff:ff:ff:ff
> 
> 
> 
> Can we do a cleanup in the tests and remove those? Those might collide with
> other tests and create failures.

These bridges are no longer created on master (thanks to Nir's
http://gerrit.ovirt.org/44111).
They should have been removed by the run_tests that created them, but
this may not happen if run_tests is killed (or dies) beforehand.
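For slaves that already carry such leftovers, a cleanup along these lines could help (a sketch, not vdsm code: the `vdsmtest-` prefix and the `ip -o link show` line format are taken from the listing above, and actually deleting links needs root, so this only prints the `ip link del` commands):

```python
import re

VDSM_TEST_PREFIX = "vdsmtest-"

def leftover_test_links(ip_link_output):
    """Pick interface names starting with the test prefix out of
    `ip -o link show` output (one interface per line)."""
    names = []
    for line in ip_link_output.splitlines():
        # Lines look like: "2753: vdsmtest-gNhf3: <...> mtu 1500 ..."
        m = re.match(r"\s*\d+:\s+([^:@\s]+)", line)
        if m and m.group(1).startswith(VDSM_TEST_PREFIX):
            names.append(m.group(1))
    return names

SAMPLE = """\
2753: vdsmtest-gNhf3: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state UNKNOWN
2754: em1: <BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state UP
2767: vdsmtest-aX5So: <BROADCAST,MULTICAST> mtu 1500 qdisc noqueue state UNKNOWN
"""

for name in leftover_test_links(SAMPLE):
    print("ip link del", name)  # run via subprocess as root to actually delete
```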
___
Infra mailing list
Infra@ovirt.org
http://lists.ovirt.org/mailman/listinfo/infra


Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-11-30 Thread Michal Skrivanek

> On 29 Nov 2015, at 17:34, Nir Soffer  wrote:
> 
> On Sun, Nov 29, 2015 at 6:01 PM, Yaniv Kaul  wrote:
> > On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer  wrote:
> >>
> >> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul  wrote:
> >> >
> >> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani 
> >> > wrote:
> >> >>
> >> >> Using taskset, the ip command now takes a little longer to complete.

I fail to find the original reference for this.
Why does it take longer? Is it purely the additional taskset executable
invocation? On a busy system we have these issues all the time, with lvm,
etc… so I don’t think it’s significant
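A rough way to put a number on that extra exec (a sketch, not a benchmark of vdsm itself: `/usr/bin/env` stands in for the taskset wrapper, since the measured cost is the same extra fork+exec in front of the child):

```python
import subprocess
import sys
import time

def avg_run_time(argv, runs=10):
    """Average wall-clock time to spawn argv and wait for it to exit."""
    start = time.monotonic()
    for _ in range(runs):
        subprocess.check_call(argv)
    return (time.monotonic() - start) / runs

child = [sys.executable, "-c", "pass"]            # a trivial child process
direct = avg_run_time(child)                      # run the child directly
wrapped = avg_run_time(["/usr/bin/env"] + child)  # one extra exec in front
print("direct: %.2f ms, wrapped: %.2f ms" % (direct * 1e3, wrapped * 1e3))
```

On an idle box the difference between the two averages is typically well under a millisecond, which supports the point that the wrapper cost is lost in the noise.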


> >> >
> >> >
> >> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> >> > just use 0x3, as the man suggests) might be a tiny fraction faster to
> >> > execute taskset with, instead of the need to translate the numeric CPU
> >> > list.
> >>
> >> Creating the string "0-" is one line in vdsm. The code handling this in
> >> taskset is written in C, so the parsing time is practically zero. Even
> >> if it was non-zero, this code runs once when we run a child process, so
> >> the cost is insignificant.
> >
> >
> > I think it's easier to just have it as a mask in a config item somewhere,
> > without the need to create it or parse it anywhere.
> > For us and for the user.
> 
> We have this option in /etc/vdsm/vdsm.conf:
> 
> # Comma separated whitelist of CPU cores on which VDSM is allowed to
> # run. The default is "", meaning VDSM can be scheduled by  the OS to
> # run on any core. Valid examples: "1", "0,1", "0,2,3"
> # cpu_affinity = 1
> 
> I think this is the easiest option for users.

+1

> 
> >> > However, the real concern is making sure CPUs 0 & 1 are not really too
> >> > busy
> >> > with stuff (including interrupt handling, etc.)
> >>
> >> This code is used when we run a child process, to allow the child
> >> process to run on
> >> all cpus (in this case, cpu 0 and cpu 1). So I think there is no concern
> >> here.
> >>
> >> Vdsm itself is running by default on cpu 1, which should be less busy
> >> than cpu 0.
> >
> >
> > I assume those are cores, which probably in a multi-socket will be in the
> > first socket only.
> > There's a good chance that the FC and/or network cards will also bind their
> > interrupts to core 0 & core 1 (check /proc/interrupts) on the same socket.
> > From my poor laptop (1s, 4c):
> > 42:1487104   9329   4042   3598  IR-PCI-MSI 512000-edge
> > :00:1f.2
> >
> > (my SATA controller)
> >
> > 43:   14664923 34 18 13  IR-PCI-MSI 327680-edge
> > xhci_hcd
> > (my dock station connector)
> >
> > 45:6754579   4437   2501   2419  IR-PCI-MSI 32768-edge
> > i915
> > (GPU)
> >
> > 47: 187409  11627   1235   1259  IR-PCI-MSI 2097152-edge
> > iwlwifi
> > (NIC, wifi)
> 
> Interesting, here an example from a 8 cores machine running my vms:
> 
> [nsoffer@jumbo ~]$ cat /proc/interrupts 
>CPU0   CPU1   CPU2   CPU3   CPU4   CPU5   
> CPU6   CPU7   
>   0: 31  0  0  0  0  0
>   0  0  IR-IO-APIC-edge  timer
>   1:  2  0  0  1  0  0
>   0  0  IR-IO-APIC-edge  i8042
>   8:  0  0  0  0  0  0
>   0  1  IR-IO-APIC-edge  rtc0
>   9:  0  0  0  0  0  0
>   0  0  IR-IO-APIC-fasteoi   acpi
>  12:  3  0  0  0  0  0
>   1  0  IR-IO-APIC-edge  i8042
>  16:  4  4  9  0  9  1
>   1  3  IR-IO-APIC  16-fasteoi   ehci_hcd:usb3
>  23: 13  1  5  0 12  1
>   1  0  IR-IO-APIC  23-fasteoi   ehci_hcd:usb4
>  24:  0  0  0  0  0  0
>   0  0  DMAR_MSI-edge  dmar0
>  25:  0  0  0  0  0  0
>   0  0  DMAR_MSI-edge  dmar1
>  26:   36703542159062370491124
> 169 54  IR-PCI-MSI-edge  :00:1f.2
>  27:  0  0  0  0  0  0
>   0  0  IR-PCI-MSI-edge  xhci_hcd
>  28:  166285414  0  3  0  4  0
>   0  0  IR-PCI-MSI-edge  em1
>  29: 18  0  0  0  4  3
>   0  0  IR-PCI-MSI-edge  mei_me
>  30:  1151 17  0  3169
>  26 94  

Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-11-30 Thread Francesco Romani
- Original Message -
> From: "Michal Skrivanek" <mskri...@redhat.com>
> To: "Nir Soffer" <nsof...@redhat.com>, "Francesco Romani" <from...@redhat.com>
> Cc: "Yaniv Kaul" <yk...@redhat.com>, "infra" <infra@ovirt.org>, "devel" 
> <de...@ovirt.org>
> Sent: Monday, November 30, 2015 9:52:59 AM
> Subject: Re: [ovirt-devel] [vdsm] strange network test failure on FC23
> 
> 
> > On 29 Nov 2015, at 17:34, Nir Soffer <nsof...@redhat.com> wrote:
> > 
> > On Sun, Nov 29, 2015 at 6:01 PM, Yaniv Kaul <yk...@redhat.com> wrote:
> > > On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer <nsof...@redhat.com> wrote:
> > >>
> > >> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul <yk...@redhat.com> wrote:
> > >> >
> > >> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani <from...@redhat.com>
> > >> > wrote:
> > >> >>
> > >> >> Using taskset, the ip command now takes a little longer to complete.
> 
> I fail to find the original reference for this.
> Why does it take longer? Is it purely the additional taskset executable
> invocation? On a busy system we have these issues all the time, with lvm,
> etc… so I don’t think it’s significant

Yep, that's only the overhead of the taskset executable.
 
> > >> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> > >> > just use 0x3, as the man suggests) might be a tiny fraction faster to
> > >> > execute taskset with, instead of the need to translate the numeric CPU
> > >> > list.
> > >>
> > >> Creating the string "0-" is one line in vdsm. The code handling this in
> > >> taskset is written in C, so the parsing time is practically zero. Even
> > >> if it was non-zero, this code runs once when we run a child process, so
> > >> the cost is insignificant.
> > >
> > >
> > > I think it's easier to just have it as a mask in a config item somewhere,
> > > without the need to create it or parse it anywhere.
> > > For us and for the user.
> > 
> > We have this option in /etc/vdsm/vdsm.conf:
> > 
> > # Comma separated whitelist of CPU cores on which VDSM is allowed to
> > # run. The default is "", meaning VDSM can be scheduled by  the OS to
> > # run on any core. Valid examples: "1", "0,1", "0,2,3"
> > # cpu_affinity = 1
> > 
> > I think this is the easiest option for users.
> 
> +1

+1, modulo the changes we need to fix 
https://bugzilla.redhat.com/show_bug.cgi?id=1286462
(patch is coming)
 
> > > I assume those are cores, which probably in a multi-socket will be in the
> > > first socket only.
> > > There's a good chance that the FC and/or network cards will also bind their
> > > interrupts to core 0 & core 1 (check /proc/interrupts) on the same socket.
> > > From my poor laptop (1s, 4c):

Yes, especially core0 (since 0 is the usual default). This was the rationale behind
the choice of cpu #1 in the first place.

> > It seems that our default (CPU1) is fine.
> 
> I think it’s safe enough.
> Numbers above (and I checked the same on ppc with a similar pattern) are for a
> reasonably empty system. We can get a different picture when vdsm is busy. In
> general I think it’s indeed best to use the second online CPU for vdsm and
> all CPUs for child processes

Agreed - except for cases like bz1286462 - but let's discuss this on gerrit/bz

> regarding exposing to users in UI - I think that’s way too low level.
> vdsm.conf is good enough

Agreed. This is one thing that "just works".

Bests,

-- 
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani


Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-11-29 Thread Nir Soffer
Adding Dan, Ido

On Fri, Nov 27, 2015 at 8:09 PM, David Caro  wrote:
>
> I see, though, that it is leaving a bunch of test interfaces on the slave:
>
>
> 2753: vdsmtest-gNhf3:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default
> link/ether 86:73:13:4c:e2:63 brd ff:ff:ff:ff:ff:ff
> 2767: vdsmtest-aX5So:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default
> link/ether 9e:fa:75:3e:a3:e6 brd ff:ff:ff:ff:ff:ff
> 2768: vdsmtest-crso1:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default
> link/ether 22:ce:cb:5c:42:3b brd ff:ff:ff:ff:ff:ff
> 2772: vdsmtest-JDc5P:  mtu 1500 qdisc 
> noqueue state UNKNOWN group default
> link/ether ae:79:dc:e9:22:9a brd ff:ff:ff:ff:ff:ff
>
>
>
> Can we do a cleanup in the tests and remove those? Those might collide with
> other tests and create failures.
>
> On 11/27 19:07, David Caro wrote:
>>
>> I'm retriggering on another slave to see if it fails there too; it might be env
>> related. But that would not rule out the race issue (as it might be just a
>> coincidence, or the slave a bit faster)
>>
>> On 11/27 11:55, Francesco Romani wrote:
>> > Hi,
>> >
>> > Jenkins doesn't like my (trivial) https://gerrit.ovirt.org/#/c/49271/
>> >
>> > which is about moving one log line (!).
>> >
>> > The failure is
>> >
>> > 00:08:54.680 
>> > ==
>> > 00:08:54.680 FAIL: testEnablePromisc (ipwrapperTests.TestDrvinfo)
>> > 00:08:54.680 
>> > --
>> > 00:08:54.680 Traceback (most recent call last):
>> > 00:08:54.680   File 
>> > "/home/jenkins/workspace/vdsm_master_check-patch-fc23-x86_64/vdsm/tests/ipwrapperTests.py",
>> >  line 130, in testEnablePromisc
>> > 00:08:54.680 "Could not enable promiscuous mode.")
>> > 00:08:54.680 AssertionError: Could not enable promiscuous mode.
>> > 00:08:54.680  >> begin captured logging << 
>> > 
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /usr/sbin/brctl 
>> > show (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS:  = '';  = 0
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /sbin/ip link 
>> > add name vdsm-HIRjJp type bridge (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS:  = '';  = 0
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /sbin/ip link 
>> > set dev vdsm-HIRjJp up (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS:  = '';  = 0
>> > 00:08:54.680 root: DEBUG: /usr/bin/taskset --cpu-list 0-1 /sbin/ip link 
>> > set dev vdsm-HIRjJp promisc on (cwd None)
>> > 00:08:54.680 root: DEBUG: SUCCESS:  = '';  = 0
>> > 00:08:54.680 - >> end captured logging << 
>> > -
>> >
>> >
>> > Here in fullest: 
>> > http://jenkins.ovirt.org/job/vdsm_master_check-patch-fc23-x86_64/638/console
>> >
>> > The command line looks OK, and I can't think of any reason it could fail,
>> > except a startup race.
>> > Using taskset, the ip command now takes a little longer to complete.
>> >
>> > Maybe -just wild guessing- the code isn't properly waiting for the command 
>> > to complete?
>> > Otherwise not the slightest clue :)
>> >
>> > Bests,
>> >
>> > --
>> > Francesco Romani
>> > RedHat Engineering Virtualization R & D
>> > Phone: 8261328
>> > IRC: fromani
>> > ___
>> > Infra mailing list
>> > Infra@ovirt.org
>> > http://lists.ovirt.org/mailman/listinfo/infra
>>
>> --
>> David Caro
>>
>> Red Hat S.L.
>> Continuous Integration Engineer - EMEA ENG Virtualization R
>>
>> Tel.: +420 532 294 605
>> Email: dc...@redhat.com
>> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
>> Web: www.redhat.com
>> RHT Global #: 82-62605
>
>
>
> --
> David Caro
>
> Red Hat S.L.
> Continuous Integration Engineer - EMEA ENG Virtualization R
>
> Tel.: +420 532 294 605
> Email: dc...@redhat.com
> IRC: dcaro|dcaroest@{freenode|oftc|redhat}
> Web: www.redhat.com
> RHT Global #: 82-62605
>
> ___
> Devel mailing list
> de...@ovirt.org
> http://lists.ovirt.org/mailman/listinfo/devel


Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-11-29 Thread Nir Soffer
On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul  wrote:
>
> On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani 
> wrote:
>>
>> Using taskset, the ip command now takes a little longer to complete.
>
>
> Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> just use 0x3, as the man suggests) might be a tiny fraction faster to
> execute taskset with, instead of the need to translate the numeric CPU list.

Creating the string "0-" is one line in vdsm. The code handling this in
taskset is written in C, so the parsing time is practically zero. Even
if it was non-zero, this code runs once when we run a child process, so
the cost is insignificant.
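For illustration, the one-liner being discussed amounts to something like this (a sketch, not vdsm's actual code; the helper names are made up, and the invocation mirrors the `/usr/bin/taskset --cpu-list 0-1` calls seen in the test logs):

```python
import os

def all_cpus_list():
    """Build the --cpu-list value that resets a child to every cpu,
    e.g. "0-7" on an 8-cpu host."""
    last = os.sysconf("SC_NPROCESSORS_CONF") - 1
    return "0" if last == 0 else "0-%d" % last

def wrap_with_taskset(argv, cpu_list):
    """Prefix a child command so the kernel may schedule it on cpu_list."""
    return ["/usr/bin/taskset", "--cpu-list", cpu_list] + list(argv)

print(wrap_with_taskset(["/sbin/ip", "link", "show"], all_cpus_list()))
```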

> However, the real concern is making sure CPUs 0 & 1 are not really too busy
> with stuff (including interrupt handling, etc.)

This code is used when we run a child process, to allow the child
process to run on
all cpus (in this case, cpu 0 and cpu 1). So I think there is no concern here.

Vdsm itself is running by default on cpu 1, which should be less busy
than cpu 0.

The user can modify this configuration on the host; I guess we need to
expose this on the engine side (cluster setting?).

Also, if vdsm is pinned to a certain cpu, should the user get a warning
when trying to pin a vm to this cpu?

Michal, what do you think?

Nir


Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-11-29 Thread Yaniv Kaul
On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer  wrote:

> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul  wrote:
> >
> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani 
> > wrote:
> >>
> >> Using taskset, the ip command now takes a little longer to complete.
> >
> >
> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
> > just use 0x3, as the man suggests) might be a tiny fraction faster to
> > execute taskset with, instead of the need to translate the numeric CPU list.
>
> Creating the string "0-" is one line in vdsm. The code handling this in
> taskset is written in C, so the parsing time is practically zero. Even
> if it was non-zero, this code runs once when we run a child process, so
> the cost is insignificant.
>

I think it's easier to just have it as a mask in a config item
somewhere, without the need to create it or parse it anywhere.
For us and for the user.
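To make the two notations concrete (a sketch; taskset accepts either a hex mask like `0x3` or, with `--cpu-list`, a list like `0,1`, and converting between them is trivial either way):

```python
def cpus_to_mask(cpus):
    """Affinity mask for a set of cpu indices: [0, 1] -> "0x3"."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return hex(mask)

def mask_to_cpus(mask):
    """Inverse: "0x3" -> [0, 1]."""
    value = int(mask, 16)
    return [i for i in range(value.bit_length()) if value >> i & 1]

print(cpus_to_mask([0, 1]))   # "0x3", usable directly as `taskset 0x3 <cmd>`
print(mask_to_cpus("0xd"))    # [0, 2, 3]
```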


> > However, the real concern is making sure CPUs 0 & 1 are not really too
> busy
> > with stuff (including interrupt handling, etc.)
>
> This code is used when we run a child process, to allow the child
> process to run on
> all cpus (in this case, cpu 0 and cpu 1). So I think there is no concern
> here.
>
> Vdsm itself is running by default on cpu 1, which should be less busy
> than cpu 0.
>

I assume those are cores, which probably in a multi-socket will be in the
first socket only.
There's a good chance that the FC and/or network cards will also bind their
interrupts to core 0 & core 1 (check /proc/interrupts) on the same socket.
From my poor laptop (1s, 4c):
42:1487104   9329   4042   3598  IR-PCI-MSI 512000-edge
 :00:1f.2

(my SATA controller)

43:   14664923 34 18 13  IR-PCI-MSI 327680-edge
 xhci_hcd
(my dock station connector)

45:6754579   4437   2501   2419  IR-PCI-MSI 32768-edge
 i915
(GPU)

47: 187409  11627   1235   1259  IR-PCI-MSI 2097152-edge
   iwlwifi
(NIC, wifi)
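The skew toward the first cores is easy to quantify; a sketch that sums the per-CPU columns of /proc/interrupts-style text (on a real host, feed it the contents of /proc/interrupts; the sample rows below are adapted from the laptop listing above):

```python
def per_cpu_totals(interrupts_text, ncpus):
    """Sum the per-CPU columns of /proc/interrupts to see how skewed
    interrupt handling is toward the first cores."""
    totals = [0] * ncpus
    for line in interrupts_text.splitlines():
        fields = line.split()
        # keep only numbered IRQ rows; skip the CPUn header and NMI/LOC/... rows
        if not fields or not fields[0].rstrip(":").isdigit():
            continue
        totals = [t + int(f) for t, f in zip(totals, fields[1:1 + ncpus])]
    return totals

SAMPLE = """\
            CPU0       CPU1       CPU2       CPU3
 42:     1487104       9329       4042       3598   IR-PCI-MSI 512000-edge
 45:     6754579       4437       2501       2419   IR-PCI-MSI 32768-edge i915
 47:      187409      11627       1235       1259   IR-PCI-MSI 2097152-edge iwlwifi
"""
print(per_cpu_totals(SAMPLE, 4))  # CPU0 dominates by three orders of magnitude
```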

Y.



> The user can modify this configuration on the host, I guess we need to
> expose this
> on the engine side (cluster setting?).
>
> Also if vdsm is pinned to certain cpu, should user get a warning
> trying to pin a vm
> to this cpu?
>
> Michal, what do you think?
>
> Nir
>


Re: [ovirt-devel] [vdsm] strange network test failure on FC23

2015-11-29 Thread Nir Soffer
On Sun, Nov 29, 2015 at 6:01 PM, Yaniv Kaul  wrote:
> On Sun, Nov 29, 2015 at 5:37 PM, Nir Soffer  wrote:
>>
>> On Sun, Nov 29, 2015 at 10:37 AM, Yaniv Kaul  wrote:
>> >
>> > On Fri, Nov 27, 2015 at 6:55 PM, Francesco Romani 
>> > wrote:
>> >>
>> >> Using taskset, the ip command now takes a little longer to complete.
>> >
>> >
>> > Since we always use the same set of CPUs, I assume using a mask (for 0 & 1,
>> > just use 0x3, as the man suggests) might be a tiny fraction faster to
>> > execute taskset with, instead of the need to translate the numeric CPU
>> > list.
>>
>> Creating the string "0-" is one line in vdsm. The code handling this in
>> taskset is written in C, so the parsing time is practically zero. Even
>> if it was non-zero, this code runs once when we run a child process, so
>> the cost is insignificant.
>
>
> I think it's easier to just have it as a mask in a config item somewhere,
> without the need to create it or parse it anywhere.
> For us and for the user.

We have this option in /etc/vdsm/vdsm.conf:

# Comma separated whitelist of CPU cores on which VDSM is allowed to
# run. The default is "", meaning VDSM can be scheduled by  the OS to
# run on any core. Valid examples: "1", "0,1", "0,2,3"
# cpu_affinity = 1

I think this is the easiest option for users.
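A sketch of how such an option could be consumed (illustrative only; the `[vars]` section name and the helper are assumptions, not vdsm's actual config code):

```python
import configparser

SAMPLE_CONF = """\
[vars]
# Comma separated whitelist of CPU cores on which VDSM is allowed to run.
cpu_affinity = 0,2,3
"""

def parse_cpu_affinity(conf_text):
    cfg = configparser.ConfigParser()
    cfg.read_string(conf_text)
    raw = cfg.get("vars", "cpu_affinity", fallback="").strip()
    if not raw:
        return []  # "" means: let the OS schedule vdsm on any core
    return sorted(int(cpu) for cpu in raw.split(","))

print(parse_cpu_affinity(SAMPLE_CONF))  # [0, 2, 3]
```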

>> > However, the real concern is making sure CPUs 0 & 1 are not really too
>> > busy
>> > with stuff (including interrupt handling, etc.)
>>
>> This code is used when we run a child process, to allow the child
>> process to run on
>> all cpus (in this case, cpu 0 and cpu 1). So I think there is no concern
>> here.
>>
>> Vdsm itself is running by default on cpu 1, which should be less busy
>> than cpu 0.
>
>
> I assume those are cores, which probably in a multi-socket will be in the
> first socket only.
> There's a good chance that the FC and/or network cards will also bind their
> interrupts to core 0 & core 1 (check /proc/interrupts) on the same socket.
> From my poor laptop (1s, 4c):
> 42:1487104   9329   4042   3598  IR-PCI-MSI 512000-edge
> :00:1f.2
>
> (my SATA controller)
>
> 43:   14664923 34 18 13  IR-PCI-MSI 327680-edge
> xhci_hcd
> (my dock station connector)
>
> 45:6754579   4437   2501   2419  IR-PCI-MSI 32768-edge
> i915
> (GPU)
>
> 47: 187409  11627   1235   1259  IR-PCI-MSI 2097152-edge
> iwlwifi
> (NIC, wifi)

Interesting, here an example from a 8 cores machine running my vms:

[nsoffer@jumbo ~]$ cat /proc/interrupts
   CPU0   CPU1   CPU2   CPU3   CPU4   CPU5
  CPU6   CPU7
  0: 31  0  0  0  0  0
 0  0  IR-IO-APIC-edge  timer
  1:  2  0  0  1  0  0
 0  0  IR-IO-APIC-edge  i8042
  8:  0  0  0  0  0  0
 0  1  IR-IO-APIC-edge  rtc0
  9:  0  0  0  0  0  0
 0  0  IR-IO-APIC-fasteoi   acpi
 12:  3  0  0  0  0  0
 1  0  IR-IO-APIC-edge  i8042
 16:  4  4  9  0  9  1
 1  3  IR-IO-APIC  16-fasteoi   ehci_hcd:usb3
 23: 13  1  5  0 12  1
 1  0  IR-IO-APIC  23-fasteoi   ehci_hcd:usb4
 24:  0  0  0  0  0  0
 0  0  DMAR_MSI-edge  dmar0
 25:  0  0  0  0  0  0
 0  0  DMAR_MSI-edge  dmar1
 26:   36703542159062370491124
   169 54  IR-PCI-MSI-edge  :00:1f.2
 27:  0  0  0  0  0  0
 0  0  IR-PCI-MSI-edge  xhci_hcd
 28:  166285414  0  3  0  4  0
 0  0  IR-PCI-MSI-edge  em1
 29: 18  0  0  0  4  3
 0  0  IR-PCI-MSI-edge  mei_me
 30:  1151 17  0  3169
26 94  IR-PCI-MSI-edge  snd_hda_intel
NMI:   2508   2296   2317   2356867918
   912903   Non-maskable interrupts
LOC:  302996116  312923350  312295375  312089303   86282447   94046427
90847792   91761277   Local timer interrupts
SPU:  0  0  0  0  0  0
 0  0   Spurious interrupts
PMI:   2508   2296   2317   2356867918
   912903   Performance monitoring interrupts
IWI:  1  0  0  5  0  0
 0  0   IRQ work interrupts
RTR:  0