Re: [openstack-dev] [Trove] Trove-Gate timeouts
Hi Nikhil,

This is great! You may have assumed this, but I wanted to be explicit: to get the higher timeout, I think all that needs to be done is to upgrade the boot-hpcloud-vm plugin. The plugin will then pass the current timeout (7200 in rdjenkins) down to the net-ssh-simple library. (This is all based on the assumption that rdjenkins is running an older version of the plugin.)

Thanks,
Mat

On 2/17/14, 10:13 PM, Nikhil Manchanda <nik...@manchanda.me> wrote:
> Hi Mathew:
>
> Nice work identifying these issues with the current Jenkins integration tests! I'm looking into some of these issues at the moment and am attempting to ensure that the appropriate fixes are merged so that the Trove integration gate job is healthy again. Some of my comments are inline.
>
> Lowery, Mathew writes:
>> Hi all,
>>
>> Issue #1: Jobs that need more than one hour
>> [...]
>> Suggested action items:
>> * If it is acceptable to have jobs that run over one hour, then install the latest boot-hpcloud-vm plugin for Jenkins, which will make the operation timeout match the idle timeout.
>
> It turns out that some of the configuration-parameter changes that were checked in pushed the gate job above this timeout threshold. For the time being, I suggest we increase the timeout value of the gate job to accommodate this increase. However, we also need to take a closer look at our test suite to see how we might parallelize tests and speed things up.
>
>> Issue #2: The running time of all jobs is 1 hr 1 min
>> [...]
>> Suggested action items:
>> * Given that the minimum running time is one hour, I assume the problem is in the net-ssh-simple library. Needs more investigation.
>
> I took a brief look at this, and it seems like this might be a bug in the underlying plugin. I will try to contact Matty Rhodes, the author of the plugin, to take a look.
>
>> Issue #3: Jenkins console log line timestamps different between full and truncated views
>> [...]
>> Suggested action items:
>> * Upgrade the timestamper plugin (https://wiki.jenkins-ci.org/display/JENKINS/Timestamper).
>
> I've gone ahead and updated the timestamper plugin on the rdjenkins box, so this should no longer be an issue. Let me know if you find anything else.
>
> Cheers,
> -Nikhil

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
Re: [openstack-dev] [Trove] Trove-Gate timeouts
Hi Mat:

If I recall correctly, rdjenkins is actually running a version of the plugin specifically built by Matty for it (should be the latest). I'm not sure why it's still hitting the timeout issue mentioned earlier, so we'll have to check with Matty and try to figure out what the deal with that is.

Cheers,
-Nikhil
Re: [openstack-dev] [Trove] Trove-Gate timeouts
Trovesters,

One reason for the longer-running test is that, for configuration groups, I added the creation of a new instance. This tests that a new instance will be created with a configuration group applied. This might be making the run a little longer, but I am surprised that it's still taking over an hour to run through everything.

-Craig Vyvial

On Sun, Feb 16, 2014 at 12:25 AM, Mirantis <dmako...@mirantis.com> wrote:
> [...]
Re: [openstack-dev] [Trove] Trove-Gate timeouts
Hi, Craig.

Yes, I thought about the configuration test suites. For now, the core team should perhaps extend the gate running time. But for the tempest tests, I would suggest excluding some tests (the longest ones) from the 'gate' group. We need to deal with this ASAP, because the gate has been failing for four or five days.

Best regards,
Denis Makogon.
Sent from an iPad

On Mon, Feb 17, 2014 at 6:33 AM, Craig Vyvial <cp16...@gmail.com> wrote:
> [...]
[openstack-dev] [Trove] Trove-Gate timeouts
Hi all,

Issue #1: Jobs that need more than one hour

Of the last 30 Trove-Gate (https://rdjenkins.dyndns.org/job/Trove-Gate/) builds (spanning three days), 7 have failed due to a Jenkins job-level timeout (not a proboscis timeout). These jobs had no failed tests when the timeout occurred.

Not having access to the job config, I used the console output to guess what was going on. It appears that a Jenkins plugin named boot-hpcloud-vm (https://github.com/mrhoades/boot-hpcloud-vm/blob/2272770b0ce54752eabb84229dc8939d79b2be50/models/boot_vm_concurrent.rb#L181) is booting a VM and running the given commands, including redstack int-tests. According to the console output, it was supplied with ssh_shell_timeout=7200. This is passed down to another library called net-ssh-simple (https://github.com/busyloop/net-ssh-simple/blob/e3834f259a47606bfb06a487ca701fc20dbad8a5/lib/net/ssh/simple.rb#L632).

net-ssh-simple has two timeouts: an idle timeout and an operation timeout. In the latest boot-hpcloud-vm (https://github.com/mrhoades/boot-hpcloud-vm/blob/2272770b0ce54752eabb84229dc8939d79b2be50/models/boot_vm_concurrent.rb#L182), ssh_shell_timeout is passed down to net-ssh-simple for both the idle timeout and the operation timeout. But in older versions of boot-hpcloud-vm (https://github.com/mrhoades/boot-hpcloud-vm/blob/9260e957d6c54142c33dd9e9632b86e17fd5c02f/models/boot_vm_concurrent.rb#L141), ssh_shell_timeout is passed down to net-ssh-simple only for the idle timeout, leaving the default operation timeout of 3600 seconds. This is why I believe these jobs are failing after exactly one hour.
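To make the two-timeout behavior above concrete, here is a minimal Python model. It is a sketch under stated assumptions, not the Ruby code in boot-hpcloud-vm or net-ssh-simple: the function names and the `new_plugin` flag are hypothetical, while the 7200 s ssh_shell_timeout and the 3600 s default operation timeout are the values from the analysis above. It also assumes the int-tests print output often enough that the idle timeout never fires, so a long run can only be cut off by whichever operation timeout is in effect.

```python
# Hypothetical model of how boot-hpcloud-vm forwards ssh_shell_timeout to
# net-ssh-simple. The function names and the new_plugin flag are
# illustrative assumptions; they are not the plugin's or library's API.

DEFAULT_OPERATION_TIMEOUT = 3600  # seconds; default cited in the analysis above


def effective_timeouts(ssh_shell_timeout, new_plugin):
    """Return (idle_timeout, operation_timeout) in seconds."""
    idle = ssh_shell_timeout  # both plugin versions forward the idle timeout
    if new_plugin:
        operation = ssh_shell_timeout  # latest plugin forwards it here too
    else:
        operation = DEFAULT_OPERATION_TIMEOUT  # older plugin leaves the default
    return idle, operation


def abort_time(run_seconds, ssh_shell_timeout, new_plugin):
    """Seconds at which the command is aborted, or None if it completes.

    Assumes the int-tests print output often enough that the idle timeout
    never fires, so only the operation timeout can end the command.
    """
    _, operation = effective_timeouts(ssh_shell_timeout, new_plugin)
    return operation if run_seconds > operation else None


if __name__ == "__main__":
    run = 65 * 60  # a hypothetical 65-minute int-test run
    print(abort_time(run, 7200, new_plugin=False))  # 3600
    print(abort_time(run, 7200, new_plugin=True))   # None
```

With the older plugin, a 65-minute run is aborted at 3600 s even though ssh_shell_timeout is 7200; with the latest plugin the same run is allowed to finish.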
FYI: Here are the jobs that failed due to the Jenkins job-level timeout (and had no test failures when the timeout occurred), along with their associated patch sets:

https://rdjenkins.dyndns.org/job/Trove-Gate/2532/console (http://review.openstack.org/73786)
https://rdjenkins.dyndns.org/job/Trove-Gate/2530/console (http://review.openstack.org/73736)
https://rdjenkins.dyndns.org/job/Trove-Gate/2517/console (http://review.openstack.org/63789)
https://rdjenkins.dyndns.org/job/Trove-Gate/2514/console (https://review.openstack.org/50944)
https://rdjenkins.dyndns.org/job/Trove-Gate/2513/console (https://review.openstack.org/50944)
https://rdjenkins.dyndns.org/job/Trove-Gate/2504/console (https://review.openstack.org/73147)
https://rdjenkins.dyndns.org/job/Trove-Gate/2503/console (https://review.openstack.org/73147)

Suggested action items:
* If it is acceptable to have jobs that run over one hour, then install the latest boot-hpcloud-vm plugin for Jenkins, which will make the operation timeout match the idle timeout.

Issue #2: The running time of all jobs is 1 hr 1 min

While the Jenkins job-level timeout ends the job after one hour, it also appears to keep every job running for a minimum of one hour. To be more precise, the timeout (or minimum running time) applies to the part of the Jenkins job that runs commands on the VM; the VM provisioning (which takes about one minute) is excluded from this timeout, which is why the running time of all jobs is around 1 hr 1 min (https://rdjenkins.dyndns.org/job/Trove-Gate/buildTimeTrend).

A sampling of console logs showing when the int-tests completed and when the timeout kicked in:

https://rdjenkins.dyndns.org/job/Trove-Gate/2531/console (00:01:03 wasted)
04:51:12 COMMAND_0: echo refs/changes/36/73736/2
...
05:50:10 335.41 proboscis.case.MethodTest (test_instance_created)
05:50:10 194.05 proboscis.case.MethodTest (test_instance_returns_to_active_after_resize)
05:51:13 **
05:51:13 ** STDERR-BEGIN **

https://rdjenkins.dyndns.org/job/Trove-Gate/2521/console (00:06:44 wasted)
21:11:44 COMMAND_0: echo refs/changes/89/63789/13
...
22:05:00 195.11 proboscis.case.MethodTest (test_instance_returns_to_active_after_resize)
22:05:00 186.89 proboscis.case.MethodTest (test_resize_down)
22:11:44 **
22:11:44 ** STDERR-BEGIN **

https://rdjenkins.dyndns.org/job/Trove-Gate/2518/consoleFull (00:06:01 wasted)
17:46:59 COMMAND_0: echo refs/changes/02/64302/20
...
18:40:57 210.03 proboscis.case.MethodTest (test_instance_returns_to_active_after_resize)
18:40:57 187.89 proboscis.case.MethodTest (test_resize_down)
18:46:58 **
18:46:58 ** STDERR-BEGIN **

Suggested action items:
* Given that the minimum running time is one hour, I assume the problem is in the net-ssh-simple library. Needs more investigation.

Issue #3: Jenkins console log line timestamps different between full and truncated views

I assume this is due to JENKINS-17779 (https://issues.jenkins-ci.org/browse/JENKINS-17779).

Suggested action items:
* Upgrade the timestamper plugin (https://wiki.jenkins-ci.org/display/JENKINS/Timestamper).
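The "wasted" figures in the Issue #2 samples are the gap between the last int-test line and the STDERR-BEGIN marker. The Python sketch below re-derives them from the timestamps quoted in this message; it also measures the gap from COMMAND_0 to STDERR-BEGIN, which in all three samples comes out within one second of 3600 s. That is consistent with (though not proof of) the one-hour operation timeout described in Issue #1. The helper names are hypothetical, and the sketch assumes the timestamps in each sample fall on the same day.

```python
from datetime import datetime, timedelta

# (COMMAND_0 start, last int-test line, STDERR-BEGIN) copied from the
# sampled console logs. Assumes no sample crosses midnight.
SAMPLES = {
    2531: ("04:51:12", "05:50:10", "05:51:13"),
    2521: ("21:11:44", "22:05:00", "22:11:44"),
    2518: ("17:46:59", "18:40:57", "18:46:58"),
}


def parse(hms):
    """Parse an HH:MM:SS console timestamp."""
    return datetime.strptime(hms, "%H:%M:%S")


def analyze(samples):
    """Map build -> (time wasted after tests, length of the command phase)."""
    result = {}
    for build, (start, tests_done, aborted) in samples.items():
        wasted = parse(aborted) - parse(tests_done)
        command_phase = parse(aborted) - parse(start)
        result[build] = (wasted, command_phase)
    return result


if __name__ == "__main__":
    for build, (wasted, phase) in analyze(SAMPLES).items():
        near_one_hour = abs(phase - timedelta(hours=1)) <= timedelta(seconds=1)
        print(f"{build}: wasted {wasted}, command phase {phase}, ~1h: {near_one_hour}")
    # 2531: wasted 0:01:03, command phase 1:00:01, ~1h: True
    # 2521: wasted 0:06:44, command phase 1:00:00, ~1h: True
    # 2518: wasted 0:06:01, command phase 0:59:59, ~1h: True
```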
Re: [openstack-dev] [Trove] Trove-Gate timeouts
Hello, Mathew.

I'm seeing the same issues with the gate. I also tried to find out why the gate job is failing. First I ran into an issue related to a cinder installation failure in devstack, but then I found the same problem you described. The best option is to increase the job's time limit.

Thanks for this research. I hope the gate will be fixed in the easiest way and in the shortest time.

Best regards,
Denis Makogon.
Sent from an iPad

On 16 Feb 2014, at 00:46, Lowery, Mathew <mlow...@ebay.com> wrote:
> [...]