[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-11-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso (WMF)  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |72113

--- Comment #14 from Antoine "hashar" Musso (WMF)  ---
The root cause is that the Gearman server stops responding, for an unknown
reason.

When reconnecting to it (see comment #12), the jobs were still stuck in the
queue due to a bug in Zuul. That is bug 72113; the patch I wrote is applied to
our Zuul and confirmed to work (merge functions are now properly retriggered
when Gearman comes back).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-10-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso (WMF)  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           See Also|                            |https://launchpad.net/bugs/1381565

--- Comment #13 from Antoine "hashar" Musso (WMF)  ---
That is related to bug 63758 (JJB created jobs not registering).

I have upgraded the Jenkins Gearman plugin to fix job registration:
* cherry-picked https://review.openstack.org/#/c/125755/ patchset 8
* compiled it via Maven
* uploaded it and restarted Jenkins

That bumps the Gearman plugin to a version with support for the Jenkins LTS
version we are using, which is probably going to help.

I found another issue that causes the Gearman server to lock up completely
while waiting for data to be received on a socket. Filed upstream as
https://bugs.launchpad.net/gear/+bug/1381565



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-09-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #12 from Antoine "hashar" Musso  ---
Documented a workaround on
https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues


The Gearman server sometimes deadlocks when a job is created in Jenkins. The
Gearman process is still around, but TCP connections time out completely and it
does not process anything. The workaround is to disconnect Jenkins from the
Gearman server and then reconnect it:

head to https://integration.wikimedia.org/ci/configure logged in with a WMF
LDAP account
search for "Gearman"
uncheck "Enable Gearman"
Save at the bottom
search for "Gearman"
check "Enable Gearman"
Save at the bottom
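
The symptom described above (process alive, but TCP connections timing out)
can be checked from a script before applying the workaround. A minimal sketch
in Python, assuming the server listens on localhost:4730 as in the nc commands
used elsewhere in this bug, and that the admin "status" listing ends with a
line holding a single dot. The function name gearman_responds is hypothetical.

```python
import socket


def gearman_responds(host="localhost", port=4730, timeout=5.0):
    """Return True if the server answers the admin "status" command in time.

    Hypothetical helper: host, port and timeout defaults are assumptions.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            sock.sendall(b"status\n")
            received = b""
            # The listing is terminated by a line containing only ".".
            while not (received == b".\n" or received.endswith(b"\n.\n")):
                chunk = sock.recv(4096)
                if not chunk:
                    return False  # closed before the terminator arrived
                received += chunk
            return True
    except OSError:
        # Covers refused connections and timeouts (a deadlocked server).
        return False
```

A False result only says the server did not answer within the timeout; it does
not tell a deadlock apart from a stopped process.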



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-09-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |matma@gmail.com

--- Comment #11 from Antoine "hashar" Musso  ---
*** Bug 70256 has been marked as a duplicate of this bug. ***



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-08-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |NEW



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-08-02 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Blocks|                            |69045
                 CC|                            |niklas.laxst...@gmail.com

--- Comment #10 from Antoine "hashar" Musso  ---
*** Bug 69045 has been marked as a duplicate of this bug. ***



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-06-07 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #9 from Antoine "hashar" Musso  ---
Created attachment 15589
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=15589&action=edit
Zuul events spike

Earlier this week I noticed Zuul being trapped in a loop. Upstream has also
noticed this from time to time but never managed to track it down.
Attached is a graph showing the spike of events on June 6th, which is caused by
the death loop.



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-05-30 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #8 from Antoine "hashar" Musso  ---
Another occurrence:


hashar@gallium:~$ echo status|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave  0  0  10
build:apps-android-wikipedia-maven-checkstyle  10  0  10

The numbers are total, running, and workers.
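
To spot this state without reading the listing by eye, the status output can
be parsed into its three counters. A minimal sketch in Python, assuming each
line holds the function name followed by total, running and workers, separated
by whitespace. The names FunctionStatus, parse_status and stuck_functions are
hypothetical, and the sample job name in the usage note is made up.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    name: str
    total: int
    running: int
    workers: int


def parse_status(text):
    """Parse the text returned by the Gearman admin "status" command."""
    statuses = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # skips blank lines and the terminating "."
        name, total, running, workers = parts
        statuses.append(
            FunctionStatus(name, int(total), int(running), int(workers))
        )
    return statuses


def stuck_functions(statuses):
    """Functions with queued jobs and registered workers, but nothing running."""
    return [
        s for s in statuses
        if s.total > 0 and s.running == 0 and s.workers > 0
    ]
```

For example, stuck_functions(parse_status("build:example-job\t10\t0\t10\n.\n"))
returns the single entry for build:example-job. A function can match for a
short moment on a healthy server too, so it is the repeated match that matters.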


And there are indeed workers registered for the function:


hashar@gallium:~$ echo workers|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle|cut -b1-50
54 127.0.0.1 integration-slave1002_exec-3 : build:
53 127.0.0.1 integration-slave1002_exec-1 : build:
55 127.0.0.1 integration-slave1002_exec-4 : build:
56 127.0.0.1 integration-slave1002_exec-0 : build:
57 127.0.0.1 integration-slave1002_exec-2 : build:
14 127.0.0.1 integration-slave1001_exec-0 : build:
19 127.0.0.1 integration-slave1001_exec-3 : build:
21 127.0.0.1 integration-slave1001_exec-4 : build:
22 127.0.0.1 integration-slave1001_exec-2 : build:
28 127.0.0.1 integration-slave1001_exec-1 : build:

The functions registered:

 build:apps-android-wikipedia-maven-checkstyle 
 build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave
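
The workers listing above can be summarized per slave to see at a glance which
nodes still have executors registered. A minimal sketch in Python, assuming
each line has the form "FD IP CLIENT_ID : functions" and that client IDs follow
the "<node>_exec-<n>" pattern visible in the output. The function name
executors_per_node is hypothetical.

```python
from collections import Counter


def executors_per_node(workers_output):
    """Count registered executors per node from admin "workers" output."""
    counts = Counter()
    for line in workers_output.splitlines():
        head, sep, _functions = line.partition(" :")
        if not sep:
            continue  # not a worker line (for example the terminating ".")
        fields = head.split()
        if len(fields) != 3:
            continue
        client_id = fields[2]
        # Assumed naming convention: "<node>_exec-<executor number>".
        node = client_id.rsplit("_exec-", 1)[0]
        counts[node] += 1
    return counts
```

A node that is missing from the result, or that has fewer executors than
expected, is a candidate for the disconnect and reconnect workaround.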



WORKAROUND: disconnect and reconnect the labs slaves.



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-05-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #7 from Antoine "hashar" Musso  ---
Crashed again on May 28th during the European afternoon.

Jobs meant to be run on labs instances ended up not being registered anymore
with the Zuul Gearman server.   That must be a bug in the Jenkins Gearman
plugin :-(  {{bug|63760}}



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-05-23 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #6 from Antoine "hashar" Musso  ---
That occurred again today around noon UTC. Jenkins/Zuul restarted at around
14:17 UTC  :-(



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-05-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |RESOLVED
         Resolution|---                         |FIXED

--- Comment #5 from Antoine "hashar" Musso  ---
It seems this is no longer occurring.



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-04-29 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso  changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Priority|Unprioritized               |High
             Status|NEW                         |ASSIGNED
           Assignee|wikibugs-l@lists.wikimedia.org|has...@free.fr

--- Comment #4 from Antoine "hashar" Musso  ---
We got python-gear upgraded from 0.4.0 to 0.5.4, which fixes a number of
function registration errors in Gearman.  That might solve the issue.



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-04-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #3 from Antoine "hashar" Musso  ---
I have upgraded Zuul from wmf-deploy-20140122 to wmf-deploy-20140416-3. That
might fix it.



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-04-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #2 from Antoine "hashar" Musso  ---
Disconnecting and reconnecting the gearman client does unleash a few jobs.

Disconnecting and reconnecting a slave does unleash them as well.


Here is the debug output from when I disconnected and reconnected
integration-slave1002.eqiad.wmflabs:


hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  12  2  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  11  1  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  10  0  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  10  0  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  9  2  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ 


It eventually managed to run them all.
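
Polling status repeatedly, as above, amounts to watching whether the total
counter of a function goes down over time. A minimal sketch in Python of that
check, given the successive totals read for one function. The function name
classify_queue and its return labels are hypothetical.

```python
def classify_queue(totals):
    """Classify successive "total" readings taken for one Gearman function."""
    if len(totals) < 2:
        return "unknown"   # a single reading says nothing about progress
    if totals[-1] == 0:
        return "drained"
    if totals[-1] < totals[0]:
        return "draining"
    return "stalled"
```

For example, readings of 12, 11, 10, 10 and 9 classify as "draining", while a
total that never moves classifies as "stalled". The result depends on the
polling interval, so a short window can report "stalled" for a queue that is
merely slow.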



[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins

2014-04-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #1 from Antoine "hashar" Musso  ---
Once slaves are disconnected I get:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave  0  0  0
build:integration-jjb-config-test  2  0  0

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  22  0  0
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  0


It did process a few jobs but got stuck again:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave  0  0  14
build:integration-jjb-config-test  2  0  14

$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  16  0  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
