[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso (WMF) changed:

What       |Removed |Added
Depends on |        |72113

--- Comment #14 from Antoine "hashar" Musso (WMF) ---
The root cause is that the Gearman server stops responding, for a reason that is still unknown. After reconnecting it (see comment #12), the jobs were still stuck in the queue due to a bug in Zuul. That is bug 72113; the patch I wrote is applied on our Zuul and confirmed to work (merge functions are now properly retriggered when Gearman comes back).

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso (WMF) changed:

What     |Removed |Added
See Also |        |https://launchpad.net/bugs/1381565

--- Comment #13 from Antoine "hashar" Musso (WMF) ---
This is related to bug 63758 (JJB created jobs not registering).

I have upgraded the Jenkins Gearman plugin to fix job registration:

* cherry-picked https://review.openstack.org/#/c/125755/ patchset 8
* compiled it with Maven
* uploaded it and restarted Jenkins

That bumps the Gearman plugin to a build that supports the Jenkins LTS version we are using, which will probably help.

I also found another issue that causes the Gearman server to lock up completely while waiting for data to be received on a socket. Filed upstream as https://bugs.launchpad.net/gear/+bug/1381565
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #12 from Antoine "hashar" Musso ---
Documented a workaround on https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues

The Gearman server sometimes deadlocks when a job is created in Jenkins. The Gearman process is still around, but TCP connections to it time out and it does not process anything.

The workaround is to disconnect Jenkins from the Gearman server and then reconnect it:

* head to https://integration.wikimedia.org/ci/configure, logged in with a WMF LDAP account
* search for "Gearman"
* uncheck "Enable Gearman"
* Save at the bottom
* search for "Gearman"
* check "Enable Gearman"
* Save at the bottom
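The deadlock symptom described above (process alive, TCP connections timing out) could be checked for with a small probe. The following is only a sketch, not part of the documented workaround: the function name `gearman_responds` and the timeout value are made up for illustration, and it assumes the server answers the plain-text `version` admin command on port 4730.

```python
import socket


def gearman_responds(host="localhost", port=4730, timeout=5.0):
    """Return True if the Gearman admin interface answers within `timeout` seconds.

    Hypothetical helper: sends the text admin command "version" and treats
    any reply as a sign of life. A refused connection, a timeout, or a
    connection closed without a reply all count as "not responding".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"version\n")
            return bool(sock.recv(1024))
    except OSError:
        # socket.timeout and ConnectionRefusedError are both OSError subclasses.
        return False
```

A False result only says the server did not answer in time. It does not distinguish a deadlock from a stopped process, so checking that the process is still running remains a separate step.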
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso changed:

What |Removed |Added
CC   |        |matma@gmail.com

--- Comment #11 from Antoine "hashar" Musso ---
*** Bug 70256 has been marked as a duplicate of this bug. ***
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso changed:

What   |Removed  |Added
Status |REOPENED |NEW
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso changed:

What   |Removed |Added
Blocks |        |69045
CC     |        |niklas.laxst...@gmail.com

--- Comment #10 from Antoine "hashar" Musso ---
*** Bug 69045 has been marked as a duplicate of this bug. ***
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #9 from Antoine "hashar" Musso ---
Created attachment 15589
  --> https://bugzilla.wikimedia.org/attachment.cgi?id=15589&action=edit
Zuul events spike

Earlier this week I noticed Zuul being trapped in some loop. Upstream has noticed it as well from time to time but never managed to track it down.

Attached is a graph showing the spike of events on June 6th, which is caused by the death loop.
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #8 from Antoine "hashar" Musso ---
Another occurrence:

hashar@gallium:~$ echo status|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave  0  0  10
build:apps-android-wikipedia-maven-checkstyle  10  0  10

The numbers are Total, Running, Workers.

And there are indeed workers registered for the function:

hashar@gallium:~$ echo workers|nc -q 2 localhost 4730|fgrep apps-android-wikipedia-maven-checkstyle|cut -b1-50
54 127.0.0.1 integration-slave1002_exec-3 : build:
53 127.0.0.1 integration-slave1002_exec-1 : build:
55 127.0.0.1 integration-slave1002_exec-4 : build:
56 127.0.0.1 integration-slave1002_exec-0 : build:
57 127.0.0.1 integration-slave1002_exec-2 : build:
14 127.0.0.1 integration-slave1001_exec-0 : build:
19 127.0.0.1 integration-slave1001_exec-3 : build:
21 127.0.0.1 integration-slave1001_exec-4 : build:
22 127.0.0.1 integration-slave1001_exec-2 : build:
28 127.0.0.1 integration-slave1001_exec-1 : build:

The functions registered:

build:apps-android-wikipedia-maven-checkstyle
build:apps-android-wikipedia-maven-checkstyle:contintLabsSlave

WORKAROUND: disconnect and reconnect the labs slaves.
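The Total / Running / Workers reading of the `status` output can be turned into a small check for stuck functions. This is a minimal sketch, assuming the usual tab-separated admin format (function name, total, running, available workers, with a lone `.` ending the reply). The names `FunctionStatus`, `parse_status` and `stuck_functions` are invented for this example, and the sample job names are placeholders rather than real data.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    name: str
    total: int    # jobs queued or running
    running: int  # jobs currently running
    workers: int  # workers able to run this function


def parse_status(text):
    """Parse the reply to the Gearman `status` admin command."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":  # "." terminates the reply
            continue
        name, total, running, workers = line.split("\t")
        entries.append(FunctionStatus(name, int(total), int(running), int(workers)))
    return entries


def stuck_functions(entries):
    """Functions with queued jobs and registered workers, but nothing running."""
    return [e for e in entries if e.total > 0 and e.running == 0 and e.workers > 0]


# Placeholder sample in the assumed format (values are illustrative only).
sample = "build:example-job\t10\t0\t10\nbuild:example-job:someLabel\t0\t0\t10\n.\n"
for entry in stuck_functions(parse_status(sample)):
    print(entry.name, "has", entry.total, "queued jobs and", entry.workers, "idle workers")
```

A function can briefly look stuck between two job assignments, so a real check would probably require the condition to hold across several samples before alerting.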
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #7 from Antoine "hashar" Musso ---
Crashed again on May 28th during the European afternoon. Jobs meant to run on labs instances were no longer registered with the Zuul Gearman server. That must be a bug in the Jenkins Gearman plugin :-(

{{bug|63760}}
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso changed:

What       |Removed  |Added
Status     |RESOLVED |REOPENED
Resolution |FIXED    |---

--- Comment #6 from Antoine "hashar" Musso ---
It occurred again today around noon UTC. Jenkins/Zuul were restarted at around 14:17 UTC :-(
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso changed:

What       |Removed  |Added
Status     |ASSIGNED |RESOLVED
Resolution |---      |FIXED

--- Comment #5 from Antoine "hashar" Musso ---
It seems this is no longer occurring.
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

Antoine "hashar" Musso changed:

What     |Removed                        |Added
Priority |Unprioritized                  |High
Status   |NEW                            |ASSIGNED
Assignee |wikibugs-l@lists.wikimedia.org |has...@free.fr

--- Comment #4 from Antoine "hashar" Musso ---
We got python-gear upgraded from 0.4.0 to 0.5.4, which fixes a number of function registration errors in Gearman. That might solve the issue.
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #3 from Antoine "hashar" Musso ---
I have upgraded Zuul from wmf-deploy-20140122 to wmf-deploy-20140416-3. That might fix it.
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #2 from Antoine "hashar" Musso ---
Disconnecting and reconnecting the Gearman client does unleash a few jobs. Disconnecting and reconnecting a slave unleashes them as well.

Here is the debug output from while I disconnected and reconnected integration-slave1002.eqiad.wmflabs:

hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  12  2  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  11  1  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  10  0  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  10  0  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14
hashar@gallium:~$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  9  2  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14

It eventually managed to run them all.
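The repeated `echo status|nc -q 2 localhost 4730|grep ...` invocation can also be scripted. The sketch below is one possible equivalent, assuming the admin reply is terminated by a line containing only `.`. The helper names `read_admin_reply` and `status_lines` are hypothetical and not part of Zuul or python-gear.

```python
import socket


def read_admin_reply(sock):
    """Read from `sock` until the lone "." line that ends an admin reply."""
    data = b""
    while not (data == b".\n" or data.endswith(b"\n.\n")):
        chunk = sock.recv(4096)
        if not chunk:  # peer closed the connection early
            break
        data += chunk
    lines = data.decode("utf-8", "replace").splitlines()
    return [line for line in lines if line != "."]


def status_lines(pattern, host="localhost", port=4730, timeout=5.0):
    """Rough equivalent of: echo status | nc HOST PORT | grep PATTERN."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        return [line for line in read_admin_reply(sock) if pattern in line]
```

For example, `status_lines("apps-android-wikipedia-tox-flake8")` would return the matching status lines. Unlike `nc -q 2`, the timeout makes the call raise an error instead of hanging if the server has stopped answering.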
[Bug 63760] Jobs are sometime no more being triggered by Zuul / Jenkins
https://bugzilla.wikimedia.org/show_bug.cgi?id=63760

--- Comment #1 from Antoine "hashar" Musso ---
Once the slaves are disconnected I get:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave  0  0  0
build:integration-jjb-config-test  2  0  0
$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  22  0  0
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  0

After reconnecting, it did process a few jobs but got stuck again:

$ echo status|nc -q 2 localhost 4730|grep integration-jjb-config-test
build:integration-jjb-config-test:contintLabsSlave  0  0  14
build:integration-jjb-config-test  2  0  14
$ echo status|nc -q 2 localhost 4730|grep apps-android-wikipedia-tox-flake8
build:apps-android-wikipedia-tox-flake8  16  0  14
build:apps-android-wikipedia-tox-flake8:contintLabsSlave  0  0  14