Re: update on git timeouts for jenkins builds
newp. still happening, and i'm still looking into it:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38880/console

On Wed, Jul 29, 2015 at 12:20 PM, shane knapp skn...@berkeley.edu wrote:
ok, i think i found the problem and solution to the git timeouts:
https://stackoverflow.com/questions/12236415/git-clone-return-result-18-code-200-on-a-specific-repository

so, on each worker i've run
git config --global http.postBuffer 524288000
as the jenkins user and we'll see if this makes a difference.

To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
Re: update on git timeouts for jenkins builds
ok, i think i found the problem and solution to the git timeouts:
https://stackoverflow.com/questions/12236415/git-clone-return-result-18-code-200-on-a-specific-repository

so, on each worker i've run
git config --global http.postBuffer 524288000
as the jenkins user and we'll see if this makes a difference.
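a minimal sketch of how the setting above could be pushed to every worker and read back. jenkins_workers.txt is the worker list used elsewhere in this thread; passwordless ssh as jenkins@<worker>, the function name, and the DRY_RUN switch are assumptions, not something the thread describes. note that git's own documentation describes http.postBuffer as the buffer smart HTTP uses when POSTing data to the remote, so whether it helps with clone/fetch timeouts is an open question.

```shell
# Sketch: set http.postBuffer for the jenkins user on every worker.
# Assumptions (not from the thread): passwordless ssh as jenkins@<worker>,
# and the DRY_RUN switch, which only prints what would be run.

POST_BUFFER=524288000   # value from the message above: 500 * 1024 * 1024 bytes

set_post_buffer() {
    # $1 = file with one worker hostname per line
    while read -r worker; do
        [ -z "$worker" ] && continue
        cmd="git config --global http.postBuffer $POST_BUFFER"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "ssh jenkins@$worker $cmd"
        else
            # -n keeps ssh from consuming the rest of the worker list on stdin;
            # the second git command reads the value back as a check.
            ssh -n "jenkins@$worker" "$cmd && git config --global --get http.postBuffer"
        fi
    done < "$1"
}
```

running `DRY_RUN=1 set_post_buffer jenkins_workers.txt` first shows the exact commands without touching any worker.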
Re: update on git timeouts for jenkins builds
btw, the directory perm issue was only happening on amp-jenkins-worker-04 and -05. both of the broken dirs were clobbered, so we won't be seeing any more of these again.
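a quick recomputation of the snapshot numbers from the original report (3 + 1 + 7 + 18 + 9 timeouts, 1908 builds). get_timeouts.sh itself is not shown anywhere in the thread, so this is only a sketch of the arithmetic, and the variable names are made up. the per-date and per-project counts both add up to 38, and 38 out of 1908 is just under 2%, so the reported "01%" would be consistent with integer truncation.

```shell
# Sketch: recompute the totals from the snapshot numbers quoted in the thread.
timeouts_by_date="3 1 7 18 9"   # 2015-07-23, -24, -26, -27, -28
total_builds=1908

total_timeouts=0
for n in $timeouts_by_date; do
    total_timeouts=$((total_timeouts + n))
done

# Integer division truncates; awk gives a two-decimal figure for comparison.
pct_int=$((100 * total_timeouts / total_builds))
pct_exact=$(awk -v t="$total_timeouts" -v b="$total_builds" \
    'BEGIN { printf "%.2f", 100 * t / b }')

echo "timeouts: $total_timeouts, integer pct: $pct_int%, exact pct: $pct_exact%"
```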
Re: update on git timeouts for jenkins builds
++joshrosen

ok, i found out some of what's going on. some builds were failing as such:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38749/console

note that it's unable to remove the target/ directory during the build... this is caused by 'git clean -fdx' running, and deep in the target directory there were a couple of dirs that had the wrong permission bits set:

dr-xr-xr-x. 2 jenkins jenkins 4096 Jul 27 06:54 /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-615f93cc-27ad-464b-b0d4-4352c96c22ee

note the missing 'w' on the owner bits. this is what was causing those failures. after manually deleting the two entries that i found (using the command below), we've whacked this mole for now.

for x in $(cat jenkins_workers.txt); do echo $x; ssh $x find /home/jenkins/workspace/SparkPullRequestBuilder*/target/tmp -maxdepth 3| xargs ls -ld | egrep ^dr-x; echo; echo; done

as for what exactly is messing up the perms, i'm not entirely sure. josh, you have any ideas?

shane
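for the permission check described in this message, a sketch that asks find for the mode bits directly instead of grepping ls output. it is meant to run on the worker itself. the function names are hypothetical, and the chmod step is only one possible remedy: in the thread the two broken entries were deleted by hand.

```shell
# Sketch: list directories under a build's target/tmp that lack the owner
# write bit, which is what made 'git clean -fdx' fail in the build above.
# Function names are hypothetical; the depth limit mirrors the thread.

find_unwritable_dirs() {
    # $1 = directory to scan, e.g. <workspace>/target/tmp
    # -perm -200 matches entries with the owner write bit set; '!' negates it.
    find "$1" -maxdepth 3 -type d ! -perm -200 2>/dev/null
}

fix_unwritable_dirs() {
    # Restore the owner write bit so a later 'git clean -fdx' can remove them.
    find_unwritable_dirs "$1" | while IFS= read -r dir; do
        chmod u+w "$dir"
    done
}
```

on a worker this could be called as `find_unwritable_dirs /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp`, which prints nothing when every directory is writable by its owner.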
Re: update on git timeouts for jenkins builds
git caches are set up on all workers for the pull request builder, and builds are building w/the cache... however in the build logs it doesn't seem to be actually *hitting* the cache, so i guess i'll be doing some more poking and prodding to see wtf is going on.

On Tue, Jul 28, 2015 at 12:49 PM, shane knapp skn...@berkeley.edu wrote:
btw, the directory perm issue was only happening on amp-jenkins-worker-04 and -05. both of the broken dirs were clobbered, so we won't be seeing any more of these again.

On Tue, Jul 28, 2015 at 12:28 PM, shane knapp skn...@berkeley.edu wrote:
++joshrosen

ok, i found out some of what's going on. some builds were failing as such:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38749/console

note that it's unable to remove the target/ directory during the build... this is caused by 'git clean -fdx' running, and deep in the target directory there were a couple of dirs that had the wrong permission bits set:

dr-xr-xr-x. 2 jenkins jenkins 4096 Jul 27 06:54 /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-615f93cc-27ad-464b-b0d4-4352c96c22ee

note the missing 'w' on the owner bits. this is what was causing those failures. after manually deleting the two entries that i found (using the command below), we've whacked this mole for now.

for x in $(cat jenkins_workers.txt); do echo $x; ssh $x find /home/jenkins/workspace/SparkPullRequestBuilder*/target/tmp -maxdepth 3| xargs ls -ld | egrep ^dr-x; echo; echo; done

as for what exactly is messing up the perms, i'm not entirely sure. josh, you have any ideas?

shane

On Tue, Jul 28, 2015 at 11:51 AM, shane knapp skn...@berkeley.edu wrote:
hey all,

i'm just back in from my wedding weekend (woot!) and am working on figuring out what's happening w/the git timeouts for pull request builds.

TL;DR: if your build fails due to a timeout, please retrigger your builds. i know this isn't the BEST solution, but until we get some stuff implemented (traffic shaping, git cache for the workers) it's the only thing i can recommend.

here's a snapshot of the state of the union:

$ get_timeouts.sh 5
timeouts by date:
2015-07-23 -- 3
2015-07-24 -- 1
2015-07-26 -- 7
2015-07-27 -- 18
2015-07-28 -- 9

timeouts by project:
35 SparkPullRequestBuilder
3 Tachyon-Pull-Request-Builder

total builds (excepting aborted by a user):
1908

total percentage of builds timing out:
01%

nothing has changed on our end AFAIK, our traffic graphs look totally fine, but starting sunday, we started seeing a spike in timeouts, with yesterday being the worst. today is also not looking good either.

github is looking OK, but not great:
https://status.github.com/

as a solution, we'll be setting up some traffic shaping on our end, as well as implementing a git cache on the workers so that we'll (hopefully) minimize how many hits we make against github. i was planning on doing the git cache months ago, but the timeout issue pretty much went away and i back-burnered that idea until today.

other than that, i'll be posting updates as we get them.

shane
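the thread does not say how the git cache on the workers was configured. one common approach, sketched here under that caveat, is a bare mirror per worker plus `git clone --reference`, so that build clones borrow objects from the mirror instead of fetching them from github. the function names and paths are hypothetical.

```shell
# Sketch: keep one bare mirror per worker and let build clones borrow
# objects from it. Function names and paths are hypothetical; this is not
# necessarily how the workers in this thread were set up.

update_git_cache() {
    # $1 = upstream URL, $2 = local mirror path
    if [ -d "$2" ]; then
        # Refresh an existing mirror.
        git --git-dir="$2" remote update --prune
    else
        # First run: create the bare mirror.
        git clone --mirror "$1" "$2"
    fi
}

clone_with_cache() {
    # $1 = upstream URL, $2 = local mirror path, $3 = workspace directory
    # --reference records the mirror in .git/objects/info/alternates, so
    # objects already present in the mirror are not transferred again.
    git clone --reference "$2" "$1" "$3"
}
```

one way to tell whether a workspace is actually using the cache is to look at its .git/objects/info/alternates file: a clone made with --reference lists the mirror's objects directory there, and a clone made without it has no such file.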