Re: update on git timeouts for jenkins builds

2015-07-29 Thread shane knapp
newp.  still happening, and i'm still looking into it:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38880/console

On Wed, Jul 29, 2015 at 12:20 PM, shane knapp skn...@berkeley.edu wrote:
 ok, i think i found the problem and solution to the git timeouts:

 https://stackoverflow.com/questions/12236415/git-clone-return-result-18-code-200-on-a-specific-repository

 so, on each worker i've run
 git config --global http.postBuffer 524288000
 as the jenkins user and we'll see if this makes a difference.

 On Tue, Jul 28, 2015 at 11:51 AM, shane knapp skn...@berkeley.edu wrote:
 hey all, i'm just back in from my wedding weekend (woot!) and am
 working on figuring out what's happening w/the git timeouts for pull
 request builds.

 TL;DR:  if your build fails due to a timeout, please retrigger your
 builds.  i know this isn't the BEST solution, but until we get some
 stuff implemented (traffic shaping, git cache for the workers) it's
 the only thing i can recommend.

 here's a snapshot of the state of the union:
 $ get_timeouts.sh 5
 timeouts by date:
 2015-07-23 -- 3
 2015-07-24 -- 1
 2015-07-26 -- 7
 2015-07-27 -- 18
 2015-07-28 -- 9

 timeouts by project:
  35 SparkPullRequestBuilder
   3 Tachyon-Pull-Request-Builder
 total builds (excepting aborted by a user):
 1908

 total percentage of builds timing out:
 01%
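the 01% figure looks low next to 38 timeouts (35 + 3) out of 1908 builds.  a minimal sketch of the arithmetic, assuming (this is a guess, since get_timeouts.sh itself isn't shown in the thread) that the script uses integer division and zero-pads the result.  the helper names are hypothetical:

```shell
#!/bin/sh
# Counts taken from the report above; the helper names are hypothetical.
timeouts=$((35 + 3))   # SparkPullRequestBuilder + Tachyon-Pull-Request-Builder
total=1908             # total builds, excluding user-aborted ones

# Integer division truncates: 3800 / 1908 = 1, printed zero-padded.
pct_truncated() { printf '%02d%%\n' $(( $1 * 100 / $2 )); }

# Two-decimal version via awk, for comparison.
pct_precise() { awk -v t="$1" -v n="$2" 'BEGIN { printf "%.2f%%\n", t * 100 / n }'; }

pct_truncated "$timeouts" "$total"
pct_precise "$timeouts" "$total"
```

under that assumption the real rate is a little under 2%, and the truncation is what turns it into 01%.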

 nothing has changed on our end AFAIK, our traffic graphs look totally
 fine, but starting sunday, we started seeing a spike in timeouts, with
 yesterday being the worst.  today is not looking good either.

 github is looking OK, but not great:
 https://status.github.com/

 as a solution, we'll be setting up some traffic shaping on our end, as
 well as implementing a git cache on the workers so that we'll
 (hopefully) minimize how many hits we make against github.  i was
 planning on doing the git cache months ago, but the timeout issue
 pretty much went away and i back-burnered that idea until today.

 other than that, i'll be posting updates as we get them.

 shane

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: update on git timeouts for jenkins builds

2015-07-29 Thread shane knapp
ok, i think i found the problem and solution to the git timeouts:

https://stackoverflow.com/questions/12236415/git-clone-return-result-18-code-200-on-a-specific-repository

so, on each worker i've run
git config --global http.postBuffer 524288000
as the jenkins user and we'll see if this makes a difference.
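
for reference, 524288000 bytes is 500 MiB (500 * 1024 * 1024).  a minimal sketch of setting the value and reading it back.  it writes to a scratch config file (a hypothetical path from mktemp) via `git config --file` so it doesn't touch any real ~/.gitconfig, whereas the workers used --global as the jenkins user:

```shell
#!/bin/sh
# 500 MiB expressed in bytes, matching the value used on the workers.
post_buffer=$((500 * 1024 * 1024))
echo "$post_buffer"

# Scratch config file (hypothetical) so the real global config is untouched.
# On a worker this would be: git config --global http.postBuffer 524288000
cfg=$(mktemp)
git config --file "$cfg" http.postBuffer "$post_buffer"

# Read the setting back to confirm it was stored.
stored=$(git config --file "$cfg" --get http.postBuffer)
echo "http.postBuffer = $stored"
rm -f "$cfg"
```

note that http.postBuffer sizes the buffer git uses when POSTing data over smart HTTP, so it may or may not have any effect on fetch timeouts.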




Re: update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
btw, the directory perm issue was only happening on
amp-jenkins-worker-04 and -05.  both of the broken dirs were
clobbered, so we won't be seeing any more of these again.

On Tue, Jul 28, 2015 at 12:28 PM, shane knapp skn...@berkeley.edu wrote:
 ++joshrosen

 ok, i found out some of what's going on.  some builds were failing as such:
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38749/console

 note that it's unable to remove the target/ directory during the
 build...  this is caused by 'git clean -fdx' running, and deep in the
 target directory there were a couple of dirs that had the wrong
 permission bits set:

 dr-xr-xr-x.  2 jenkins jenkins 4096 Jul 27 06:54
 /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-615f93cc-27ad-464b-b0d4-4352c96c22ee

 note the missing 'w' on the owner bits.  this is what was causing
 those failures.  after manually deleting the two entries that i found
 (using the command below), we've whacked this mole for now.

 for x in $(cat jenkins_workers.txt); do echo "$x"; \
   ssh "$x" "find /home/jenkins/workspace/SparkPullRequestBuilder*/target/tmp -maxdepth 3 | xargs ls -ld | egrep '^dr-x'"; \
   echo; echo; done

 as for what exactly is messing up the perms, i'm not entirely sure.
 josh, you have any ideas?

 shane




Re: update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
++joshrosen

ok, i found out some of what's going on.  some builds were failing as such:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/38749/console

note that it's unable to remove the target/ directory during the
build...  this is caused by 'git clean -fdx' running, and deep in the
target directory there were a couple of dirs that had the wrong
permission bits set:

dr-xr-xr-x.  2 jenkins jenkins 4096 Jul 27 06:54
/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-615f93cc-27ad-464b-b0d4-4352c96c22ee

note the missing 'w' on the owner bits.  this is what was causing
those failures.  after manually deleting the two entries that i found
(using the command below), we've whacked this mole for now.

for x in $(cat jenkins_workers.txt); do echo "$x"; \
  ssh "$x" "find /home/jenkins/workspace/SparkPullRequestBuilder*/target/tmp -maxdepth 3 | xargs ls -ld | egrep '^dr-x'"; \
  echo; echo; done
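
a variation on the scan above, as a sketch: it asks find directly for directories that are missing the owner write bit instead of grepping ls output.  the function names are hypothetical, and the chmod step is an assumption about how such dirs could be made removable again (in this thread they were deleted by hand):

```shell
#!/bin/sh
# List directories under $1 (up to 3 levels deep) that lack the owner
# write bit.  '-perm -200' matches entries with that bit set, '!' negates it.
find_unwritable_dirs() {
    find "$1" -maxdepth 3 -type d ! -perm -200
}

# Restore the owner write bit on those directories so that
# 'git clean -fdx' (or rm -rf) can remove their contents.
fix_unwritable_dirs() {
    find_unwritable_dirs "$1" | while IFS= read -r d; do
        chmod u+w "$d"
    done
}
```

on a worker this could be run as, for example, `find_unwritable_dirs /home/jenkins/workspace/SparkPullRequestBuilder/target/tmp`.  it reports what is broken but says nothing about what set the perms in the first place.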

as for what exactly is messing up the perms, i'm not entirely sure.
josh, you have any ideas?

shane




Re: update on git timeouts for jenkins builds

2015-07-28 Thread shane knapp
git caches are set up on all workers for the pull request builder, and
builds are building w/the cache...  however in the build logs it
doesn't seem to be actually *hitting* the cache, so i guess i'll be
doing some more poking and prodding to see wtf is going on.
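
one common way to build this kind of cache, sketched under assumptions: keep a bare mirror on each worker and point clones at it with `--reference`, so objects already in the mirror aren't fetched from github again.  this is not necessarily how the workers here are set up; the function names and paths are hypothetical:

```shell
#!/bin/sh
# Create the mirror on first use, otherwise refresh it from upstream.
# usage: update_cache <upstream-url> <cache-dir>
update_cache() {
    if [ -d "$2" ]; then
        git -C "$2" remote update --prune
    else
        git clone --mirror "$1" "$2"
    fi
}

# Clone using the mirror as an object source, so only objects that are
# missing from the mirror need to come from upstream.
# usage: clone_with_cache <upstream-url> <cache-dir> <workdir>
clone_with_cache() {
    git clone --reference "$2" "$1" "$3"
}

# A clone that borrows objects from a cache lists the cache's object
# directory in .git/objects/info/alternates.
# usage: uses_cache <workdir>
uses_cache() {
    [ -s "$1/.git/objects/info/alternates" ]
}
```

checking for a non-empty alternates file in the workspace is one quick way to tell whether a given clone is wired to the cache at all, separately from what the build log prints.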

