Re: SIGMOD System Award for Apache Spark

2022-05-12 Thread shane knapp
woot!  :D

On Thu, May 12, 2022 at 4:27 PM Hyukjin Kwon  wrote:

> Awesome!
>
> On Fri, May 13, 2022 at 5:29 AM Mosharaf Chowdhury 
> wrote:
>
>> Wow! Congratulations to everyone indeed.
>>
>> On Thu, May 12, 2022 at 3:44 PM Matei Zaharia 
>> wrote:
>>
>>> Hi all,
>>>
>>> We recently found out that Apache Spark received
>>> <https://sigmod.org/2022-sigmod-awards/> the SIGMOD System Award this
>>> year, given by SIGMOD (the ACM’s data management research organization) to
>>> impactful real-world and research systems. This puts Spark in good company
>>> with some very impressive previous recipients
>>> <https://sigmod.org/sigmod-awards/sigmod-systems-award/>. This award is
>>> really an achievement by the whole community, so I wanted to say congrats
>>> to everyone who contributes to Spark, whether through code, issue reports,
>>> docs, or other means.
>>>
>>> Matei
>>>
>>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu
https://sky.cs.berkeley.edu/


Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-27 Thread shane knapp
# systemctl stop jenkins
#



goodbye jenkins!  

On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠  wrote:

> hey everyone!
>
> after a marathon run of nearly a decade, we're finally going to be
> shutting down {amp|rise}lab jenkins at the end of this month...
>
> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>
> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>
> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
> system, but technology has moved on and running a dedicated set of servers
> for just one open source project is just too expensive for us here at uc
> berkeley.
>
> if there's interest, i'll fire up a zoom session and all y'alls can watch
> me type the final command:
>
> systemctl stop jenkins
>
> feeling bittersweet,
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-07 Thread shane knapp
created an issue to track stuff:

https://issues.apache.org/jira/browse/SPARK-37571

On Tue, Dec 7, 2021 at 8:25 AM shane knapp ☠  wrote:

>> Will you be nuking all the Jenkins-related code in the repo after the 23rd?
>
> probably not right away...  but soon after jenkins is shut down.  bits of
> the docs and spark website will need to be updated as well.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-07 Thread shane knapp
> Will you be nuking all the Jenkins-related code in the repo after the 23rd?

probably not right away...  but soon after jenkins is shut down.  bits of
the docs and spark website will need to be updated as well.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[Apache Spark Jenkins] build system shutting down Dec 23rd, 2021

2021-12-06 Thread shane knapp
hey everyone!

after a marathon run of nearly a decade, we're finally going to be shutting
down {amp|rise}lab jenkins at the end of this month...

the earliest snapshot i could find is from 2013 with builds for spark 0.7:
https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/

it's been a hell of a run, and i'm gonna miss randomly tweaking the build
system, but technology has moved on and running a dedicated set of servers
for just one open source project is just too expensive for us here at uc
berkeley.

if there's interest, i'll fire up a zoom session and all y'alls can watch
me type the final command:

systemctl stop jenkins

feeling bittersweet,

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-12 Thread shane knapp
woot!  nice work everyone!  :)

On Fri, Nov 12, 2021 at 11:37 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark community has been working on Java 17 support under the
> following JIRA.
>
> https://issues.apache.org/jira/browse/SPARK-33772
>
> As of today, Apache Spark starts to have daily Java 17 test coverage via
> GitHub Action jobs for Apache Spark 3.3.
>
>
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L38-L39
>
> Today's successful run is here.
>
> https://github.com/apache/spark/actions/runs/1453788012
>
> Please note that we are still working on some new Java 17 features like
>
> JEP 391: macOS/AArch64 Port
> https://bugs.openjdk.java.net/browse/JDK-8251280
>
> For example, Oracle Java, Azul Zulu, and Eclipse Temurin Java 17 already
> support Apple Silicon natively, but some 3rd party libraries like
> RocksDB/LevelDB are not ready yet. Since Mac is one of the popular dev
> environments, we are going to keep monitoring and improving gradually for
> Apache Spark 3.3.
>
> Please test Java 17 and let us know your feedback.
>
> Thanks,
> Dongjoon.
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] quick jenkins reboot

2021-10-22 Thread shane knapp
we've been back for about an hour.  :)

On Fri, Oct 22, 2021 at 1:52 PM shane knapp ☠  wrote:

> system load on the primary is getting suspiciously high, and free ram has
> mysteriously disappeared and we are rapidly approaching swap.  whatever
> could it be?
>
> java.
>
> i'm going to take this opportunity to reboot everything and start from a
> clean-ish state.  we'll be down for ~45m or so.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] quick jenkins reboot

2021-10-22 Thread shane knapp
system load on the primary is getting suspiciously high, and free ram has
mysteriously disappeared and we are rapidly approaching swap.  whatever
could it be?

java.

i'm going to take this opportunity to reboot everything and start from a
clean-ish state.  we'll be down for ~45m or so.
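
A minimal sketch of how the symptoms described above (free ram disappearing, swap getting close) could be checked from a shell. It assumes a Linux host that exposes /proc/meminfo; the function name `mem_available_pct` is made up for illustration and is not part of the actual jenkins setup.

```shell
#!/bin/sh
# mem_available_pct: read /proc/meminfo-style text on stdin and print the
# percentage of RAM still available (MemAvailable / MemTotal), rounded down.
# (hypothetical helper, not taken from the build system itself)
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", (avail * 100) / total
      else           print "unknown"
    }
  '
}
```

On a live host this would be fed with `mem_available_pct < /proc/meminfo`. To see which process is holding the memory, something like `ps -eo pid,rss,comm --sort=-rss | head -n 5` (procps on Linux) lists the largest resident processes first.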

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] DNS outage @ uc berkeley, jenkins not available

2021-09-01 Thread shane knapp
this was resolved by campus IT around 930pm last night.

On Tue, Aug 31, 2021 at 12:54 PM shane knapp ☠  wrote:
>
> we're having some DNS issues here in the EECS department, and our
> crack team is working on getting it resolved asap.  until then,
> jenkins isn't visible to the outside world.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu



-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[build system] DNS outage @ uc berkeley, jenkins not available

2021-08-31 Thread shane knapp
we're having some DNS issues here in the EECS department, and our
crack team is working on getting it resolved asap.  until then,
jenkins isn't visible to the outside world.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu




Re: [build system] quick jenkins restart

2021-08-25 Thread shane knapp
aaand we're back!

On Wed, Aug 25, 2021 at 9:24 AM shane knapp ☠  wrote:

> i'll be:
> - upgrading jenkins to the latest LTS
> - moving jenkins to java 11 (from java 8)
> - rebooting everything
>
> sorry for the disruption...  there aren't many builds running right now so
> i'll just get this sorted.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] quick jenkins restart

2021-08-25 Thread shane knapp
i'll be:
- upgrading jenkins to the latest LTS
- moving jenkins to java 11 (from java 8)
- rebooting everything
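
Before and after a move like the java 8 to java 11 switch listed above, it can help to confirm which major version a host actually runs. The following is only a sketch: the helper name `java_major` is hypothetical, and it assumes version strings in either the legacy "1.8.0_292" form or the newer "11.0.11" form.

```shell
#!/bin/sh
# java_major: print the major version for a Java version string.
# Handles the legacy "1.x" scheme (1.8.0_292 -> 8) and the modern
# scheme (11.0.11 -> 11, 17 -> 17). Hypothetical helper for illustration.
java_major() {
  case "$1" in
    1.*) v=${1#1.}; echo "${v%%.*}" ;;   # strip leading "1.", keep next field
    *)   echo "${1%%[.+-]*}" ;;          # keep everything before first . + or -
  esac
}
```

The version string itself can usually be pulled out with `java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}'`, so a check might look like `java_major "$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')"`.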

sorry for the disruption...  there aren't many builds running right now so
i'll just get this sorted.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] half of the jenkins workers are down

2021-08-09 Thread shane knapp
turns out that minikube/k8s and friends were being oom-killed and this was
causing all sorts of weirdnesses.

i've upped the ram limits on all of the k8s jobs to 8G (from 6G), and we'll
keep an eye on things and see how they go.
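
One way to confirm that processes are being oom-killed, as described above, is to count the kernel log lines that report it. This is a sketch under assumptions: the helper name `count_oom_kills` is invented here, and it relies on the kernel log containing the phrase "Killed process", whose surrounding wording varies by kernel version.

```shell
#!/bin/sh
# count_oom_kills: read kernel log text on stdin and print how many lines
# report a process being killed by the OOM killer.
# (hypothetical helper; matches the "Killed process" phrase case-insensitively)
count_oom_kills() {
  # grep -c prints 0 and exits non-zero when nothing matches,
  # so "|| true" keeps the function's exit status clean.
  grep -ci 'killed process' || true
}
```

Typical input would be `dmesg | count_oom_kills` or `journalctl -k | count_oom_kills`; reading the kernel log may require root on some systems.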

On Mon, Aug 9, 2021 at 12:02 PM shane knapp ☠  wrote:

> as workers are continuing to fail, i've stopped jenkins from accepting new
> builds for the time being.
>
> more updates as they come.
>
> On Mon, Aug 9, 2021 at 9:17 AM shane knapp ☠  wrote:
>
>> happy monday!
>>
>> the server gods did not smile upon us this weekend, and 4 of the workers
>> are down.  we'll most likely need to head to our colo some time today and
>> give them an in-person kick and see what's going on.
>>
>> i'll send an update when they're back up.
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] half of the jenkins workers are down

2021-08-09 Thread shane knapp
as workers are continuing to fail, i've stopped jenkins from accepting new
builds for the time being.

more updates as they come.

On Mon, Aug 9, 2021 at 9:17 AM shane knapp ☠  wrote:

> happy monday!
>
> the server gods did not smile upon us this weekend, and 4 of the workers
> are down.  we'll most likely need to head to our colo some time today and
> give them an in-person kick and see what's going on.
>
> i'll send an update when they're back up.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] half of the jenkins workers are down

2021-08-09 Thread shane knapp
happy monday!

the server gods did not smile upon us this weekend, and 4 of the workers
are down.  we'll most likely need to head to our colo some time today and
give them an in-person kick and see what's going on.

i'll send an update when they're back up.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins "freeze" for remainder of 2021

2021-07-28 Thread shane knapp
since we're sunsetting jenkins by the end of 2021, i'd like to institute a
general freeze on package/feature requests.

this includes, but is not limited to things like python packages, new
versions of python, and pretty much anything that requires changes to the
bare-metal systems that run jenkins.

exceptions to this rule include new branches (spark 3.3, i'm looking at
you!), and any major security or critical fixes required for builds.

please let us know if you have any questions!

thanks in advance,

brian & shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: please read: current state and the future of the apache spark build system

2021-07-28 Thread shane knapp
3 months later, i have some updates!

> TLDR1:   we're shutting jenkins down at the end of 2021.

this is still the goal, exact shutdown date TBD.

> long term (until EOY):
> * decide what the future of spark builds and releases will look like
>   - do we need jenkins?
>   - if we do, who's responsible for hosting + ops?

this looks like github actions + some as-of-yet-tbd k8s solution for
integration tests.

> medium term (in 6 months):
> * prepare jenkins worker ansible configs and stick in the spark repo

this is done:
https://github.com/apache/spark/tree/master/dev/ansible-for-test-node

> * train up brian shiratsuki (cced) to help w/ops tasks and upgrades over
> the next ~6m

this is ongoing, and we now have reasonable monitoring!

> * get to all of the python version, library installation, etc etc jira
> requests

i think i've knocked out most of these.

> short term(weeks):
> * bring up additional workers
>   - finish hardware/system level repairs on the bare metal
>   - see above, re k8s jira
> * stabilize cluster
>   - recent jenkins LTS upgrade broke the web GUI
>   - finish deploying monitoring/alerting
>   - this hardware is OLD and literally falling over, so we have lots of
> random disk and ram failures.  it's literally whack-a-mole and each trip to
> the colo to repair literally takes a full day

we're generally doing alright w/all of these:  the hardware has been
pretty stable, the jenkins administrative GUI is still broken (but at least
i can hack the xml on the bare metal), and we've got 8 workers up and
running.

i'll be sending out another email to this list soon regarding the impending
jenkins 'freeze'.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins downtime today

2021-07-22 Thread shane knapp
that actually went much faster than anticipated, and we're already back up
and building!

On Thu, Jul 22, 2021 at 10:24 AM shane knapp ☠  wrote:

> i'll be taking jenkins down for a couple of hours today to reboot/clean up
> the workers and finish up the python package installs covered in
> https://github.com/apache/spark/pull/33469/files
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins downtime today

2021-07-22 Thread shane knapp
i'll be taking jenkins down for a couple of hours today to reboot/clean up
the workers and finish up the python package installs covered in
https://github.com/apache/spark/pull/33469/files

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: quick jenkins restart

2021-07-09 Thread shane knapp
we're back up!

On Fri, Jul 9, 2021 at 10:23 AM shane knapp ☠  wrote:

> the primary is running out of memory pretty quickly, and i'm going to
> reboot the server quickly so that it doesn't crash over the weekend.
>
> we'll investigate a bit more next week.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


quick jenkins restart

2021-07-09 Thread shane knapp
the primary is running out of memory pretty quickly, and i'm going to
reboot the server quickly so that it doesn't crash over the weekend.

we'll investigate a bit more next week.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: How to think about SparkPullRequestBuilder-K8s?

2021-06-11 Thread shane knapp
we're back.

On Fri, Jun 11, 2021 at 2:30 PM shane knapp ☠  wrote:

> btw i just noticed jenkins was down, and i restarted the primary node.
>
> On Fri, Jun 11, 2021 at 12:09 PM Sean Owen  wrote:
>
>> I find that somewhat often, the K8S PR builders will fail on a PR:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/
>>
>> ... when the PR seems totally unrelated to K8S. I've kind of learned to
>> ignore them in that case but that seems wrong. Are they just kind of flaky?
>> am I imagining things? Just trying to figure out how much they're
>> 'accurate' in catching real vs false failures.
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: How to think about SparkPullRequestBuilder-K8s?

2021-06-11 Thread shane knapp
btw i just noticed jenkins was down, and i restarted the primary node.

On Fri, Jun 11, 2021 at 12:09 PM Sean Owen  wrote:

> I find that somewhat often, the K8S PR builders will fail on a PR:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/
>
> ... when the PR seems totally unrelated to K8S. I've kind of learned to
> ignore them in that case but that seems wrong. Are they just kind of flaky?
> am I imagining things? Just trying to figure out how much they're
> 'accurate' in catching real vs false failures.
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins down, working on it

2021-05-04 Thread shane knapp
we're back and building!

On Tue, May 4, 2021 at 4:03 PM shane knapp ☠  wrote:

> jenkins went down some time in the past few days, and i'm currently
> investigating.
>
> if it's been down a while, i apologize as i've been dealing w/some health
> issues.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins down, working on it

2021-05-04 Thread shane knapp
jenkins went down some time in the past few days, and i'm currently
investigating.

if it's been down a while, i apologize as i've been dealing w/some health
issues.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-16 Thread shane knapp
alright, my canary build w/skipping the PV integration test passed w/the
docker driver:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-k8s-clone/20/

i'll put together a PR for this over the weekend (it's a one-liner) and
once we merge i can get the remaining workers upgraded early next week.

On Thu, Apr 15, 2021 at 3:05 PM shane knapp ☠  wrote:

> i'm all for that...  and once they're turned off, we can finish the
> minikube/k8s/move-to-docker project in a couple of hours max.
>
> On Thu, Apr 15, 2021 at 3:00 PM Holden Karau  wrote:
>
>> What about if we just turn off the PV tests for now?
>> I'd be happy to help with the debugging/upgrading.
>>
>> On Thu, Apr 15, 2021 at 2:28 AM Rob Vesse  wrote:
>> >
>> > There’s at least one test (the persistent volumes one) that relies on
>> some Minikube functionality because we run integration tests for our
>> $dayjob Spark image builds using Docker for Desktop instead and that one
>> test fails because it relies on some minikube specific functionality.  That
>> test could be refactored because I think it’s just adding a minimal Ceph
>> cluster to the K8S cluster which can be done to any K8S cluster in principle
>> >
>> >
>> >
>> > Rob
>> >
>> >
>> >
>> > From: shane knapp ☠ 
>> > Date: Wednesday, 14 April 2021 at 18:56
>> > To: Frank Luo 
>> > Cc: dev , Brian K Shiratsuki 
>> > Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests
>> >
>> >
>> >
>> > On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:
>> >
>> > Is there any hard dependency on minikube (e.g., GPU settings)? kind (
>> https://kind.sigs.k8s.io/) is a more stable and simpler k8s cluster env on a
>> single machine (it only requires docker), and it has been widely used for
>> testing by k8s projects.
>> >
>> >
>> >
>> > there are no hard deps on minikube...  it installs happily and
>> successfully runs every integration test except for persistent volumes.
>> >
>> >
>> >
>> > i haven't tried kind yet, but my time is super limited on this and i'd
>> rather not venture down another rabbit hole unless we absolutely have to.
>> >
>> >
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>>
>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-15 Thread shane knapp
i'm all for that...  and once they're turned off, we can finish the
minikube/k8s/move-to-docker project in a couple of hours max.

On Thu, Apr 15, 2021 at 3:00 PM Holden Karau  wrote:

> What about if we just turn off the PV tests for now?
> I'd be happy to help with the debugging/upgrading.
>
> On Thu, Apr 15, 2021 at 2:28 AM Rob Vesse  wrote:
> >
> > There’s at least one test (the persistent volumes one) that relies on
> some Minikube functionality because we run integration tests for our
> $dayjob Spark image builds using Docker for Desktop instead and that one
> test fails because it relies on some minikube specific functionality.  That
> test could be refactored because I think it’s just adding a minimal Ceph
> cluster to the K8S cluster which can be done to any K8S cluster in principle
> >
> >
> >
> > Rob
> >
> >
> >
> > From: shane knapp ☠ 
> > Date: Wednesday, 14 April 2021 at 18:56
> > To: Frank Luo 
> > Cc: dev , Brian K Shiratsuki 
> > Subject: Re: [SPARK-34738] issues w/k8s+minikube and PV tests
> >
> >
> >
> > On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:
> >
> > Is there any hard dependency on minikube (e.g., GPU settings)? kind (
> https://kind.sigs.k8s.io/) is a more stable and simpler k8s cluster env on a
> single machine (it only requires docker), and it has been widely used for
> testing by k8s projects.
> >
> >
> >
> > there are no hard deps on minikube...  it installs happily and
> successfully runs every integration test except for persistent volumes.
> >
> >
> >
> > i haven't tried kind yet, but my time is super limited on this and i'd
> rather not venture down another rabbit hole unless we absolutely have to.
> >
> >
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: please read: current state and the future of the apache spark build system

2021-04-14 Thread shane knapp
> medium term (in 6 months):
> * prepare jenkins worker ansible configs and stick in the spark repo
>   - nothing fancy, but enough to config ubuntu workers
>   - could be used to create docker containers for testing in
> THE CLOUD

fwiw, i just decided to bang this out today:
https://github.com/apache/spark/pull/32178

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-14 Thread shane knapp
On Wed, Apr 14, 2021 at 10:32 AM Frank Luo  wrote:

> Is there any hard dependency on minikube (e.g., GPU settings)? kind (
> https://kind.sigs.k8s.io/) is a more stable and simpler k8s cluster env on a
> single machine (it only requires docker), and it has been widely used for
> testing by k8s projects.

there are no hard deps on minikube...  it installs happily and
successfully runs every integration test except for persistent volumes.

i haven't tried kind yet, but my time is super limited on this and i'd
rather not venture down another rabbit hole unless we absolutely have to.


[SPARK-34738] issues w/k8s+minikube and PV tests

2021-04-14 Thread shane knapp
please see:  https://issues.apache.org/jira/browse/SPARK-34738

i could really use a hand.  all k8s integration tests are currently broken,
and i'd rather spend the time fixing the latest version of minikube, k8s
and the docker virtualization layer than debug the 'old' way which uses the
kvm2/qemu virtualization layer.

thanks in advance,

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-08 Thread shane knapp
On Wed, Apr 7, 2021 at 6:30 AM Hyukjin Kwon  wrote:

> Thanks Martin for your feedback.
>
> > What was your reason to migrate from Apache Jenkins to Github Actions ?
>
> I am sure there were more reasons for migrating from AMPLab Jenkins
> <https://amplab.cs.berkeley.edu/jenkins/> to GitHub Actions but as far as
> I can remember:
> - To reduce the maintenance cost of machines
> - The Jenkins machines became unstable and slow causing CI jobs to fail or
> be very flaky.
> - Difficulty to manage the installed libraries.
> - Intermittent unknown issues in the machines
>
also:

- uc berkeley has been hosting the build system for spark for ~10 years
"free of charge"
- funding for the build system is going away (amplab funded first, riselab
second)
- i have been managing the build system solo for 7 years and my job is much
different now...
- since there are no funds coming from research labs, i am unable to staff
the build system past 2021 (tbh, even this year is a stretch)
- the hardware is far past EOL and literally falling over
- jenkins is, and always will be, a PITA to run

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


please read: current state and the future of the apache spark build system

2021-04-07 Thread shane knapp
this will be a relatively big update, as there are many many moving pieces
with short, medium and long term goals.

TLDR1:   we're shutting jenkins down at the end of 2021.

TLDR2:  i know we're way behind on pretty much everything.  most of the
hardware is at or beyond EOL, and systemic build failures (like
k8s/minikube) are randomly popping up.  i've had to restrict access due to
new campus policies, and i will be dealing with that shortly, though only
for a few contributors.

long term (until EOY):
* decide what the future of spark builds and releases will look like
  - do we need jenkins?
  - if we do, who's responsible for hosting + ops?
* we will permanently shut down amplab jenkins by the end of 2021
  - uc berkeley has funded this for over 10 years, and both the funds and
staff (only me, for 7 years) are going away.  i'm staying at cal, but have
a much different job now.  :)

medium term (in 6 months):
* prepare jenkins worker ansible configs and stick in the spark repo
  - nothing fancy, but enough to config ubuntu workers
  - could be used to create docker containers for testing in
THE CLOUD
* train up brian shiratsuki (cced) to help w/ops tasks and upgrades over
the next ~6m
* get to all of the python version, library installation, etc etc jira
requests

short term(weeks):
* debug and figure out why minikube/k8s broke
  - https://issues.apache.org/jira/browse/SPARK-34738
  - i really could use some help here...
* bring up additional workers
  - finish hardware/system level repairs on the bare metal
  - see above, re k8s jira
* stabilize cluster
  - recent jenkins LTS upgrade broke the web GUI
  - finish deploying monitoring/alerting
  - this hardware is OLD and literally falling over, so we have lots of
random disk and ram failures.  it's literally whack-a-mole and each trip to
the colo to repair literally takes a full day

i'm only able to spend a few hours a week on the build system, so expect
random downtime, reboots, restarts, and testing.  we're testing new nodes
as we deploy, and hoping to fix anything before releasing them into the
wild, but some things might be flaky.

but the biggest question is what you all need w/regards to build
infrastructure...  and who's going to be responsible for it.

thanks for reading!  :)

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] short downtime today, new workers coming soon

2021-03-23 Thread shane knapp
we're back!

On Tue, Mar 23, 2021 at 12:31 PM shane knapp ☠  wrote:

> jenkins is acting up, and i'm going to take the opportunity to reboot the
> primary and all the workers.
>
> sorry for the short notice, but on the bright side we have a bunch of
> shiny new workers coming soon!
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] short downtime today, new workers coming soon

2021-03-23 Thread shane knapp
jenkins is acting up, and i'm going to take the opportunity to reboot the
primary and all the workers.

sorry for the short notice, but on the bright side we have a bunch of shiny
new workers coming soon!

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] github fetches timing out

2021-03-17 Thread shane knapp
it's been happening a lot again recently...  i'm investigating.

On Wed, Mar 10, 2021 at 10:23 AM Liang-Chi Hsieh  wrote:

> Thanks Shane for looking at it!
>
>
> shane knapp ☠ wrote
> > ...and just like that, overnight the builds started successfully git
> > fetching!
> >
> > --
> > Shane Knapp
> > Computer Guy / Voice of Reason
> > UC Berkeley EECS Research / RISELab Staff Technical Lead
> > https://rise.cs.berkeley.edu
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] github fetches timing out

2021-03-10 Thread shane knapp
...and just like that, overnight the builds started successfully git
fetching!

On Tue, Mar 9, 2021 at 12:31 PM shane knapp ☠  wrote:

> it looks like over the past few days the master/branch builds have been
> timing out...  this hasn't happened in a few years, and honestly the last
> time this happened there was nothing that either i or github could do
> about it.  it cleared up after a number of weeks, and we were never able to
> pinpoint the root cause.
>
> we're not hitting a github api ratelimit, and i'm able to successfully run
> the git commands on worker nodes on the command line as the jenkins user.
>
> example:
>
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3-jdk-11/1014/console
>
> i wish i had a more concrete answer or solution for what's going on...
> i'll continue to investigate as best i can today, and if this continues,
> i'll re-open my issue w/github and see if they can shed any light on the
> situation.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] github fetches timing out

2021-03-09 Thread shane knapp
it looks like over the past few days the master/branch builds have been
timing out...  this hasn't happened in a few years, and honestly the last
time this happened there was nothing that either i or github could do
about it.  it cleared up after a number of weeks, and we were never able to
pinpoint the root cause.

we're not hitting a github api ratelimit, and i'm able to successfully run
the git commands on worker nodes on the command line as the jenkins user.

example:
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-3.0-test-maven-hadoop-2.7-hive-2.3-jdk-11/1014/console
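
for anyone who wants to reproduce the manual check described above, here is a minimal sketch in python.  it runs a command with a hard time limit and reports a timeout separately from a non-zero exit.  the helper name run_with_timeout is made up for illustration, and this is not how the jenkins git plugin does its fetches:

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, detail); a timeout is reported, not raised."""
    try:
        # subprocess.run kills the child process if the timeout expires
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False, f"timed out after {timeout_s}s"
    if proc.returncode != 0:
        return False, f"exit code {proc.returncode}"
    return True, "ok"
```

a call such as run_with_timeout(["git", "fetch", "--tags", "origin"], 600) would be one way to use it on a worker; the remote name and the 600 second limit are assumptions, not values taken from the jenkins jobs.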

i wish i had a more concrete answer or solution for what's going on...
i'll continue to investigate as best i can today, and if this continues,
i'll re-open my issue w/github and see if they can shed any light on the
situation.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: minikube and kubernetes cluster versions for integration testing

2021-03-04 Thread shane knapp
fwiw, upgrading minikube and the associated VM drivers is potentially a
PITA.

your PR will absolutely be tested before merging.  :)

On Thu, Mar 4, 2021 at 10:13 AM attilapiros 
wrote:

> Thanks Shane!
>
> I can do the documentation task and the Minikube version check can be
> incorporated into my PR.
> When my PR is finalized (probably next week) I will create a jira for you
> and you can set up the test systems and you can even test my PR before
> merging it. Is this possible / fine for you?
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: minikube and kubernetes cluster versions for integration testing

2021-03-03 Thread shane knapp
... should drop everything under v1.3.0.
>>
>> 2) I would suggest dropping v1.15.12 as a kubernetes
>> version because of this issue
>> <https://github.com/kubernetes/minikube/issues/10663> (I just found it
>> by running my script).
>>
>> 3) On Minikube v1.7.2 there is this permission denied issue
>> <https://github.com/kubernetes/minikube/issues/6583> so I suggest to
>> support Minikube version 1.7.3 and greater.
>>
>> My test script is check_minikube_versions.zsh
>> <https://gist.github.com/attilapiros/8648a782e0b956b59f03f914c88c2df3#file-check_minikube_versions-zsh>.
>> It was executed on Mac, but with a simple sed expression it can be tailored
>> to Linux too.
>>
>>
>>
>> *After all of this my questions:*
>> *A) What about changing the required versions and suggesting
>> kubernetes v1.17.3 and Minikube v1.7.3 or greater for integration testing?*
>>
>> I would choose v1.17.3 for the k8s cluster, as that is the newest k8s
>> version supported by Minikube v1.7.3 (hoping it will be good for us for a long
>> time).
>> If you agree with this suggestion, I will go ahead and update the relevant
>> documentation.
>>
>>
>>
>> *B) How about extending the integration test to check whether the
>> Minikube version is sufficient? *By this we can provide a meaningful
>> error when it is violated.
>>
>> Bests,
>> Attila
>>
>
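
The version check proposed in (B) could look roughly like the sketch below. It is written in Python purely for illustration (the Spark integration tests are not Python), the function names are hypothetical, and it assumes the version appears in the form "v1.7.3" somewhere in the text being checked. The minimum of 1.7.3 is the value suggested in this thread.

```python
import re

# Minimum Minikube version suggested in this thread.
MIN_MINIKUBE = (1, 7, 3)


def parse_version(text):
    """Extract (major, minor, patch) from text such as 'minikube version: v1.7.3'."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(group) for group in match.groups())


def is_supported(text, minimum=MIN_MINIKUBE):
    """Return True if the version found in text is at least the minimum."""
    # Tuples compare numerically, so (1, 10, 0) sorts above (1, 7, 3).
    return parse_version(text) >= minimum
```

Comparing integer tuples rather than strings avoids the usual trap where "1.10.0" sorts below "1.7.3". A test could call is_supported on the version string and fail with a readable message when it returns False.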

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins wedged, going to restart after current builds finish

2021-02-23 Thread shane knapp
this was done about an hour ago...  rebooted several of the workers to
clear out lingering builds, and one worker had an SSD fail on boot and is
currently offline.

shane

On Tue, Feb 23, 2021 at 10:13 AM shane knapp ☠  wrote:

> EOM
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins wedged, going to restart after current builds finish

2021-02-23 Thread shane knapp
EOM

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: K8s integration test failure ("credentials Jenkins is using is probably wrong...")

2021-02-23 Thread shane knapp
stupid bash variable assignment.  i'm surprised this has lingered for as
long as it has (3 years).

it's fixed and shouldn't be an issue any more.

On Tue, Feb 23, 2021 at 9:28 AM shane knapp ☠  wrote:

> the AmplabJenks bot's github creds are out of date, which is causing that
> non-fatal error.  however, if you scroll back you'll see that minikube
> actually failed to start.  that should have definitely failed the build, so
> i'll look at the job's bash logic and see what we missed.
>
> also, that worker (research-jenkins-worker-07) had some lingering builds
> running and i bet there was a collision w/a dangling minikube instance.
> i'm rebooting that worker now.
>
> shane
>
>
>
> On Tue, Feb 23, 2021 at 6:47 AM Sean Owen  wrote:
>
>> Shane would you know? May be a problem with a single worker.
>>
>> On Tue, Feb 23, 2021 at 8:46 AM Phillip Henry 
>> wrote:
>>
>>>
>>> Hi,
>>>
>>> Silly question: the Jenkins build for my PR is failing but it seems
>>> outside of my control. What must I do to remedy this?
>>>
>>> I've submitted
>>>
>>> https://github.com/apache/spark/pull/31535
>>>
>>> but Spark QA is telling me "Kubernetes integration test status failure".
>>>
>>> The Jenkins job says "SUCCESS" but also barfs with:
>>>
>>> FileNotFoundException means that the credentials Jenkins is using is 
>>> probably wrong. Or the user account does not have write access to the repo.
>>>
>>>
>>> See
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/consoleFull
>>>
>>> Can anybody please advise?
>>>
>>> Thanks in advance.
>>>
>>> Phillip
>>>
>>>
>>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: K8s integration test failure ("credentials Jenkins is using is probably wrong...")

2021-02-23 Thread shane knapp
the AmplabJenks bot's github creds are out of date, which is causing that
non-fatal error.  however, if you scroll back you'll see that minikube
actually failed to start.  that should have definitely failed the build, so
i'll look at the job's bash logic and see what we missed.
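
as a rough illustration of the behaviour we want (a required step failing must fail the whole build), here is a minimal python sketch.  it is not the job's actual bash logic, which isn't shown in this thread, and run_required_step is a made-up name:

```python
import subprocess


def run_required_step(name, cmd):
    """Run one build step and raise if it exits non-zero."""
    proc = subprocess.run(cmd)
    if proc.returncode != 0:
        # Surface the failure instead of letting later steps mask it.
        raise RuntimeError(f"step {name!r} failed with exit code {proc.returncode}")
    return proc.returncode
```

the point is only that the exit status of a step like starting minikube gets checked right away, so the job can't go on to report SUCCESS after it failed.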

also, that worker (research-jenkins-worker-07) had some lingering builds
running and i bet there was a collision w/a dangling minikube instance.
i'm rebooting that worker now.

shane



On Tue, Feb 23, 2021 at 6:47 AM Sean Owen  wrote:

> Shane would you know? May be a problem with a single worker.
>
> On Tue, Feb 23, 2021 at 8:46 AM Phillip Henry 
> wrote:
>
>>
>> Hi,
>>
>> Silly question: the Jenkins build for my PR is failing but it seems
>> outside of my control. What must I do to remedy this?
>>
>> I've submitted
>>
>> https://github.com/apache/spark/pull/31535
>>
>> but Spark QA is telling me "Kubernetes integration test status failure".
>>
>> The Jenkins job says "SUCCESS" but also barfs with:
>>
>> FileNotFoundException means that the credentials Jenkins is using is 
>> probably wrong. Or the user account does not have write access to the repo.
>>
>>
>> See
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39934/consoleFull
>>
>> Can anybody please advise?
>>
>> Thanks in advance.
>>
>> Phillip
>>
>>
>>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [FYI] CI Infra issues (in both GitHub Action and Jenkins)

2021-01-08 Thread shane knapp
no, i don't think that'd be a good idea...  adding additional dependencies
to our cluster won't scale one bit.

On Fri, Jan 8, 2021 at 2:16 PM Dongjoon Hyun 
wrote:

> BTW, Shane, do you think we can utilize some of UCB machines as GitHub
> Action runners?
>
> Bests,
> Dongjoon.
>
> On Fri, Jan 8, 2021 at 2:14 PM Dongjoon Hyun 
> wrote:
>
>> The following?
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/1836/console
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/1887/console
>>
>> On Fri, Jan 8, 2021 at 2:13 PM shane knapp ☠  wrote:
>>
>>> 1. Jenkins machines start to fail with the following recently.
>>>> (master branch)
>>>>
>>>> Python versions prior to 3.6 are not supported.
>>>> Build step 'Execute shell' marked build as failure
>>>>
>>>> examples please?
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [FYI] CI Infra issues (in both GitHub Action and Jenkins)

2021-01-08 Thread shane knapp
hmm, the ubuntu16 machines are acting up.  i pinned the sbt master builds
to ubuntu20 and they're happily building while i investigate wtf is up.

On Fri, Jan 8, 2021 at 2:15 PM Dongjoon Hyun 
wrote:

> The following?
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/1836/console
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/1887/console
>
> On Fri, Jan 8, 2021 at 2:13 PM shane knapp ☠  wrote:
>
>> 1. Jenkins machines start to fail with the following recently.
>>> (master branch)
>>>
>>> Python versions prior to 3.6 are not supported.
>>> Build step 'Execute shell' marked build as failure
>>>
>>> examples please?
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [FYI] CI Infra issues (in both GitHub Action and Jenkins)

2021-01-08 Thread shane knapp
>
> 1. Jenkins machines start to fail with the following recently.
> (master branch)
>
> Python versions prior to 3.6 are not supported.
> Build step 'Execute shell' marked build as failure
>
> examples please?

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins downtime 01/02/2021 - 01/03/2021

2020-12-21 Thread shane knapp
the colo facility where jenkins is hosted is going down for roughly a day
for some (more) power upgrades.  once the colo is powered back up, we'll
make sure that all the jenkins workers and primary nodes are up and happily
building.

if anyone notices any issues w/jenkins before, during or after this event,
please send an email to research-supp...@cs.berkeley.edu and we'll get to
it as quickly as we can[1].

wishing everyone here a happy holiday season,

shane

[1] -- these are for issues w/the build system itself, not for things like
package installs and updates.  keep those on the apache spark jira.  :)
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] WE'RE LIVE!

2020-12-04 Thread shane knapp
ok, it's broken on the new nodes, so i tied the project to ubuntu16.  i'll
create a jira and investigate further at a later date.

On Fri, Dec 4, 2020 at 8:58 AM shane knapp ☠  wrote:

> no, it isn't but i'll try and take a look at this later today.
>
> On Fri, Dec 4, 2020 at 7:12 AM Tom Graves  wrote:
>
>> thanks Shane and folks for great work.
>>
>> Not sure if this is at all related but I noticed the spark master deploy
>> job hasn't been running and the last one Dec 2nd failed:
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/3186/
>>
>> Not sure if this is a result of the upgrade?
>>
>> Thanks,
>> Tom
>> On Tuesday, December 1, 2020, 06:55:27 PM CST, shane knapp ☠ <
>> skn...@berkeley.edu> wrote:
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/
>>
>> i cleared the build queue, so you'll need to retrigger your PRs.  there
>> will be occasional downtime over the next few days and weeks as we uncover
>> system-level errors and more reimaging happens...  but for now, we're
>> building.
>>
>> a big thanks goes out to jon for his work on the project!  we couldn't
>> have done it w/o him.
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] WE'RE LIVE!

2020-12-04 Thread shane knapp
no, it isn't but i'll try and take a look at this later today.

On Fri, Dec 4, 2020 at 7:12 AM Tom Graves  wrote:

> thanks Shane and folks for great work.
>
> Not sure if this is at all related but I noticed the spark master deploy
> job hasn't been running and the last one Dec 2nd failed:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/3186/
>
> Not sure if this is a result of the upgrade?
>
> Thanks,
> Tom
> On Tuesday, December 1, 2020, 06:55:27 PM CST, shane knapp ☠ <
> skn...@berkeley.edu> wrote:
>
>
> https://amplab.cs.berkeley.edu/jenkins/
>
> i cleared the build queue, so you'll need to retrigger your PRs.  there
> will be occasional downtime over the next few days and weeks as we uncover
> system-level errors and more reimaging happens...  but for now, we're
> building.
>
> a big thanks goes out to jon for his work on the project!  we couldn't
> have done it w/o him.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] WE'RE LIVE!

2020-12-01 Thread shane knapp
https://amplab.cs.berkeley.edu/jenkins/

i cleared the build queue, so you'll need to retrigger your PRs.  there
will be occasional downtime over the next few days and weeks as we uncover
system-level errors and more reimaging happens...  but for now, we're
building.

a big thanks goes out to jon for his work on the project!  we couldn't have
done it w/o him.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins downtime today/tomorrow

2020-12-01 Thread shane knapp
quick update:  the migration to the new primary node is complete and i can
bring up jenkins and it's queueing builds and looks good to go.

the final bits that need attention are SSL, apache2 and firewall configs,
and i'm hoping to get this sorted ASAP.  once that's done, we'll start
building and move on to fixing any lingering environment/system issues that
pop up.

shane

On Mon, Nov 30, 2020 at 4:01 PM shane knapp ☠  wrote:

> amplab jenkins is down.
>
> On Mon, Nov 30, 2020 at 3:25 PM shane knapp ☠  wrote:
>
>> old jenkins is getting shut down Real Soon Now[tm]!  crossing my
>> fingers!  :)
>>
>> On Mon, Nov 30, 2020 at 10:05 AM shane knapp ☠ 
>> wrote:
>>
>>> hey all!
>>>
>>> the Great Jenkins Migration[tm] is well under way, and we will be
>>> sunsetting the old amp-jenkins-master server and moving to a new one.
>>>
>>> i've put jenkins into quiet mode so that it won't accept new builds and
>>> we'll let the ones currently running finish.  once that's done, i will be
>>> rsyncing the entire jenkins installation to the new server and bringing
>>> that up.  we most definitely will have a bunch of minor bugs to knock out,
>>> but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020).
>>>
>>> thanks for your patience, and i'll be sure to send out updates as they
>>> come.
>>>
>>> shane/brian/jon
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins downtime today/tomorrow

2020-11-30 Thread shane knapp
amplab jenkins is down.

On Mon, Nov 30, 2020 at 3:25 PM shane knapp ☠  wrote:

> old jenkins is getting shut down Real Soon Now[tm]!  crossing my fingers!
> :)
>
> On Mon, Nov 30, 2020 at 10:05 AM shane knapp ☠ 
> wrote:
>
>> hey all!
>>
>> the Great Jenkins Migration[tm] is well under way, and we will be
>> sunsetting the old amp-jenkins-master server and moving to a new one.
>>
>> i've put jenkins into quiet mode so that it won't accept new builds and
>> we'll let the ones currently running finish.  once that's done, i will be
>> rsyncing the entire jenkins installation to the new server and bringing
>> that up.  we most definitely will have a bunch of minor bugs to knock out,
>> but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020).
>>
>> thanks for your patience, and i'll be sure to send out updates as they
>> come.
>>
>> shane/brian/jon
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins downtime today/tomorrow

2020-11-30 Thread shane knapp
old jenkins is getting shut down Real Soon Now[tm]!  crossing my fingers!
:)

On Mon, Nov 30, 2020 at 10:05 AM shane knapp ☠  wrote:

> hey all!
>
> the Great Jenkins Migration[tm] is well under way, and we will be
> sunsetting the old amp-jenkins-master server and moving to a new one.
>
> i've put jenkins into quiet mode so that it won't accept new builds and
> we'll let the ones currently running finish.  once that's done, i will be
> rsyncing the entire jenkins installation to the new server and bringing
> that up.  we most definitely will have a bunch of minor bugs to knock out,
> but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020).
>
> thanks for your patience, and i'll be sure to send out updates as they
> come.
>
> shane/brian/jon
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins downtime today/tomorrow

2020-11-30 Thread shane knapp
hey all!

the Great Jenkins Migration[tm] is well under way, and we will be
sunsetting the old amp-jenkins-master server and moving to a new one.

i've put jenkins into quiet mode so that it won't accept new builds and
we'll let the ones currently running finish.  once that's done, i will be
rsyncing the entire jenkins installation to the new server and bringing
that up.  we most definitely will have a bunch of minor bugs to knock out,
but i'm expecting us to be back up and building by EOD tomorrow (12/1/2020).

thanks for your patience, and i'll be sure to send out updates as they come.

shane/brian/jon
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] IMPORTANT UPDATE

2020-11-25 Thread shane knapp
alright, builds are looking solid except for SBT...  if someone here could
take a look at those failures i'd be most appreciative.

the important ones:  PRB, PRB-K8s, k8s, snapshot and maven builds all
green!

i'm literally gobsmacked by how smoothly this went.  :)

we're all going to enjoy a mellow holiday and i'll check build statuses
every now and then and see if i find anything else like this:
https://issues.apache.org/jira/browse/SPARK-33565

have a great holiday everyone!  we'll start getting the new primary set up
on monday, and hopefully by tuesday be fully up and running.

shane


On Wed, Nov 25, 2020 at 1:35 PM shane knapp ☠  wrote:

> hey all, work is going quite well and smoothly for this project.
>
> today's update:
>
> we will experience significant downtime monday/tuesday as we spin up the
> new primary jenkins node.  until then, we'll be building over the next few
> days so i'll have a chance to better track down and fix any system-level
> build breaks.
>
> but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to
> the pool:  research-jenkins-worker-03, 04 and 06.  -05 is being difficult,
> so i'm going to let it pout in the corner for a while before hitting it
> again w/the ansible cannon.
>
> shane
>
> On Tue, Nov 24, 2020 at 6:08 PM shane knapp ☠  wrote:
>
>> all spark builds have been ported and triggered:
>>
>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>
>> not shown are the regular and k8s PRB, which are also running.
>>
>> i think i've nailed down most of the stupid PATH and JAVA_HOME issues,
>> but i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye
>> on the build history of research-jenkins-worker-01 and -02, as they're
>> running the latest OS + ansible (which will be moved into the spark repo
>> asap).
>>
>> i'm still concerned about sbt failures, which includes the PRB.  we'll
>> see how things go, and just focus on getting things working on ubuntu 20
>> LTS.  if we need to drop the ubuntu 16 workers from the pool temporarily, i
>> would be more than happy to do that.  we'll lose some capacity, but it
>> looks like we have a solid template for getting these suckers redeployed so
>> turn-around should be pretty quick.
>>
>> we also need to dedicate some time to clean up/fix our plugin configs.
>> there's been a lot of change over the past three years and things like PRB
>> triggers seem flaky (it took 28m instead of 5m for this job to trigger:
>> https://github.com/apache/spark/pull/29994)
>>
>> this all being said, i'm really happy w/our progress so far and have
>> started leaning towards 'cautiously optimistic'...  we'll see how things go
>> and recalibrate accordingly.  i'll have a better idea of where we are
>> tomorrow and keep the list updated.
>>
>> and finally:  a HUGE thanks goes out to jon for the work going on at the
>> colo this moment:  rack rearrangement, cleaning up networking, fixing
>> hardware, reimaging and generally kicking ass!
>>
>> have a great holiday!
>>
>> shane
>>
>> On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠ 
>> wrote:
>>
>>> our very first ubuntu-based PRB is running:
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/
>>>
>>> crossing my fingers!  :)
>>>
>>> On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> due to scheduling, upcoming holiday and in-the-colo work requirements,
>>>> all of the centos workers are being wiped NOW.
>>>>
>>>> this is great, as the sooner we can get started on fixing builds the
>>>> better.  i'm not going anywhere over the holiday, so i'll get a good
>>>> head-start on things.
>>>>
>>>> thank you jon!
>>>>
>>>> shane
>>>>
>>>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ 
>>>> wrote:
>>>>
>>>>> this is a lengthy, but important read for everyone here.
>>>>>
>>>>> in the next few days, the remaining centos machines (PRB/SBT workers
>>>>> AND primary) will be reimaged from centos6.9 to ubuntu 20.04LTS.
>>>>>
>>>>> this means three important things on the very near horizon:
>>>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>>>>> 2 -- jenkins itself will be down for a while as we move the jenkins
>>>>> installation to its new home.
>>>>> 3 -- those of you with accounts here will temporarily lose access

Re: [build system] IMPORTANT UPDATE

2020-11-25 Thread shane knapp
hey all, work is going quite well and smoothly for this project.

today's update:

we will experience significant downtime monday/tuesday as we spin up the
new primary jenkins node.  until then, we'll be building over the next few
days so i'll have a chance to better track down and fix any system-level
build breaks.

but most importantly, i just added 3 of the 4 new ubuntu 20.04 workers to
the pool:  research-jenkins-worker-03, 04 and 06.  -05 is being difficult,
so i'm going to let it pout in the corner for a while before hitting it
again w/the ansible cannon.

shane

On Tue, Nov 24, 2020 at 6:08 PM shane knapp ☠  wrote:

> all spark builds have been ported and triggered:
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>
> not shown are the regular and k8s PRB, which are also running.
>
> i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but
> i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye on
> the build history of research-jenkins-worker-01 and -02, as they're running
> the latest OS + ansible (which will be moved into the spark repo asap).
>
> i'm still concerned about sbt failures, which includes the PRB.  we'll see
> how things go, and just focus on getting things working on ubuntu 20 LTS.
> if we need to drop the ubuntu 16 workers from the pool temporarily, i would
> be more than happy to do that.  we'll lose some capacity, but it looks like
> we have a solid template for getting these suckers redeployed so
> turn-around should be pretty quick.
>
> we also need to dedicate some time to clean up/fix our plugin configs.
> there's been a lot of change over the past three years and things like PRB
> triggers seem flaky (it took 28m instead of 5m for this job to trigger:
> https://github.com/apache/spark/pull/29994)
>
> this all being said, i'm really happy w/our progress so far and have
> started leaning towards 'cautiously optimistic'...  we'll see how things go
> and recalibrate accordingly.  i'll have a better idea of where we are
> tomorrow and keep the list updated.
>
> and finally:  a HUGE thanks goes out to jon for the work going on at the
> colo this moment:  rack rearrangement, cleaning up networking, fixing
> hardware, reimaging and generally kicking ass!
>
> have a great holiday!
>
> shane
>
> On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠  wrote:
>
>> our very first ubuntu-based PRB is running:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/
>>
>> crossing my fingers!  :)
>>
>> On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠ 
>> wrote:
>>
>>> due to scheduling, upcoming holiday and in-the-colo work requirements,
>>> all of the centos workers are being wiped NOW.
>>>
>>> this is great, as the sooner we can get started on fixing builds the
>>> better.  i'm not going anywhere over the holiday, so i'll get a good
>>> head-start on things.
>>>
>>> thank you jon!
>>>
>>> shane
>>>
>>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ 
>>> wrote:
>>>
>>>> this is a lengthy, but important read for everyone here.
>>>>
>>>> in the next few days, the remaining centos machines (PRB/SBT workers
>>>> AND primary) will be reimaged from centos6.9 to ubuntu 20.04LTS.
>>>>
>>>> this means three important things on the very near horizon:
>>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>>>> 2 -- jenkins itself will be down for a while as we move the jenkins
>>>> installation to its new home.
>>>> 3 -- those of you with accounts here will temporarily lose access
>>>>
>>>> regarding (1), brian (cced) will be helping me debug and fix any
>>>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>>>> doing the reimaging and cobbling together of hardware to keep us on our
>>>> feet.  their help is going to be invaluable to getting us back on the
>>>> ground.
>>>>
>>>> we already have two ubuntu 20 workers up and building
>>>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>>>> is already green.  i'll keep an eye on these workers to ensure i didn't
>>>> miss anything.
>>>>
>>>> once we have a couple more ubuntu 20 machines up, i'll move the PRB
>>>> and SBT builds there and let them fail as often as possible so we can use
>>>> the build logs during the migration of the primary.
>>>>
>>>> then we shut down jenkins and move to the new

Re: [build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp
all spark builds have been ported and triggered:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/

not shown are the regular and k8s PRB, which are also running.

i think i've nailed down most of the stupid PATH and JAVA_HOME issues, but
i'm sure we'll have some stuff to work out.  i'm mostly keeping an eye on
the build history of research-jenkins-worker-01 and -02, as they're running
the latest OS + ansible (which will be moved into the spark repo asap).

i'm still concerned about sbt failures, which includes the PRB.  we'll see
how things go, and just focus on getting things working on ubuntu 20 LTS.
if we need to drop the ubuntu 16 workers from the pool temporarily, i would
be more than happy to do that.  we'll lose some capacity, but it looks like
we have a solid template for getting these suckers redeployed so
turn-around should be pretty quick.

we also need to dedicate some time to clean up/fix our plugin configs.
there's been a lot of change over the past three years and things like PRB
triggers seem flaky (it took 28m instead of 5m for this job to trigger:
https://github.com/apache/spark/pull/29994)

this all being said, i'm really happy w/our progress so far and have
started leaning towards 'cautiously optimistic'...  we'll see how things go
and recalibrate accordingly.  i'll have a better idea of where we are
tomorrow and keep the list updated.

and finally:  a HUGE thanks goes out to jon for the work going on at the
colo this moment:  rack rearrangement, cleaning up networking, fixing
hardware, reimaging and generally kicking ass!

have a great holiday!

shane

On Tue, Nov 24, 2020 at 2:24 PM shane knapp ☠  wrote:

> our very first ubuntu-based PRB is running:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/
>
> crossing my fingers!  :)
>
> On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠  wrote:
>
>> due to scheduling, upcoming holiday and in-the-colo work requirements,
>> all of the centos workers are being wiped NOW.
>>
>> this is great, as the sooner we can get started on fixing builds the
>> better.  i'm not going anywhere over the holiday, so i'll get a good
>> head-start on things.
>>
>> thank you jon!
>>
>> shane
>>
>> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ 
>> wrote:
>>
>>> this is a lengthy, but important read for everyone here.
>>>
>>> in the next few days, the remaining centos machines (PRB/SBT workers AND
>>> primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.
>>>
>>> this means three important things on the very near horizon:
>>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>>> 2 -- jenkins itself will be down for a while as we move the jenkins
>>> installation to its new home.
>>> 3 -- those of you with accounts here will temporarily lose access
>>>
>>> regarding (1), brian (cced) will be helping me debug and fix any
>>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>>> doing the reimaging and cobbling together of hardware to keep us on our
>>> feet.  their help is going to be invaluable to getting us back on the
>>> ground.
>>>
>>> we already have two ubuntu 20 workers up and building
>>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>>> is already green.  i'll keep an eye on these workers to ensure i didn't
>>> miss anything.
>>>
>>> once we have a couple of more ubuntu 20 machines up, i'll move the PRB
>>> and SBT builds there and let them fail as often as possible so we can use
>>> the build logs during the migration of the primary.
>>>
>>> then we shut down jenkins and move to the new primary.
>>>
>>> this will all be happening in the next week to week-and-a-half.
>>>
>>> nearish on the horizon, we need to do three things:
>>> 1 -- reimage the ubuntu 16 workers
>>> 2 -- clean up all of the breakages within the jenkins plugin universe.
>>> there's a lot of stacktraces everywhere after the upgrade, but things are
>>> still building so i'm inclined to push this out.
>>> 3 -- fix the PRB/SBT builds.
>>>
>>> further off, once we're stable, we (the spark community) will need to
>>> have an honest conversation about where the build system lives.  we don't
>>> currently have enough resources here to manage the system in a way that it
>>> deserves, and i can't foresee getting the staffing for long-term support any
>>> time soon.
>>>
>>> however, with the ansible configs (which i plan on moving to the spark
>>> repo), it should be m

Re: [build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp
our very first ubuntu-based PRB is running:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131701/

crossing my fingers!  :)

On Tue, Nov 24, 2020 at 1:30 PM shane knapp ☠  wrote:

> due to scheduling, upcoming holiday and in-the-colo work requirements, all
> of the centos workers are being wiped NOW.
>
> this is great, as the sooner we can get started on fixing builds the
> better.  i'm not going anywhere over the holiday, so i'll get a good
> head-start on things.
>
> thank you jon!
>
> shane
>
> On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠ 
> wrote:
>
>> this is a lengthy, but important read for everyone here.
>>
>> in the next few days, the remaining centos machines (PRB/SBT workers AND
>> primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.
>>
>> this means three important things on the very near horizon:
>> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
>> 2 -- jenkins itself will be down for a while as we move the jenkins
>> installation to its new home.
>> 3 -- those of you with accounts here will temporarily lose access
>>
>> regarding (1), brian (cced) will be helping me debug and fix any
>> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
>> doing the reimaging and cobbling together of hardware to keep us on our
>> feet.  their help is going to be invaluable to getting us back on the
>> ground.
>>
>> we already have two ubuntu 20 workers up and building
>> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
>> is already green.  i'll keep an eye on these workers to ensure i didn't
>> miss anything.
>>
>> once we have a couple of more ubuntu 20 machines up, i'll move the PRB
>> and SBT builds there and let them fail as often as possible so we can use
>> the build logs during the migration of the primary.
>>
>> then we shut down jenkins and move to the new primary.
>>
>> this will all be happening in the next week to week-and-a-half.
>>
>> nearish on the horizon, we need to do three things:
>> 1 -- reimage the ubuntu 16 workers
>> 2 -- clean up all of the breakages within the jenkins plugin universe.
>> there's a lot of stacktraces everywhere after the upgrade, but things are
>> still building so i'm inclined to push this out.
>> 3 -- fix the PRB/SBT builds.
>>
>> further off, once we're stable, we (the spark community) will need to
>> have an honest conversation about where the build system lives.  we don't
>> currently have enough resources here to manage the system in a way that it
>> deserves, and i can't foresee getting the staffing for long-term support any
>> time soon.
>>
>> however, with the ansible configs (which i plan on moving to the spark
>> repo), it should be much easier to replicate the build system.
>>
>> by this time next year, i would like to have helped find the build system
>> a new home, and sunset jenkins.  over the past 11 years (i think), this
>> system has built spark.  it's getting a little tired and needs a well
>> deserved break.  :)
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp
due to scheduling, upcoming holiday and in-the-colo work requirements, all
of the centos workers are being wiped NOW.

this is great, as the sooner we can get started on fixing builds the
better.  i'm not going anywhere over the holiday, so i'll get a good
head-start on things.

thank you jon!

shane

On Tue, Nov 24, 2020 at 11:24 AM shane knapp ☠  wrote:

> this is a lengthy, but important read for everyone here.
>
> in the next few days, the remaining centos machines (PRB/SBT workers AND
> primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.
>
> this means three important things on the very near horizon:
> 1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
> 2 -- jenkins itself will be down for a while as we move the jenkins
> installation to its new home.
> 3 -- those of you with accounts here will temporarily lose access
>
> regarding (1), brian (cced) will be helping me debug and fix any
> system-level bugs (python envs, missing packages, etc).  jon (cced) will be
> doing the reimaging and cobbling together of hardware to keep us on our
> feet.  their help is going to be invaluable to getting us back on the
> ground.
>
> we already have two ubuntu 20 workers up and building
> (research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
> is already green.  i'll keep an eye on these workers to ensure i didn't
> miss anything.
>
> once we have a couple of more ubuntu 20 machines up, i'll move the PRB and
> SBT builds there and let them fail as often as possible so we can use the
> build logs during the migration of the primary.
>
> then we shut down jenkins and move to the new primary.
>
> this will all be happening in the next week to week-and-a-half.
>
> nearish on the horizon, we need to do three things:
> 1 -- reimage the ubuntu 16 workers
> 2 -- clean up all of the breakages within the jenkins plugin universe.
> there's a lot of stacktraces everywhere after the upgrade, but things are
> still building so i'm inclined to push this out.
> 3 -- fix the PRB/SBT builds.
>
> further off, once we're stable, we (the spark community) will need to have
> an honest conversation about where the build system lives.  we don't
> currently have enough resources here to manage the system in a way that it
> deserves, and i can't foresee getting the staffing for long-term support any
> time soon.
>
> however, with the ansible configs (which i plan on moving to the spark
> repo), it should be much easier to replicate the build system.
>
> by this time next year, i would like to have helped find the build system
> a new home, and sunset jenkins.  over the past 11 years (i think), this
> system has built spark.  it's getting a little tired and needs a well
> deserved break.  :)
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-24 Thread shane knapp
i just added it to the PRB config.
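
for context, the setting in question is the `export LANG=en_US.UTF-8` line from yuming's mail below.  as a rough sketch (the helper name utf8_locale_problems is hypothetical, and this is not what actually runs in the PRB), a build step could verify that the effective locale is UTF-8 like this, following the usual POSIX precedence of LC_ALL over LC_CTYPE over LANG:

```python
import os


def utf8_locale_problems(env):
    """Report why the effective character locale in `env` may not be UTF-8.

    Hypothetical helper for illustration. Locale names are expected in the
    form language[_territory][.codeset][@modifier], e.g. en_US.UTF-8.
    """
    # POSIX precedence for the character encoding: LC_ALL, then LC_CTYPE, then LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name, "")
        if value:
            codeset = value.partition(".")[2].partition("@")[0]
            if codeset.lower().replace("-", "") == "utf8":
                return []
            return [f"{name}={value} does not select a UTF-8 codeset"]
    return ["none of LC_ALL, LC_CTYPE or LANG is set"]


if __name__ == "__main__":
    found = utf8_locale_problems(dict(os.environ))
    for line in found or ["locale is UTF-8"]:
        print(line)
```

note that a stray LC_ALL=C on a worker would override LANG, which is why the sketch checks all three variables rather than LANG alone.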

On Tue, Nov 24, 2020 at 2:12 AM Yuming Wang  wrote:

> Hi Shane,
>
> Did you set: export LANG=en_US.UTF-8? Some tests seem to have failed because
> of this issue:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/131631/testReport/
>
> Please see https://issues.apache.org/jira/browse/SPARK-27177 for more
> details.
>
> On Tue, Nov 24, 2020 at 8:23 AM shane knapp ☠  wrote:
>
>> it seems that the plugin upgrade went as smoothly as it could have...  i
>> still have a bunch of stack traces to filter through and see if anything is
>> really broken but it's looking pretty good and things are building.
>>
>> if you see any bad behavior from jenkins, don't hesitate to file a jira
>> and ping me here.
>>
>> also, my backlog of things i need to install will be addressed this
>> week.  the ansible is coming along nicely!
>>
>> On Mon, Nov 23, 2020 at 2:11 PM shane knapp ☠ 
>> wrote:
>>
>>> the third most terrifying event in the world, a massive jenkins plugin
>>> update, is happening in a couple of hours.  i'm going to restart jenkins and
>>> start working out any bugs/issues that pop up.
>>>
>>> this could be short, or quite long.  i'm guessing somewhere in the
>>> middle.  no new builds will be kicked off starting now.
>>>
>>> in parallel, i'm about to start porting my ansible to ubuntu 20 and
>>> testing that on two freshly reinstalled workers.  the ultimate goal is to
>>> get the PRB running on ubuntu 20...   the sbt tests will also likely be
>>> broken as i've never been able to get them to work on ubuntu 16, 18 or 20.
>>>
>>> shane
>>>
>>> On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> somehow that went pretty smoothly, tho i've got a bunch of plugins to
>>>> deal with...  we're back up and building w/a shiny new UI.  :)
>>>>
>>>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠ 
>>>> wrote:
>>>>
>>>>> this is starting now
>>>>>
>>>>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠ 
>>>>> wrote:
>>>>>
>>>>>> i'm going to be upgrading jenkins to something more reasonable, and
>>>>>> there will definitely be some downtime as i get things sorted.
>>>>>>
>>>>>> we should be back up and building by monday.
>>>>>>
>>>>>> shane
>>>>>> --
>>>>>> Shane Knapp
>>>>>> Computer Guy / Voice of Reason
>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>> https://rise.cs.berkeley.edu
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> Computer Guy / Voice of Reason
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] IMPORTANT UPDATE

2020-11-24 Thread shane knapp
this is a lengthy, but important read for everyone here.

in the next few days, the remaining centos machines (PRB/SBT workers AND
primary) will be reimaged from centos 6.9 to ubuntu 20.04 LTS.

this means three important things on the very near horizon:
1 -- the PRB and SBT tests WILL BE BROKEN (by thanksgiving)
2 -- jenkins itself will be down for a while as we move the jenkins
installation to its new home.
3 -- those of you with accounts here will temporarily lose access

regarding (1), brian (cced) will be helping me debug and fix any
system-level bugs (python envs, missing packages, etc).  jon (cced) will be
doing the reimaging and cobbling together of hardware to keep us on our
feet.  their help is going to be invaluable to getting us back on the
ground.

we already have two ubuntu 20 workers up and building
(research-jenkins-worker-0[1,2]), and the SparkPullRequestBuilder-K8s build
is already green.  i'll keep an eye on these workers to ensure i didn't
miss anything.

once we have a couple of more ubuntu 20 machines up, i'll move the PRB and
SBT builds there and let them fail as often as possible so we can use the
build logs during the migration of the primary.

then we shut down jenkins and move to the new primary.

this will all be happening in the next week to week-and-a-half.

nearish on the horizon, we need to do three things:
1 -- reimage the ubuntu 16 workers
2 -- clean up all of the breakages within the jenkins plugin universe.
there's a lot of stacktraces everywhere after the upgrade, but things are
still building so i'm inclined to push this out.
3 -- fix the PRB/SBT builds.

further off, once we're stable, we (the spark community) will need to have
an honest conversation about where the build system lives.  we don't
currently have enough resources here to manage the system in a way that it
deserves, and i can't foresee getting the staffing for long-term support any
time soon.

however, with the ansible configs (which i plan on moving to the spark
repo), it should be much easier to replicate the build system.

by this time next year, i would like to have helped find the build system a
new home, and sunset jenkins.  over the past 11 years (i think), this
system has built spark.  it's getting a little tired and needs a well
deserved break.  :)

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-23 Thread shane knapp
it seems that the plugin upgrade went as smoothly as it could have...  i
still have a bunch of stack traces to filter through and see if anything is
really broken but it's looking pretty good and things are building.

if you see any bad behavior from jenkins, don't hesitate to file a jira and
ping me here.

also, my backlog of things i need to install will be addressed this week.
the ansible is coming along nicely!

On Mon, Nov 23, 2020 at 2:11 PM shane knapp ☠  wrote:

> the third most terrifying event in the world, a massive jenkins plugin
> update, is happening in a couple of hours.  i'm going to restart jenkins and
> start working out any bugs/issues that pop up.
>
> this could be short, or quite long.  i'm guessing somewhere in the
> middle.  no new builds will be kicked off starting now.
>
> in parallel, i'm about to start porting my ansible to ubuntu 20 and
> testing that on two freshly reinstalled workers.  the ultimate goal is to
> get the PRB running on ubuntu 20...   the sbt tests will also likely be
> broken as i've never been able to get them to work on ubuntu 16, 18 or 20.
>
> shane
>
> On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠  wrote:
>
>> somehow that went pretty smoothly, tho i've got a bunch of plugins to
>> deal with...  we're back up and building w/a shiny new UI.  :)
>>
>> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠ 
>> wrote:
>>
>>> this is starting now
>>>
>>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> i'm going to be upgrading jenkins to something more reasonable, and
>>>> there will definitely be some downtime as i get things sorted.
>>>>
>>>> we should be back up and building by monday.
>>>>
>>>> shane
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-23 Thread shane knapp
the third most terrifying event in the world, a massive jenkins plugin
update, is happening in a couple of hours.  i'm going to restart jenkins and
start working out any bugs/issues that pop up.

this could be short, or quite long.  i'm guessing somewhere in the middle.
no new builds will be kicked off starting now.

in parallel, i'm about to start porting my ansible to ubuntu 20 and testing
that on two freshly reinstalled workers.  the ultimate goal is to get the
PRB running on ubuntu 20...   the sbt tests will also likely be broken as
i've never been able to get them to work on ubuntu 16, 18 or 20.

shane

On Sat, Nov 21, 2020 at 4:23 PM shane knapp ☠  wrote:

> somehow that went pretty smoothly, tho i've got a bunch of plugins to deal
> with...  we're back up and building w/a shiny new UI.  :)
>
> On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠  wrote:
>
>> this is starting now
>>
>> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠ 
>> wrote:
>>
>>> i'm going to be upgrading jenkins to something more reasonable, and
>>> there will definitely be some downtime as i get things sorted.
>>>
>>> we should be back up and building by monday.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-21 Thread shane knapp
somehow that went pretty smoothly, tho i've got a bunch of plugins to deal
with...  we're back up and building w/a shiny new UI.  :)

On Sat, Nov 21, 2020 at 3:52 PM shane knapp ☠  wrote:

> this is starting now
>
> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠  wrote:
>
>> i'm going to be upgrading jenkins to something more reasonable, and there
>> will definitely be some downtime as i get things sorted.
>>
>> we should be back up and building by monday.
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: jenkins downtime tomorrow evening/weekend

2020-11-21 Thread shane knapp
this is starting now

On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠  wrote:

> i'm going to be upgrading jenkins to something more reasonable, and there
> will definitely be some downtime as i get things sorted.
>
> we should be back up and building by monday.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


jenkins downtime tomorrow evening/weekend

2020-11-19 Thread shane knapp
i'm going to be upgrading jenkins to something more reasonable, and there
will definitely be some downtime as i get things sorted.

we should be back up and building by monday.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] IMPORTANT: builds will be impacted this month

2020-11-02 Thread shane knapp
TL;DR:  our build system is ancient, EOLed and about to get hit hard w/a
secops hammer.  we need to literally reinstall the entire cluster from
scratch and get things working.

here are the high level bullet points about what's coming up in the next
month:

** all amp-jenkins-worker-* nodes are running centos 6, and the remainder
ubuntu 16.  these will be upgraded to ubuntu 20.

i will be doing this in stages so as to minimize downtime.

** ALL BUILDS NEED TO BE PORTED TO UBUNTU 20.  i can ensure that the
environments on the nodes are identical, but i have not yet been able to
successfully build any SBT jobs on any version of ubuntu, and the MVN
builds won't run on ubuntu 18 (tho they work fine on 16).  i also have had
difficulty getting the PRB job to successfully finish on ubuntu.

for this, i will definitely need help from the dev community to get things
working...  and the speed at which things are fixed will be directly
proportional to how much help i get.  :)

** amplab jenkins primary node will need two major upgrades:  OS from
centos 6 to ubuntu 20, and jenkins from 1.6 to 2.X LTS...

i'm most concerned about this, as it is literally the exact same jenkins
installation that patrick wendell set up over 10 years ago.  there are many
publish secrets that are entered in to the jenkins config and i'd really
hope that we don't lose them.

my plan here is to upgrade the current jenkins, and fix any things that
break.  then we'll rsync jenkins' homedir to the new primary node and hope
that works.  :)
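
as a rough illustration of the rsync step, the sketch below only builds the command line and prints it; it does not run anything.  the paths and hostname are placeholders (i'm assuming a jenkins home of /var/lib/jenkins and a made-up host name), the helper rsync_command is hypothetical, and the flags are standard rsync options rather than a record of what will actually be used.

```python
import shlex


def rsync_command(src_home, dest_host, dest_home, dry_run=True):
    """Build (but do not run) an rsync argv that mirrors a directory to another host.

    Hypothetical helper for illustration only.
    """
    # --archive keeps permissions, ownership, times and symlinks;
    # --hard-links and --numeric-ids avoid surprises when uids differ between hosts.
    argv = ["rsync", "--archive", "--hard-links", "--numeric-ids"]
    if dry_run:
        # --dry-run lists what would be transferred without copying anything.
        argv.append("--dry-run")
    # Trailing slashes: copy the contents of src_home into dest_home.
    argv.append(src_home.rstrip("/") + "/")
    argv.append(f"{dest_host}:{dest_home.rstrip('/')}/")
    return argv


if __name__ == "__main__":
    # Placeholder values; substitute the real jenkins home and new primary.
    cmd = rsync_command("/var/lib/jenkins", "new-primary.example.org", "/var/lib/jenkins")
    print(shlex.join(cmd))
```

defaulting to a dry run is deliberate: with a ten-year-old home directory full of secrets, it seems worth reading the transfer list before copying for real.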

** user audits

UC berkeley's new security standards require quarterly audits of
non-affiliated accounts...  this will impact only a few people on this
list, but i'll need to work w/campus and our department on solutions for
this other than local accounts on the servers.

a LOT is going to happen, and i'm meeting w/my team today and will come up
w/a basic plan.  we will definitely experience downtime during this, but i
cannot guess as to what that will look like.

this might also be a good time to talk about the future of the build
system, auditing our builds (do we need SBT?), or even finally getting
around to dockerizing everything so i don't need such a fragile and
non-atomic set of worker nodes specifically for spark.

thoughts?  comments?

shane

ps -- this is one of the reasons why i haven't been around much lately...
it's been really tough keeping things up to date while trying to remotely
train up one of my sysadmins to take over some of my build system duties.
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins wedged again

2020-10-14 Thread shane knapp
everything's up and jenkins is slowly chewing through the queue!  :)

On Wed, Oct 14, 2020 at 12:00 PM Xiao Li  wrote:

> Thank you, Shane!
>
> Xiao
>
> On Wed, Oct 14, 2020 at 12:00 PM shane knapp ☠ 
> wrote:
>
>> we're mostly back up, and just waiting for a couple of ubuntu boxes to
>> finish booting...  prb seems to be building now!
>>
>> On Wed, Oct 14, 2020 at 11:48 AM shane knapp ☠ 
>> wrote:
>>
>>> i'm going to reboot the primary and worker nodes, so it'll be a few
>>> minutes before everything is back up.
>>>
>>> shane
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] jenkins wedged again

2020-10-14 Thread shane knapp
we're mostly back up, and just waiting for a couple of ubuntu boxes to
finish booting...  prb seems to be building now!

On Wed, Oct 14, 2020 at 11:48 AM shane knapp ☠  wrote:

> i'm going to reboot the primary and worker nodes, so it'll be a few
> minutes before everything is back up.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] jenkins wedged again

2020-10-14 Thread shane knapp
i'm going to reboot the primary and worker nodes, so it'll be a few minutes
before everything is back up.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Running K8s integration tests for changes in core?

2020-09-24 Thread shane knapp
just revisiting this thread...

re presubmit strategy:  i don't think this would be easy to set up...  and
i'm not sure what benefit it will give us.

re inadvertent errors:  since we're checking out the same hash from the PR
for both builds, and they'll run simultaneously, i don't think it'll be an
issue.

re overloading the workers:  nah.  the regular PRB takes ~4hr, and the k8s
PRB takes ~30m and runs in parallel.
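
to put rough numbers on that, here is a back-of-the-envelope sketch.  it assumes the two builds land on separate executors and don't slow each other down, and it uses the approximate durations above (4 hours and 30 minutes); the function name is made up for illustration.

```python
def wall_clock_minutes(durations, parallel):
    """Minutes until every build has finished.

    In parallel the slowest build dominates; in series the durations add up.
    """
    durations = list(durations)
    return max(durations) if parallel else sum(durations)


# Approximate durations quoted above, in minutes.
builds = {"regular PRB": 4 * 60, "k8s PRB": 30}

in_parallel = wall_clock_minutes(builds.values(), parallel=True)
in_series = wall_clock_minutes(builds.values(), parallel=False)

print(f"parallel: {in_parallel} min, serial: {in_series} min")
```

under those assumptions the k8s PRB adds nothing to the time a PR waits for results, since it finishes long before the regular PRB does.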

i'll set this up right now and keep an eye on the queue/build results today.

shane

On Thu, Aug 20, 2020 at 2:28 PM Holden Karau  wrote:

> Sounds good, thanks for the heads up. I hope you get some time to relax :)
>
> On Thu, Aug 20, 2020 at 2:26 PM shane knapp ☠  wrote:
>
>> fyi, i won't be making this change until the 1st week of september.  i'll
>> be out, off the grid all next week!  :)
>>
>> i will send an announcement out tomorrow on how to contact my team here @
>> uc berkeley if jenkins goes down.
>>
>> shane
>>
>> On Thu, Aug 20, 2020 at 4:40 AM Prashant Sharma 
>> wrote:
>>
>>> Another option is to have something like a "presubmit" PR build.
>>> In other words, running the entire 4 H + K8s integration on each commit
>>> pushed is too much at the same time and there are chances that one thing
>>> can inadvertently affect other components (as you just said).
>>>
>>> A presubmit build (which includes K8s integration tests) will be run
>>> once the PR receives LGTM from "Approved reviewers". This is one criterion
>>> that comes to my mind; others may have better suggestions.
>>>
>>> On Thu, Aug 20, 2020 at 12:25 AM shane knapp ☠ 
>>> wrote:
>>>
>>>> we'll be gated by the number of ubuntu workers w/minikube and docker,
>>>> but it shouldn't be too bad as the full integration test takes ~45m, vs 4+
>>>> hrs for the regular PRB.
>>>>
>>>> i can enable this in about 1m of time if the consensus is for us to
>>>> want this.
>>>>
>>>> On Wed, Aug 19, 2020 at 11:37 AM Holden Karau 
>>>> wrote:
>>>>
>>>>> Sounds good. In the meantime would folks committing things in core run
>>>>> the K8s PRB or run it locally? A second change this morning was committed
>>>>> that broke the K8s PR tests.
>>>>>
>>>>> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma 
>>>>> wrote:
>>>>>
>>>>>> +1, we should enable.
>>>>>>
>>>>>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau 
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Dev Folks,
>>>>>>>
>>>>>>> I was wondering how people feel about enabling the K8s PRB
>>>>>>> automatically for all core changes? Sometimes I forget that a change
>>>>>>> might impact one of the K8s integration tests since a bunch of them look
>>>>>>> at log messages. Would folks be OK with turning on the K8s integration
>>>>>>> PRB for all core changes as well as K8s changes?
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>> Holden :)
>>>>>>>
>>>>>>> --
>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [build system] downtime due to SSL cert errors

2020-09-24 Thread shane knapp
certs delivered and installed...  we're back!

On Wed, Sep 23, 2020 at 6:07 PM shane knapp ☠  wrote:

> jenkins is up and building, but not reachable via https at the moment.
> i'm working on getting this sorted ASAP.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] downtime due to SSL cert errors

2020-09-23 Thread shane knapp
jenkins is up and building, but not reachable via https at the moment.  i'm
working on getting this sorted ASAP.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] shane out all next week (aug 22-29), support instructions

2020-08-20 Thread shane knapp
i will be disappearing off in to the wilderness for a few days of
backpacking, and am handing off basic support duties to my team.

if, and only if, jenkins goes down, please email
research-supp...@cs.berkeley.edu and open a ticket.

if you open a ticket, please let dev@ know to minimize the number of
tickets opened.  :)

if there are any other problems, file a JIRA and assign to me.  i will look
at it in early september.

shane
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Running K8s integration tests for changes in core?

2020-08-20 Thread shane knapp
fyi, i won't be making this change until the 1st week of september.  i'll
be out, off the grid all next week!  :)

i will send an announcement out tomorrow on how to contact my team here @
uc berkeley if jenkins goes down.

shane

On Thu, Aug 20, 2020 at 4:40 AM Prashant Sharma 
wrote:

> Another option is, if we could have something like "presubmit" PR build.
> In other words, running the entire 4 H + K8s integration on each commit
> pushed is too much at the same time and there are chances that one thing
> can inadvertently affect other components(as you just said).
>
> A presubmit(which includes K8s integration tests) build will be run, once
> the PR receives LGTM from "Approved reviewers". This is one criterion that
> comes to my mind, others may have better suggestions.
>
> On Thu, Aug 20, 2020 at 12:25 AM shane knapp ☠ 
> wrote:
>
>> we'll be gated by the number of ubuntu workers w/minikube and docker, but
>> it shouldn't be too bad as the full integration test takes ~45m, vs 4+ hrs
>> for the regular PRB.
>>
>> i can enable this in about 1m of time if the consensus is for us to want
>> this.
>>
>> On Wed, Aug 19, 2020 at 11:37 AM Holden Karau 
>> wrote:
>>
>>> Sounds good. In the meantime would folks committing things in core run
>>> the K8s PRB or run it locally? A second change this morning was committed
>>> that broke the K8s PR tests.
>>>
>>> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma 
>>> wrote:
>>>
>>>> +1, we should enable.
>>>>
>>>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau 
>>>> wrote:
>>>>
>>>>> Hi Dev Folks,
>>>>>
>>>>> I was wondering how people feel about enabling the K8s PRB
>>>>> automatically for all core changes? Sometimes I forget that a change might
>>>>> impact one of the K8s integration tests since a bunch of them look at log
>>>>> messages. Would folks be OK with turning on the K8s integration PRB for all
>>>>> core changes as well as K8s changes?
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Running K8s integration tests for changes in core?

2020-08-19 Thread shane knapp
we'll be gated by the number of ubuntu workers w/minikube and docker, but
it shouldn't be too bad as the full integration test takes ~45m, vs 4+ hrs
for the regular PRB.

i can enable this in about 1m of time if the consensus is for us to want
this.

On Wed, Aug 19, 2020 at 11:37 AM Holden Karau  wrote:

> Sounds good. In the meantime would folks committing things in core run the
> K8s PRB or run it locally? A second change this morning was committed that
> broke the K8s PR tests.
>
> On Tue, Aug 18, 2020 at 9:53 PM Prashant Sharma 
> wrote:
>
>> +1, we should enable.
>>
>> On Wed, Aug 19, 2020 at 9:18 AM Holden Karau 
>> wrote:
>>
>>> Hi Dev Folks,
>>>
>>> I was wondering how people feel about enabling the K8s PRB automatically
>>> for all core changes? Sometimes I forget that a change might impact one of
>>> the K8s integration tests since a bunch of them look at log messages. Would
>>> folks be OK with turning on the K8s integration PRB for all core changes as
>>> well as K8s changes?
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Running K8s integration tests for changes in core?

2020-08-18 Thread shane knapp
yes, i think this is fine.  the k8s prb runs concurrently to the regular
prb and takes ~20m.

On Tue, Aug 18, 2020 at 8:47 PM Holden Karau  wrote:

> Hi Dev Folks,
>
> I was wondering how people feel about enabling the K8s PRB automatically
> for all core changes? Sometimes I forget that a change might impact one of
> the K8s integration tests since a bunch of them look at log messages. Would
> folks be OK with turning on the K8s integration PRB for all core changes as
> well as K8s changes?
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


[build system] restarting jenkins now

2020-08-14 Thread shane knapp
there isn't much activity right now, and i'd like to restart jenkins
quickly as it's consuming a lot of memory on the head node.

shouldn't be more than a couple of minutes downtime...  if something goes
awry i'll send an email here.  if you don't hear from me again, please
carry on.  :)
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: R installation broken on ubuntu workers, impacts K8s PRB builds

2020-07-17 Thread shane knapp
this is done, except for amp-jenkins-staging-worker-02 which is refusing to
allow me to reinstall R...  i marked that worker offline and will beat on
it later today.

On Fri, Jul 17, 2020 at 11:36 AM shane knapp ☠  wrote:

> starting now...  pausing jenkins so no new builds are launched.
>
> On Thu, Jul 16, 2020 at 3:09 PM Holden Karau  wrote:
>
>> Sounds good, thanks. No rush :)
>>
>> On Thu, Jul 16, 2020 at 3:03 PM shane knapp ☠ 
>> wrote:
>>
>>> i'll get to this tomorrow afternoon, and there will be a short
>>> downtime.  more details to come.
>>>
>>> On Wed, Jul 15, 2020 at 12:17 PM Holden Karau 
>>> wrote:
>>>
>>>> Oh cool, I filed a JIRA for this already and assigned it to you
>>>> (noticed in one of my PRs)-
>>>> https://issues.apache.org/jira/browse/SPARK-32326
>>>>
>>>> On Wed, Jul 15, 2020 at 12:09 PM shane knapp ☠ 
>>>> wrote:
>>>>
>>>>> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's
>>>>> breaking the k8s builds.
>>>>>
>>>>> i'll need to purge these workers of all previous versions of R +
>>>>> packages, then reinstall from scratch.  this isn't a horrible task as i
>>>>> have most of it automated but it will still require a ~few hours of
>>>>> downtime.
>>>>>
>>>>> i'll file a JIRA, and figure out when i will be able to get to
>>>>> this...  possibly this afternoon.
>>>>> --
>>>>> Shane Knapp
>>>>> Computer Guy / Voice of Reason
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: R installation broken on ubuntu workers, impacts K8s PRB builds

2020-07-17 Thread shane knapp
starting now...  pausing jenkins so no new builds are launched.

On Thu, Jul 16, 2020 at 3:09 PM Holden Karau  wrote:

> Sounds good, thanks. No rush :)
>
> On Thu, Jul 16, 2020 at 3:03 PM shane knapp ☠  wrote:
>
>> i'll get to this tomorrow afternoon, and there will be a short downtime.
>> more details to come.
>>
>> On Wed, Jul 15, 2020 at 12:17 PM Holden Karau 
>> wrote:
>>
>>> Oh cool, I filed a JIRA for this already and assigned it to you (noticed
>>> in one of my PRs)- https://issues.apache.org/jira/browse/SPARK-32326
>>>
>>> On Wed, Jul 15, 2020 at 12:09 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's
>>>> breaking the k8s builds.
>>>>
>>>> i'll need to purge these workers of all previous versions of R +
>>>> packages, then reinstall from scratch.  this isn't a horrible task as i
>>>> have most of it automated but it will still require a ~few hours of
>>>> downtime.
>>>>
>>>> i'll file a JIRA, and figure out when i will be able to get to this...
>>>> possibly this afternoon.
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: R installation broken on ubuntu workers, impacts K8s PRB builds

2020-07-16 Thread shane knapp
i'll get to this tomorrow afternoon, and there will be a short downtime.
more details to come.

On Wed, Jul 15, 2020 at 12:17 PM Holden Karau  wrote:

> Oh cool, I filed a JIRA for this already and assigned it to you (noticed
> in one of my PRs)- https://issues.apache.org/jira/browse/SPARK-32326
>
> On Wed, Jul 15, 2020 at 12:09 PM shane knapp ☠ 
> wrote:
>
>> i'm not entirely sure when the dep for R got bumped to 3.5+, but it's
>> breaking the k8s builds.
>>
>> i'll need to purge these workers of all previous versions of R +
>> packages, then reinstall from scratch.  this isn't a horrible task as i
>> have most of it automated but it will still require a ~few hours of
>> downtime.
>>
>> i'll file a JIRA, and figure out when i will be able to get to this...
>> possibly this afternoon.
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


R installation broken on ubuntu workers, impacts K8s PRB builds

2020-07-15 Thread shane knapp
i'm not entirely sure when the dep for R got bumped to 3.5+, but it's
breaking the k8s builds.

i'll need to purge these workers of all previous versions of R + packages,
then reinstall from scratch.  this isn't a horrible task as i have most of
it automated but it will still require a ~few hours of downtime.

i'll file a JIRA, and figure out when i will be able to get to this...
possibly this afternoon.
-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: [DISCUSS] Drop Python 2, 3.4 and 3.5

2020-07-14 Thread shane knapp
this is seriously great news!  let's all take a moment and welcome apache
spark's python support to the present.  ;)

On Mon, Jul 13, 2020 at 7:26 PM Holden Karau  wrote:

> Awesome, thank you for driving this forward :)
>
> On Mon, Jul 13, 2020 at 7:25 PM Hyukjin Kwon  wrote:
>
>> Thank you all. Python 2, 3.4 and 3.5 are dropped now in the master branch
>> at https://github.com/apache/spark/pull/28957
>>
>> On Fri, Jul 3, 2020 at 10:01 AM Hyukjin Kwon wrote:
>>
>>> Thanks Dongjoon. That makes much more sense now!
>>>
>>> On Fri, Jul 3, 2020 at 12:11 AM Dongjoon Hyun wrote:
>>>
>>>> Thank you, Hyukjin.
>>>>
>>>> According to the Python community, Python 3.5 is also EOL at 2020-09-13
>>>> (only two months left).
>>>>
>>>> - https://www.python.org/downloads/
>>>>
>>>> So, targeting live Python versions at Apache Spark 3.1.0 (December
>>>> 2020) looks reasonable to me.
>>>>
>>>> For old Python versions, we still have Apache Spark 2.4 LTS and also
>>>> Apache Spark 3.0.x will work.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Wed, Jul 1, 2020 at 10:50 PM Yuanjian Li 
>>>> wrote:
>>>>
>>>>> +1, especially Python 2
>>>>>
>>>>> On Thu, Jul 2, 2020 at 10:20 AM Holden Karau wrote:
>>>>>
>>>>>> I’m ok with us dropping Python 2, 3.4, and 3.5 in Spark 3.1 forward.
>>>>>> It will be exciting to get to use more recent Python features. The most
>>>>>> recent Ubuntu LTS ships with 3.7, and while the previous LTS ships with
>>>>>> 3.5, if folks really can’t upgrade there’s conda.
>>>>>>
>>>>>> Is there anyone with a large Python 3.5 fleet who can’t use conda?
>>>>>>
>>>>>> On Wed, Jul 1, 2020 at 7:15 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Yeah, sure. It will be dropped at Spark 3.1 onwards. I don't think
>>>>>>> we should make such changes in maintenance releases
>>>>>>>
>>>>>>> On Thu, Jul 2, 2020 at 11:13 AM Holden Karau wrote:
>>>>>>>
>>>>>>>> To be clear the plan is to drop them in Spark 3.1 onwards, yes?
>>>>>>>>
>>>>>>>> On Wed, Jul 1, 2020 at 7:11 PM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I would like to discuss dropping deprecated Python versions 2, 3.4
>>>>>>>>> and 3.5 at https://github.com/apache/spark/pull/28957. I assume
>>>>>>>>> people support it in general
>>>>>>>>> but I am writing this to make sure everybody is happy.
>>>>>>>>>
>>>>>>>>> Fokko made a very good investigation on it, see
>>>>>>>>> https://github.com/apache/spark/pull/28957#issuecomment-652022449.
>>>>>>>>> Assuming from the statistics, I think we're pretty safe to drop
>>>>>>>>> them.
>>>>>>>>> Also note that dropping Python 2 was actually declared at
>>>>>>>>> https://python3statement.org/
>>>>>>>>>
>>>>>>>>> Roughly speaking, there are several main advantages to dropping them:
>>>>>>>>>   1. It removes a bunch of hacks we added, around 700 lines in
>>>>>>>>> PySpark.
>>>>>>>>>   2. PyPy2 has a critical bug that causes a flaky test,
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-28358 given my
>>>>>>>>> testing and investigation.
>>>>>>>>>   3. Users can use Python type hints with Pandas UDFs without
>>>>>>>>> thinking about Python version
>>>>>>>>>   4. Users can leverage one latest cloudpickle,
>>>>>>>>> https://github.com/apache/spark/pull/28950. With Python 3.8+ it
>>>>>>>>> can also leverage C pickle.
>>>>>>>>>   5. ...
>>>>>>>>>
>>>>>>>>> So it benefits both users and dev. WDYT guys?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>
>>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Welcoming some new Apache Spark committers

2020-07-14 Thread shane knapp
welcome, all!

On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new roles! The new committers are:
>
> - Huaxin Gao
> - Jungtaek Lim
> - Dilip Biswal
>
> All three of them contributed to Spark 3.0 and we’re excited to have them
> join the project.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-13 Thread shane knapp
alright, the system load graphs show that we've had a generally decreasing
load since friday, and have burned through ~3k builds/day since the reboot
last week!  i don't see many timeouts, and the PRB builds have been
generally green for a couple of days.

again, i will keep an eye on things but i feel we're out of the woods right
now.  :)

shane

On Fri, Jul 10, 2020 at 3:43 PM Frank Yin  wrote:

> Great. Thanks.
>
> On Fri, Jul 10, 2020 at 3:39 PM shane knapp ☠  wrote:
>
>> no, 8 hours is plenty.  things will speed up soon once the backlog of
>> builds works through.  i limited the number of PRB builds to 4 per
>> worker, and things are looking better.  let's see how we look next week.
>>
>> On Fri, Jul 10, 2020 at 3:31 PM Frank Yin  wrote:
>>
>>> Can we also increase the build timeout?
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
>>> This one fails because it times out, not because of test failures.
>>>
>>> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin  wrote:
>>>
>>>> Yeah, that's what I figured -- those workers are under load. Thanks.
>>>>
>>>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ 
>>>> wrote:
>>>>
>>>>> only 125561, 125562 and 125564 were impacted by -9.
>>>>>
>>>>> 125565 exited w/a code of 15 (143 - 128), which means the process was
>>>>> terminated for unknown reasons.
>>>>>
>>>>> 125563 looks like mima failed due to a bunch of errors.
>>>>>
>>>>> i just spot checked a bunch of recent failed PRB builds from today and
>>>>> they all seemed to be legit.
>>>>>
>>>>> another thing that might be happening is an overload of PRB builds on
>>>>> the workers due to the backlog...  the workers are under a LOT of load
>>>>> right now, and i can put some rate limiting in to see if that helps out.
>>>>>
>>>>> shane
>>>>>
>>>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin 
>>>>> wrote:
>>>>>
>>>>>> Like from build number 125565 to 125561, all impacted by kill -9.
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ 
>>>>>> wrote:
>>>>>>
>>>>>>> define "a lot" and provide some links to those builds, please.
>>>>>>> there are roughly 2000 builds per day, and i can't do more than keep a
>>>>>>> cursory eye on things.
>>>>>>>
>>>>>>> the infrastructure that the tests run on hasn't changed one bit on
>>>>>>> any of the workers, and 'kill -9' could be a timeout, flakiness caused
>>>>>>> by old build processes remaining on the workers after the master went
>>>>>>> down, or me trying to clean things up w/o a reboot.  or, perhaps,
>>>>>>> something wrong w/the infra.  :)
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Agree, but I’ve seen a lot of kill by signal 9, assuming that
>>>>>>>> infrastructure?
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> yeah, i can't do much for flaky tests...  just flaky
>>>>>>>>> infrastructure.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Couple of flaky tests can happen. It's usual. Seems it got better
>>

Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp
no, 8 hours is plenty.  things will speed up soon once the backlog of
builds works through.  i limited the number of PRB builds to 4 per
worker, and things are looking better.  let's see how we look next week.

On Fri, Jul 10, 2020 at 3:31 PM Frank Yin  wrote:

> Can we also increase the build timeout?
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125617
> This one fails because it times out, not because of test failures.
>
> On Fri, Jul 10, 2020 at 2:16 PM Frank Yin  wrote:
>
>> Yeah, that's what I figured -- those workers are under load. Thanks.
>>
>> On Fri, Jul 10, 2020 at 12:43 PM shane knapp ☠ 
>> wrote:
>>
>>> only 125561, 125562 and 125564 were impacted by -9.
>>>
>>> 125565 exited w/a code of 15 (143 - 128), which means the process was
>>> terminated for unknown reasons.
>>>
>>> 125563 looks like mima failed due to a bunch of errors.
>>>
>>> i just spot checked a bunch of recent failed PRB builds from today and
>>> they all seemed to be legit.
>>>
>>> another thing that might be happening is an overload of PRB builds on
>>> the workers due to the backlog...  the workers are under a LOT of load
>>> right now, and i can put some rate limiting in to see if that helps out.
>>>
>>> shane
>>>
>>> On Fri, Jul 10, 2020 at 11:31 AM Frank Yin  wrote:
>>>
>>>> Like from build number 125565 to 125561, all impacted by kill -9.
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>>>>
>>>> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠ 
>>>> wrote:
>>>>
>>>>> define "a lot" and provide some links to those builds, please.  there
>>>>> are roughly 2000 builds per day, and i can't do more than keep a cursory
>>>>> eye on things.
>>>>>
>>>>> the infrastructure that the tests run on hasn't changed one bit on any
>>>>> of the workers, and 'kill -9' could be a timeout, flakiness caused by old
>>>>> build processes remaining on the workers after the master went down, or me
>>>>> trying to clean things up w/o a reboot.  or, perhaps, something wrong
>>>>> w/the infra.  :)
>>>>>
>>>>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin  wrote:
>>>>>
>>>>>> Agree, but I’ve seen a lot of kill by signal 9, assuming that
>>>>>> infrastructure?
>>>>>>
>>>>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ 
>>>>>> wrote:
>>>>>>
>>>>>>> yeah, i can't do much for flaky tests...  just flaky infrastructure.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Couple of flaky tests can happen. It's usual. Seems it got better
>>>>>>>> now at least. I will keep monitoring the builds.
>>>>>>>>
>>>>>>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 wrote:
>>>>>>>>
>>>>>>>>> Looks like Jenkins isn't stable still. My PR fails two times in a
>>>>>>>>> row:
>>>>>>>>>
>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>>>>
>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Sent from:
>>>>>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Shane Knapp
>>>>>>> Computer Guy / Voice of Reason
>>>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>>>> https://rise.cs.berkeley.edu
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> Shane Knapp
>>>>> Computer Guy / Voice of Reason
>>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>>> https://rise.cs.berkeley.edu
>>>>>
>>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp
only 125561, 125562 and 125564 were impacted by -9.

125565 exited w/a code of 15 (143 - 128), which means the process was
terminated for unknown reasons.
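
For readers decoding these numbers: POSIX shells such as bash report a process killed by signal N as exit status 128 + N, so a status of 143 corresponds to signal 15 (SIGTERM) and 137 to signal 9 (SIGKILL). A minimal Python sketch of that arithmetic follows. The helper name `describe_exit_status` is hypothetical and is not part of Jenkins or the Spark build scripts.

```python
import signal


def describe_exit_status(status: int) -> str:
    """Explain a shell-style exit status (hypothetical helper, for illustration)."""
    if status == 0:
        return "exited normally"
    if status > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The signal number is not defined on this platform.
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with error code {status}"


if __name__ == "__main__":
    # 143 is the status discussed above; 137 is what 'kill -9' produces.
    for status in (143, 137):
        print(status, "->", describe_exit_status(status))
```

SIGTERM is the default signal sent by `kill`, so a 143 usually means something asked the process to stop, rather than the process crashing on its own.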

125563 looks like mima failed due to a bunch of errors.

i just spot checked a bunch of recent failed PRB builds from today and they
all seemed to be legit.

another thing that might be happening is an overload of PRB builds on the
workers due to the backlog...  the workers are under a LOT of load right
now, and i can put some rate limiting in to see if that helps out.

shane

On Fri, Jul 10, 2020 at 11:31 AM Frank Yin  wrote:

> Like from build number 125565 to 125561, all impacted by kill -9.
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125564/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125563/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125562/console
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125561/console
>
> On Fri, Jul 10, 2020 at 9:35 AM shane knapp ☠  wrote:
>
>> define "a lot" and provide some links to those builds, please.  there are
>> roughly 2000 builds per day, and i can't do more than keep a cursory eye on
>> things.
>>
>> the infrastructure that the tests run on hasn't changed one bit on any of
>> the workers, and 'kill -9' could be a timeout, flakiness caused by old
>> build processes remaining on the workers after the master went down, or me
>> trying to clean things up w/o a reboot.  or, perhaps, something wrong w/the
>> infra.  :)
>>
>> On Fri, Jul 10, 2020 at 9:28 AM Frank Yin  wrote:
>>
>>> Agree, but I’ve seen a lot of kill by signal 9, assuming that
>>> infrastructure?
>>>
>>> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠ 
>>> wrote:
>>>
>>>> yeah, i can't do much for flaky tests...  just flaky infrastructure.
>>>>
>>>>
>>>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Couple of flaky tests can happen. It's usual. Seems it got better now
>>>>> at least. I will keep monitoring the builds.
>>>>>
>>>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 wrote:
>>>>>
>>>>>> Looks like Jenkins isn't stable still. My PR fails two times in a row:
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>>>
>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp
define "a lot" and provide some links to those builds, please.  there are
roughly 2000 builds per day, and i can't do more than keep a cursory eye on
things.

the infrastructure that the tests run on hasn't changed one bit on any of
the workers, and 'kill -9' could be a timeout, flakiness caused by old
build processes remaining on the workers after the master went down, or me
trying to clean things up w/o a reboot.  or, perhaps, something wrong w/the
infra.  :)

On Fri, Jul 10, 2020 at 9:28 AM Frank Yin  wrote:

> Agree, but I’ve seen a lot of kill by signal 9, assuming that
> infrastructure?
>
> On Fri, Jul 10, 2020 at 8:19 AM shane knapp ☠  wrote:
>
>> yeah, i can't do much for flaky tests...  just flaky infrastructure.
>>
>>
>> On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon 
>> wrote:
>>
>>> Couple of flaky tests can happen. It's usual. Seems it got better now at
>>> least. I will keep monitoring the builds.
>>>
>>> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 wrote:
>>>
>>>> Looks like Jenkins isn't stable still. My PR fails two times in a row:
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>>>
>>>>
>>>>
>>>> --
>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>
>>>>
>>>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-10 Thread shane knapp
yeah, i can't do much for flaky tests...  just flaky infrastructure.


On Fri, Jul 10, 2020 at 12:41 AM Hyukjin Kwon  wrote:

> Couple of flaky tests can happen. It's usual. Seems it got better now at
> least. I will keep monitoring the builds.
>
> On Fri, Jul 10, 2020 at 4:33 PM ukby1234 wrote:
>
>> Looks like Jenkins isn't stable still. My PR fails two times in a row:
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125565/console
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125536/testReport
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>>
>>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread shane knapp
i'm seeing green PRB builds now, so i feel that we've gotten things
building again!  :)

On Thu, Jul 9, 2020 at 5:33 PM Hyukjin Kwon  wrote:

> Thank you Shane.
>
> On Fri, Jul 10, 2020 at 2:35 AM shane knapp ☠ wrote:
>
>> and -06 is back!  i'll keep an eye on things today, but suffice to
>> say on each worker i:
>>
>> 1) rebooted
>> 2) cleaned ~/.ivy2, ~/.m2, and other associated caches
>>
>> we should be g2g!  please reply here if you continue to see weirdness.
>>
>> On Thu, Jul 9, 2020 at 10:08 AM shane knapp ☠ 
>> wrote:
>>
>>> ok, we're back up and building (just waiting for one worker, -06 to
>>> finish cleaning itself up).
>>>
>>> On Thu, Jul 9, 2020 at 9:30 AM shane knapp ☠ 
>>> wrote:
>>>
>>>> this is happening now.
>>>>
>>>> On Wed, Jul 8, 2020 at 9:07 AM shane knapp ☠ 
>>>> wrote:
>>>>
>>>>> this will be happening tomorrow...  today is Meeting Hell Day[tm].
>>>>>
>>>>> On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠ 
>>>>> wrote:
>>>>>
>>>>>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick
>>>>>> trip to the colo tomorrow morning.  if not, then first thing thursday.
>>>>>>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread shane knapp
and -06 is back!  i'll keep an eye on things today, but suffice to say
on each worker i:

1) rebooted
2) cleaned ~/.ivy2, ~/.m2, and other associated caches

we should be g2g!  please reply here if you continue to see weirdness.

On Thu, Jul 9, 2020 at 10:08 AM shane knapp ☠  wrote:

> ok, we're back up and building (just waiting for one worker, -06 to finish
> cleaning itself up).
>
> On Thu, Jul 9, 2020 at 9:30 AM shane knapp ☠  wrote:
>
>> this is happening now.
>>
>> On Wed, Jul 8, 2020 at 9:07 AM shane knapp ☠  wrote:
>>
>>> this will be happening tomorrow...  today is Meeting Hell Day[tm].
>>>
>>> On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠ 
>>> wrote:
>>>
>>>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick
>>>> trip to the colo tomorrow morning.  if not, then first thing thursday.
>>>>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread shane knapp
ok, we're back up and building (just waiting for one worker, -06 to finish
cleaning itself up).

On Thu, Jul 9, 2020 at 9:30 AM shane knapp ☠  wrote:

> this is happening now.
>
> On Wed, Jul 8, 2020 at 9:07 AM shane knapp ☠  wrote:
>
>> this will be happening tomorrow...  today is Meeting Hell Day[tm].
>>
>> On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠  wrote:
>>
>>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick
>>> trip to the colo tomorrow morning.  if not, then first thing thursday.
>>>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-09 Thread shane knapp
this is happening now.

On Wed, Jul 8, 2020 at 9:07 AM shane knapp ☠  wrote:

> this will be happening tomorrow...  today is Meeting Hell Day[tm].
>
> On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠  wrote:
>
>> i wasn't able to get to it today, so i'm hoping to squeeze in a quick
>> trip to the colo tomorrow morning.  if not, then first thing thursday.
>>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-08 Thread shane knapp
this will be happening tomorrow...  today is Meeting Hell Day[tm].

On Tue, Jul 7, 2020 at 1:59 PM shane knapp ☠  wrote:

> i wasn't able to get to it today, so i'm hoping to squeeze in a quick trip
> to the colo tomorrow morning.  if not, then first thing thursday.
>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


restarting jenkins build system tomorrow (7/8) ~930am PDT

2020-07-07 Thread shane knapp
i wasn't able to get to it today, so i'm hoping to squeeze in a quick trip
to the colo tomorrow morning.  if not, then first thing thursday.

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: m2 cache issues in Jenkins?

2020-07-06 Thread shane knapp
ok, i'm gonna have to reboot all the workers tomorrow and wipe the m2
caches.  it looks like zombie builds were lingering post-jenkins-wedging
and corrupting the repos.

fixed on -05.

On Mon, Jul 6, 2020 at 2:17 PM Jungtaek Lim 
wrote:

> Just encountered the same thing, and it's worker-05 again. (Search for [error]
> in the console to see what the problem is. I guess the jetty artifacts on the
> worker might be messed up.)
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125127/consoleFull
>
>
> On Tue, Jul 7, 2020 at 5:35 AM Jungtaek Lim 
> wrote:
>
>> Could this be a flaky or persistent issue? It failed with Scala gendoc
>> but it didn't fail with the part the PR modified. It ran from worker-05.
>>
>>
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/125121/consoleFull
>>
>> On Tue, Jul 7, 2020 at 2:10 AM shane knapp ☠  wrote:
>>
>>> i killed and retriggered the PRB jobs on 04, and wiped that worker's m2
>>> cache.
>>>
>>> On Mon, Jul 6, 2020 at 9:24 AM shane knapp ☠ 
>>> wrote:
>>>
>>>> once the jobs running on that worker are finished, yes.
>>>>
>>>> On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Shane, can we remove .m2 in worker machine 4?
>>>>>
>>>>> On Fri, Jul 3, 2020 at 8:18 AM Jungtaek Lim wrote:
>>>>>
>>>>>> Looks like the Jenkins service itself has become unstable. It took
>>>>>> considerable time just to open the test report for a specific build, and
>>>>>> Jenkins doesn't pick up the rebuild request (retest this, please) in a
>>>>>> GitHub comment.
>>>>>>
>>>>>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Ah, okay. Actually there already is -
>>>>>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening.
>>>>>>>
>>>>>>> On Thu, Jul 2, 2020 at 2:06 PM Holden Karau wrote:
>>>>>>>
>>>>>>>> We don't. I didn't file one originally, but Shane reminded me to in
>>>>>>>> the future.
>>>>>>>>
>>>>>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Nope, do we have an existing ticket? I think we can reopen it if
>>>>>>>>> there is one.
>>>>>>>>>
>>>>>>>>> On Thu, Jul 2, 2020 at 1:43 PM Holden Karau wrote:
>>>>>>>>>
>>>>>>>>>> Huh interesting that it’s the same worker. Have you filed a
>>>>>>>>>> ticket to Shane?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM shane knapp ☠ wrote:
>>>>>>>>>>>
>>>>>>>>>>>> done:
>>>>>>>>>>>> -bash-4.1$ cd .m2
>>>>>>>>>>>> -bash-4.1$ ls
>>>>>>>>>>>> repository
>>>>>>>>>>>> -bash-4.1$ time rm -rf *
>>>>>>>>>>>>
>>>>>>>>>>>> real    17m4.607s
>>>>>>>>>>>> user    0m0.950s
>>>>>>>>>>>> sys     0m18.816s
>>>>>>>>>>>> -bash-4.1$
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ <
>>>>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> ok, i've taken that worker offline and once the job running on
>>>>>>>>>>>>> it finishes, i'll wipe the cache.
>>>>>>>>>>>>>
>>>>>>>>>>>>> in the future, please file a JIRA and assign it to me so i

Re: m2 cache issues in Jenkins?

2020-07-06 Thread shane knapp
i killed and retriggered the PRB jobs on 04, and wiped that worker's m2
cache.

On Mon, Jul 6, 2020 at 9:24 AM shane knapp ☠  wrote:

> once the jobs running on that worker are finished, yes.
>
> On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon  wrote:
>
>> Shane, can we remove .m2 in worker machine 4?
>>
>> On Fri, Jul 3, 2020 at 8:18 AM Jungtaek Lim wrote:
>>
>>> Looks like the Jenkins service itself has become unstable. It took
>>> considerable time just to open the test report for a specific build, and
>>> Jenkins doesn't pick up the rebuild request (retest this, please) in a
>>> GitHub comment.
>>>
>>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon  wrote:
>>>
>>>> Ah, okay. Actually there already is -
>>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening.
>>>>
>>>> On Thu, Jul 2, 2020 at 2:06 PM Holden Karau wrote:
>>>>
>>>>> We don't. I didn't file one originally, but Shane reminded me to in the
>>>>> future.
>>>>>
>>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Nope, do we have an existing ticket? I think we can reopen it if there
>>>>>> is one.
>>>>>>
>>>>>> On Thu, Jul 2, 2020 at 1:43 PM Holden Karau wrote:
>>>>>>
>>>>>>> Huh interesting that it’s the same worker. Have you filed a ticket
>>>>>>> to Shane?
>>>>>>>
>>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>>>>>>>
>>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM shane knapp ☠ wrote:
>>>>>>>>
>>>>>>>>> done:
>>>>>>>>> -bash-4.1$ cd .m2
>>>>>>>>> -bash-4.1$ ls
>>>>>>>>> repository
>>>>>>>>> -bash-4.1$ time rm -rf *
>>>>>>>>>
>>>>>>>>> real    17m4.607s
>>>>>>>>> user    0m0.950s
>>>>>>>>> sys     0m18.816s
>>>>>>>>> -bash-4.1$
>>>>>>>>>
>>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ <
>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>
>>>>>>>>>> ok, i've taken that worker offline and once the job running on it
>>>>>>>>>> finishes, i'll wipe the cache.
>>>>>>>>>>
>>>>>>>>>> in the future, please file a JIRA and assign it to me so i don't
>>>>>>>>>> have to track my work through emails to the dev@ list.  ;)
>>>>>>>>>>
>>>>>>>>>> thanks!
>>>>>>>>>>
>>>>>>>>>> shane
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau <
>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>
>>>>>>>>>>> The most recent one I noticed was
>>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>>>>>>>>>> which was run on amp-jenkins-worker-04.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ <
>>>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> for those weird failures, it's super helpful to provide which
>>>>>>>>>>>> workers are showing these issues.  :)
>>>>>>>>>>>>
>>>>>>>>>>>> i'd rather not wipe all of the m2 caches on all of the workers,
>>>>>>>>>>>> as we'll then potentially get blacklisted again if we download
>>>>>>>>>>>> too many packages from apache.org.
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau <

Re: m2 cache issues in Jenkins?

2020-07-06 Thread shane knapp
once the jobs running on that worker are finished, yes.

On Sun, Jul 5, 2020 at 7:41 PM Hyukjin Kwon  wrote:

> Shane, can we remove .m2 in worker machine 4?
>
> On Fri, Jul 3, 2020 at 8:18 AM Jungtaek Lim wrote:
>
>> Looks like the Jenkins service itself has become unstable. It took
>> considerable time just to open the test report for a specific build, and
>> Jenkins doesn't pick up the rebuild request (retest this, please) in a
>> GitHub comment.
>>
>> On Thu, Jul 2, 2020 at 2:12 PM Hyukjin Kwon  wrote:
>>
>>> Ah, okay. Actually there already is -
>>> https://issues.apache.org/jira/browse/SPARK-31693. I am reopening.
>>>
>>> On Thu, Jul 2, 2020 at 2:06 PM Holden Karau wrote:
>>>
>>>> We don't. I didn't file one originally, but Shane reminded me to in the
>>>> future.
>>>>
>>>> On Wed, Jul 1, 2020 at 9:44 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Nope, do we have an existing ticket? I think we can reopen it if there
>>>>> is one.
>>>>>
>>>>> On Thu, Jul 2, 2020 at 1:43 PM Holden Karau wrote:
>>>>>
>>>>>> Huh interesting that it’s the same worker. Have you filed a ticket to
>>>>>> Shane?
>>>>>>
>>>>>> On Wed, Jul 1, 2020 at 8:50 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Hm .. seems this is happening again in amp-jenkins-worker-04 ;(.
>>>>>>>
>>>>>>> On Thu, Jun 25, 2020 at 3:15 AM shane knapp ☠ wrote:
>>>>>>>
>>>>>>>> done:
>>>>>>>> -bash-4.1$ cd .m2
>>>>>>>> -bash-4.1$ ls
>>>>>>>> repository
>>>>>>>> -bash-4.1$ time rm -rf *
>>>>>>>>
>>>>>>>> real    17m4.607s
>>>>>>>> user    0m0.950s
>>>>>>>> sys     0m18.816s
>>>>>>>> -bash-4.1$
>>>>>>>>
>>>>>>>> On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠ 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> ok, i've taken that worker offline and once the job running on it
>>>>>>>>> finishes, i'll wipe the cache.
>>>>>>>>>
>>>>>>>>> in the future, please file a JIRA and assign it to me so i don't
>>>>>>>>> have to track my work through emails to the dev@ list.  ;)
>>>>>>>>>
>>>>>>>>> thanks!
>>>>>>>>>
>>>>>>>>> shane
>>>>>>>>>
>>>>>>>>> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau <
>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>
>>>>>>>>>> The most recent one I noticed was
>>>>>>>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>>>>>>>>>> which was run on amp-jenkins-worker-04.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ <
>>>>>>>>>> skn...@berkeley.edu> wrote:
>>>>>>>>>>
>>>>>>>>>>> for those weird failures, it's super helpful to provide which
>>>>>>>>>>> workers are showing these issues.  :)
>>>>>>>>>>>
>>>>>>>>>>> i'd rather not wipe all of the m2 caches on all of the workers,
>>>>>>>>>>> as we'll then potentially get blacklisted again if we download
>>>>>>>>>>> too many packages from apache.org.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau <
>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Folks,
>>>>>>>>>>>>
>>>>>>>>>>>> I've been seeing some weird failures on Jenkins, and it looks like
>>>>>>>>>>>> they might be from the m2 cache. Would it be OK to clean it out? Or
>>>>>>>>>>>> is it

Re: Jenkins is down

2020-07-05 Thread shane knapp
hey all, i was out of town for the weekend and noticed it was down this
morning and restarted the service.  it's been pretty flaky recently, so
i'll take a much closer look at things this coming week.

On Sun, Jul 5, 2020 at 1:14 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> The AmpLab Jenkins farm is back online now.
>
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>
> Also, many PRBuilder jobs were re-started 10 minutes ago.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Jul 3, 2020 at 4:43 AM Hyukjin Kwon  wrote:
>
>> Hi all and Shane,
>>
>> Is there something wrong with the Jenkins machines? Seems they are down.
>>
>

-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: m2 cache issues in Jenkins?

2020-06-24 Thread shane knapp
done:
-bash-4.1$ cd .m2
-bash-4.1$ ls
repository
-bash-4.1$ time rm -rf *

real    17m4.607s
user    0m0.950s
sys     0m18.816s
-bash-4.1$

On Wed, Jun 24, 2020 at 10:50 AM shane knapp ☠  wrote:

> ok, i've taken that worker offline and once the job running on it
> finishes, i'll wipe the cache.
>
> in the future, please file a JIRA and assign it to me so i don't have to
> track my work through emails to the dev@ list.  ;)
>
> thanks!
>
> shane
>
> On Wed, Jun 24, 2020 at 10:48 AM Holden Karau 
> wrote:
>
>> The most recent one I noticed was
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/124437/console
>> which was run on amp-jenkins-worker-04.
>>
>> On Wed, Jun 24, 2020 at 10:44 AM shane knapp ☠ 
>> wrote:
>>
>>> for those weird failures, it's super helpful to provide which workers
>>> are showing these issues.  :)
>>>
>>> i'd rather not wipe all of the m2 caches on all of the workers, as we'll
>>> then potentially get blacklisted again if we download too many packages
>>> from apache.org.
>>>
>>> On Tue, Jun 23, 2020 at 5:58 PM Holden Karau 
>>> wrote:
>>>
>>>> Hi Folks,
>>>>
>>>> I've been seeing some weird failures on Jenkins, and it looks like they
>>>> might be from the m2 cache. Would it be OK to clean it out? Or is it important?
>>>>
>>>> Cheers,
>>>>
>>>> Holden
>>>>
>>>> --
>>>> Twitter: https://twitter.com/holdenkarau
>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>


-- 
Shane Knapp
Computer Guy / Voice of Reason
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

