Re: Kubernetes: why use init containers?

2018-01-09 Thread Anirudh Ramanathan
Marcelo, I can see that we might be misunderstanding what this change
implies for performance and some of the deeper implementation details here.
We have a community meeting tomorrow (at 10am PT), and we'll be sure to
explore this idea in detail, and understand the implications and then get
back to you.

Thanks for the detailed responses here, and for spending time with the idea.
(Also, you're more than welcome to attend the meeting - there's a link here
if you're around.)

Cheers,
Anirudh


On Jan 9, 2018 8:05 PM, "Marcelo Vanzin"  wrote:

One thing I forgot in my previous e-mail is that if a resource is
remote I'm pretty sure (but haven't double checked the code) that
executors will download it directly from the remote server, and not
from the driver. So there, distributed download without an init
container.

On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li  wrote:
> The init-container is required for use with the resource staging server
> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).

If the staging server *requires* an init container, you already have a
design problem right there.

> Additionally, the init-container is a Kubernetes
> native way of making sure that the dependencies are localized

Sorry, but the init container does not do anything by itself. You had
to add a whole bunch of code to execute the existing Spark code in an
init container, when not doing it would have achieved the exact same
goal much more easily, in a way that is consistent with how Spark
already does things.

Matt:
> the executors wouldn’t receive the jars on their class loader until after
> the executor starts

I actually consider that a benefit. It means spark-on-k8s applications
will behave more like all the other backends, where that is true also
(application jars live in a separate class loader).

> traditionally meant to prepare the environment for the application that
> is to be run

You guys are forcing this argument when it all depends on where you
draw the line. Spark can be launched without downloading any of those
dependencies, because Spark will download them for you. Forcing the
"kubernetes way" just means you're writing a lot more code, and
breaking the Spark app initialization into multiple container
invocations, to achieve the same thing.

> would make the SparkSubmit code inadvertently allow running client mode
> Kubernetes applications as well

Not necessarily. I have that in my patch; it doesn't allow client mode
unless a property that only the cluster mode submission code sets is
present. If some user wants to hack their way around that, more power
to them; users can also compile their own Spark without the checks if
they want to try out client mode in some way.

Anirudh:
> Telling users that they must rebuild images ... every time seems less
> than convincing to me.

Sure, I'm not proposing people use the docker image approach all the
time. It would be a hassle while developing an app, as it is kind of a
hassle today where the code doesn't upload local files to the k8s
cluster.

But it's perfectly reasonable for people to optimize a production app
by bundling the app into a pre-built docker image to avoid
re-downloading resources every time. Like they'd probably place the
jar + dependencies on HDFS today with YARN, to get the benefits of the
YARN cache.
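For readers unfamiliar with the YARN pattern being referenced, a rough sketch of staging dependencies on HDFS once and reusing them across runs (all paths, jar names, and the class name here are illustrative, not taken from this thread):

```shell
# Stage the application jar and a dependency on HDFS once
# (hypothetical paths), so YARN's distributed cache can reuse
# them across submissions instead of re-uploading every run.
hdfs dfs -mkdir -p /apps/myapp
hdfs dfs -put myapp.jar dep.jar /apps/myapp/

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --jars hdfs:///apps/myapp/dep.jar \
  hdfs:///apps/myapp/myapp.jar
```

Because the artifacts already live on HDFS, repeated submissions avoid re-shipping them from the client machine.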

--
Marcelo

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org


Re: Kubernetes: why use init containers?

2018-01-09 Thread Marcelo Vanzin
One thing I forgot in my previous e-mail is that if a resource is
remote I'm pretty sure (but haven't double checked the code) that
executors will download it directly from the remote server, and not
from the driver. So there, distributed download without an init
container.

On Tue, Jan 9, 2018 at 7:15 PM, Yinan Li  wrote:
> The init-container is required for use with the resource staging server
> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).

If the staging server *requires* an init container, you already have a
design problem right there.

> Additionally, the init-container is a Kubernetes
> native way of making sure that the dependencies are localized

Sorry, but the init container does not do anything by itself. You had
to add a whole bunch of code to execute the existing Spark code in an
init container, when not doing it would have achieved the exact same
goal much more easily, in a way that is consistent with how Spark
already does things.

Matt:
> the executors wouldn’t receive the jars on their class loader until after the 
> executor starts

I actually consider that a benefit. It means spark-on-k8s applications
will behave more like all the other backends, where that is true also
(application jars live in a separate class loader).

> traditionally meant to prepare the environment for the application that is to 
> be run

You guys are forcing this argument when it all depends on where you
draw the line. Spark can be launched without downloading any of those
dependencies, because Spark will download them for you. Forcing the
"kubernetes way" just means you're writing a lot more code, and
breaking the Spark app initialization into multiple container
invocations, to achieve the same thing.

> would make the SparkSubmit code inadvertently allow running client mode 
> Kubernetes applications as well

Not necessarily. I have that in my patch; it doesn't allow client mode
unless a property that only the cluster mode submission code sets is
present. If some user wants to hack their way around that, more power
to them; users can also compile their own Spark without the checks if
they want to try out client mode in some way.

Anirudh:
> Telling users that they must rebuild images  ... every time seems less than 
> convincing to me.

Sure, I'm not proposing people use the docker image approach all the
time. It would be a hassle while developing an app, as it is kind of a
hassle today where the code doesn't upload local files to the k8s
cluster.

But it's perfectly reasonable for people to optimize a production app
by bundling the app into a pre-built docker image to avoid
re-downloading resources every time. Like they'd probably place the
jar + dependencies on HDFS today with YARN, to get the benefits of the
YARN cache.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-09 Thread Anirudh Ramanathan
Marcelo, to address the points you raised:

> k8s uses docker images. Users can create docker images with all the
dependencies their app needs, and submit the app using that image.

The entire reason we support methods of localizing dependencies other than
baking everything into docker images is that it's not a good workflow fit
for all use-cases. There are definitely some users that will do that (and
I've spoken to some), building a versioned image in their registry with a
CD pipeline every time they change their code, but a lot of people are
looking for something lighter - versioning application code, not entire
images. Telling users that they must rebuild images and pay the cost of
localizing new images from the docker registry (which is also not very
well understood/measured in terms of performance) every time seems less
than convincing to me.

> - The original spark-on-k8s spec mentioned a "dependency server"
approach which sounded like a more generic version of the YARN
distributed cache, which I hope can be a different way of mitigating
that issue. With that work, we could build this functionality into
spark-submit itself and have other backends also benefit.

The resource staging server, as written, was a non-HA fileserver for
staging dependencies within the cluster. It's not distributed and has no
notion of locality, etc. I don't think we had plans (yet) to invest in
making it more like the distributed cache you mentioned, at least not
until we heard back from the community - so that's unplanned work at this
point. It's also hard to imagine how we could extend it to go beyond just
K8s, tbh. We should definitely have a JIRA tracking this if that's a
direction we want to explore in the future.

I understand the change you're proposing would simplify the code, but a
decision here seems hard to make until we get some real
benchmarks/measurements or user feedback.

On Tue, Jan 9, 2018 at 7:24 PM, Matt Cheah  wrote:

> A few reasons to prefer init-containers come to mind:
>
>
>
> Firstly, if we used spark-submit from within the driver container, the
> executors wouldn’t receive the jars on their class loader until after the
> executor starts because the executor has to launch first before localizing
> resources. It is certainly possible to make the class loader work with the
> user’s jars here, as is the case with all the client mode implementations,
> but, it seems cleaner to have the classpath include the user’s jars at
> executor launch time instead of needing to reason about the classloading
> order.
>
>
>
> We can also consider the idiomatic approach from the perspective of
> Kubernetes. Yinan touched on this already, but init-containers are
> traditionally meant to prepare the environment for the application that is
> to be run, which is exactly what we do here. This also makes it such that
> the localization process can be completely decoupled from the execution of
> the application itself. We can then for example detect the errors that
> happen on the resource localization layer, say when an HDFS cluster is
> down, before the application itself launches. The failure at the
> init-container stage is explicitly noted via the Kubernetes pod status API.
>
>
>
> Finally, running spark-submit from the container would make the
> SparkSubmit code inadvertently allow running client mode Kubernetes
> applications as well. We’re not quite ready to support that. Even if we
> were, it’s not entirely intuitive for the cluster mode code path to depend
> on the client mode code path. This isn’t entirely without precedent though,
> as Mesos has a similar dependency.
>
>
>
> Essentially the semantics seem neater and the contract is very explicit
> when using an init-container, even though the code does end up being more
> complex.
>
>
>
> *From: *Yinan Li 
> *Date: *Tuesday, January 9, 2018 at 7:16 PM
> *To: *Nicholas Chammas 
> *Cc: *Anirudh Ramanathan , Marcelo Vanzin
> , Matt Cheah , Kimoon Kim <
> kim...@pepperdata.com>, dev 
> *Subject: *Re: Kubernetes: why use init containers?
>
>
>
> The init-container is required for use with the resource staging server
> (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
> The resource staging server (RSS) is a spark-on-k8s component running in a
> Kubernetes cluster for staging submission client local dependencies to
> 

Re: Kubernetes: why use init containers?

2018-01-09 Thread Matt Cheah
A few reasons to prefer init-containers come to mind:

 

Firstly, if we used spark-submit from within the driver container, the 
executors wouldn’t receive the jars on their class loader until after the 
executor starts because the executor has to launch first before localizing 
resources. It is certainly possible to make the class loader work with the 
user’s jars here, as is the case with all the client mode implementations, but, 
it seems cleaner to have the classpath include the user’s jars at executor 
launch time instead of needing to reason about the classloading order.

 

We can also consider the idiomatic approach from the perspective of Kubernetes. 
Yinan touched on this already, but init-containers are traditionally meant to 
prepare the environment for the application that is to be run, which is exactly 
what we do here. This also makes it such that the localization process can be 
completely decoupled from the execution of the application itself. We can then 
for example detect the errors that happen on the resource localization layer, 
say when an HDFS cluster is down, before the application itself launches. The 
failure at the init-container stage is explicitly noted via the Kubernetes pod 
status API.
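The pod-status point can be illustrated concretely (the pod name below is hypothetical; the status fields are standard Kubernetes API):

```shell
# An init-container failure is visible before the main container ever
# runs: the STATUS column shows e.g. "Init:0/1" while localizing, or
# "Init:Error" if the download step fails.
kubectl get pod spark-driver

# Detailed per-init-container state is exposed in the pod status API:
kubectl get pod spark-driver \
  -o jsonpath='{.status.initContainerStatuses[*].state}'
```

This is what lets a localization failure (say, an unreachable HDFS cluster) be diagnosed from pod status alone, without the application ever starting.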

 

Finally, running spark-submit from the container would make the SparkSubmit 
code inadvertently allow running client mode Kubernetes applications as well. 
We’re not quite ready to support that. Even if we were, it’s not entirely 
intuitive for the cluster mode code path to depend on the client mode code 
path. This isn’t entirely without precedent though, as Mesos has a similar 
dependency.

 

Essentially the semantics seem neater and the contract is very explicit when 
using an init-container, even though the code does end up being more complex.

 

From: Yinan Li 
Date: Tuesday, January 9, 2018 at 7:16 PM
To: Nicholas Chammas 
Cc: Anirudh Ramanathan , Marcelo Vanzin 
, Matt Cheah , Kimoon Kim 
, dev 
Subject: Re: Kubernetes: why use init containers?

 

The init-container is required for use with the resource staging server 
(https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
 The resource staging server (RSS) is a spark-on-k8s component running in a 
Kubernetes cluster for staging submission client local dependencies to Spark 
pods. The init-container is responsible for downloading the dependencies from 
the RSS. We haven't upstreamed the RSS code yet, but this is a value-add 
component for Spark on K8s as a way for users to use submission-local 
dependencies without resorting to other mechanisms that are not immediately 
available on most Kubernetes clusters, e.g., HDFS. We do plan to upstream it in 
the 2.4 timeframe. Additionally, the init-container is a Kubernetes native way 
of making sure that the dependencies are localized before the main 
driver/executor containers are started. IMO, this guarantee is good to 
have and it helps achieve separation of concerns. So I think the 
init-container is a valuable component and should be kept.

 

On Tue, Jan 9, 2018 at 6:25 PM, Nicholas Chammas  
wrote:

I’d like to point out the output of “git show --stat” for that diff:
29 files changed, 130 insertions(+), 1560 deletions(-)

+1 for that and generally for the idea of leveraging spark-submit.

You can argue that executors downloading from
external servers would be faster than downloading from the driver, but
I’m not sure I’d agree - it can go both ways.

On a tangentially related note, one of the main reasons spark-ec2 is so 
slow to launch clusters is that it distributes files like the Spark 
binaries to all the workers via the master. Because of that, the launch 
time scaled with the number of workers requested.

When I wrote Flintrock, I got a large improvement in launch time 
over spark-ec2 simply by having all the workers download the installation files 
in parallel from an external host (typically S3 or an Apache mirror). And 
launch time became largely independent of the cluster size.

That may or may not say anything about the driver distributing application 
files vs. having init containers do it in parallel, but I’d be curious to hear 
more.

Nick

​

 

On Tue, Jan 9, 2018 at 9:08 PM Anirudh Ramanathan 
 wrote:

We were running a change in our fork which was similar to this at one point 
early on. My biggest concerns off the top of my head with this change would be 
localization performance with large numbers of executors, and what we lose in 
terms of separation of concerns. Init containers are a standard construct in 
k8s for resource localization. Also how this approach affects the HDFS work 
would be interesting.  


Re: Kubernetes: why use init containers?

2018-01-09 Thread Yinan Li
The init-container is required for use with the resource staging server (
https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server).
The resource staging server (RSS) is a spark-on-k8s component running in a
Kubernetes cluster for staging submission client local dependencies to
Spark pods. The init-container is responsible for downloading the
dependencies from the RSS. We haven't upstreamed the RSS code yet, but this
is a value-add component for Spark on K8s as a way for users to use
submission local dependencies without resorting to other mechanisms that
are not immediately available on most Kubernetes clusters, e.g., HDFS. We
do plan to upstream it in the 2.4 timeframe. Additionally, the
init-container is a Kubernetes native way of making sure that the
dependencies are localized before the main driver/executor containers are
started. IMO, this guarantee is good to have and it helps achieve
separation of concerns. So I think the init-container is a valuable
component and should be kept.
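As a rough sketch of the pattern under discussion (names, images, and paths below are illustrative, not the actual spark-on-k8s manifests), an init container localizes dependencies into a shared volume before the main container starts:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-driver              # illustrative name
spec:
  volumes:
    - name: spark-jars
      emptyDir: {}                # shared scratch space for localized deps
  initContainers:
    - name: spark-init            # must complete before the driver starts
      image: example/spark-init   # hypothetical image
      command: ["/opt/download-deps.sh"]  # hypothetical fetch script
      volumeMounts:
        - name: spark-jars
          mountPath: /var/spark-data/jars
  containers:
    - name: spark-driver
      image: example/spark        # hypothetical image
      volumeMounts:
        - name: spark-jars
          mountPath: /var/spark-data/jars  # deps present at startup
```

Kubernetes guarantees the init container runs to successful completion before the main container starts, which is the localization-before-launch guarantee referred to above.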

On Tue, Jan 9, 2018 at 6:25 PM, Nicholas Chammas  wrote:

> I’d like to point out the output of “git show --stat” for that diff:
> 29 files changed, 130 insertions(+), 1560 deletions(-)
>
> +1 for that and generally for the idea of leveraging spark-submit.
>
> You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I’m not sure I’d agree - it can go both ways.
>
> On a tangentially related note, one of the main reasons spark-ec2
>  is so slow to launch clusters is
> that it distributes files like the Spark binaries to all the workers via
> the master. Because of that, the launch time scaled with the number of
> workers requested .
>
> When I wrote Flintrock , I got a
> large improvement in launch time over spark-ec2 simply by having all the
> workers download the installation files in parallel from an external host
> (typically S3 or an Apache mirror). And launch time became largely
> independent of the cluster size.
>
> That may or may not say anything about the driver distributing application
> files vs. having init containers do it in parallel, but I’d be curious to
> hear more.
>
> Nick
> ​
>
> On Tue, Jan 9, 2018 at 9:08 PM Anirudh Ramanathan 
> 
> wrote:
>
>> We were running a change in our fork which was similar to this at one
>> point early on. My biggest concerns off the top of my head with this change
>> would be localization performance with large numbers of executors, and what
>> we lose in terms of separation of concerns. Init containers are a standard
>> construct in k8s for resource localization. Also how this approach affects
>> the HDFS work would be interesting.
>>
>> +matt +kimoon
>> Still thinking about the potential trade offs here. Adding Matt and
>> Kimoon who would remember more about our reasoning at the time.
>>
>>
>> On Jan 9, 2018 5:22 PM, "Marcelo Vanzin"  wrote:
>>
>>> Hello,
>>>
>>> Me again. I was playing some more with the kubernetes backend and the
>>> whole init container thing seemed unnecessary to me.
>>>
>>> Currently it's used to download remote jars and files, mount the
>>> volume into the driver / executor, and place those jars in the
>>> classpath / move the files to the working directory. This is all stuff
>>> that spark-submit already does without needing extra help.
>>>
>>> So I spent some time hacking stuff and removing the init container
>>> code, and launching the driver inside kubernetes using spark-submit
>>> (similar to how standalone and mesos cluster mode works):
>>>
>>> https://github.com/vanzin/spark/commit/k8s-no-init
>>>
>>> I'd like to point out the output of "git show --stat" for that diff:
>>>  29 files changed, 130 insertions(+), 1560 deletions(-)
>>>
>>> You get massive code reuse by simply using spark-submit. The remote
>>> dependencies are downloaded in the driver, and the driver does the job
>>> of serving them to executors.
>>>
>>> So I guess my question is: is there any advantage in using an init
>>> container?
>>>
>>> The current init container code can download stuff in parallel, but
>>> that's an easy improvement to make in spark-submit and that would
>>> benefit everybody. You can argue that executors downloading from
>>> external servers would be faster than downloading from the driver, but
>>> I'm not sure I'd agree - it can go both ways.
>>>
>>> Also the same idea could probably be applied to starting executors;
>>> Mesos starts executors using "spark-class" already, so doing that
>>> would both improve code sharing and potentially simplify some code in
>>> the k8s backend.
>>>
>>> --
>>> Marcelo
>>>

Re: Kubernetes: why use init containers?

2018-01-09 Thread Marcelo Vanzin
On Tue, Jan 9, 2018 at 6:25 PM, Nicholas Chammas
 wrote:
> You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I’m not sure I’d agree - it can go both ways.
>
> On a tangentially related note, one of the main reasons spark-ec2 is so slow
> to launch clusters is that it distributes files like the Spark binaries to
> all the workers via the master. Because of that, the launch time scaled with
> the number of workers requested.

It's true that there are side effects. But there are two things that
can be used to mitigate this:

- k8s uses docker images. Users can create docker images with all the
dependencies their app needs, and submit the app using that image.
Spark doesn't have yet documentation on how to create these customized
images, but I'd rather invest time on that instead of supporting this
init container approach.
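The image-based alternative being described might look roughly like this (the base image name and paths are hypothetical; at the time of this thread Spark did not yet document how to build such images):

```dockerfile
# Hypothetical Spark base image with the k8s entrypoint already set up.
FROM example/spark-base:2.3.0

# Bake the application jar and its dependencies into the image so
# nothing needs to be downloaded at pod startup.
COPY myapp.jar /opt/spark/work-dir/
COPY deps/ /opt/spark/jars/
```

A production job can then reference the baked-in `local://` paths at submission time, paying the image pull cost once per node instead of re-downloading dependencies on every run.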

- The original spark-on-k8s spec mentioned a "dependency server"
approach which sounded like a more generic version of the YARN
distributed cache, which I hope can be a different way of mitigating
that issue. With that work, we could build this functionality into
spark-submit itself and have other backends also benefit.

In general, forcing the download of dependencies on every invocation
of an app should be avoided.


Anirudh:
> what we lose in terms of separation of concerns

1,500 fewer lines of code lower my level of concern a lot more.

-- 
Marcelo




Re: Kubernetes: why use init containers?

2018-01-09 Thread Nicholas Chammas
I’d like to point out the output of “git show --stat” for that diff:
29 files changed, 130 insertions(+), 1560 deletions(-)

+1 for that and generally for the idea of leveraging spark-submit.

You can argue that executors downloading from
external servers would be faster than downloading from the driver, but
I’m not sure I’d agree - it can go both ways.

On a tangentially related note, one of the main reasons spark-ec2
 is so slow to launch clusters is that
it distributes files like the Spark binaries to all the workers via the
master. Because of that, the launch time scaled with the number of workers
requested .

When I wrote Flintrock , I got a
large improvement in launch time over spark-ec2 simply by having all the
workers download the installation files in parallel from an external host
(typically S3 or an Apache mirror). And launch time became largely
independent of the cluster size.

That may or may not say anything about the driver distributing application
files vs. having init containers do it in parallel, but I’d be curious to
hear more.

Nick
​

On Tue, Jan 9, 2018 at 9:08 PM Anirudh Ramanathan
 wrote:

> We were running a change in our fork which was similar to this at one
> point early on. My biggest concerns off the top of my head with this change
> would be localization performance with large numbers of executors, and what
> we lose in terms of separation of concerns. Init containers are a standard
> construct in k8s for resource localization. Also how this approach affects
> the HDFS work would be interesting.
>
> +matt +kimoon
> Still thinking about the potential trade offs here. Adding Matt and Kimoon
> who would remember more about our reasoning at the time.
>
>
> On Jan 9, 2018 5:22 PM, "Marcelo Vanzin"  wrote:
>
>> Hello,
>>
>> Me again. I was playing some more with the kubernetes backend and the
>> whole init container thing seemed unnecessary to me.
>>
>> Currently it's used to download remote jars and files, mount the
>> volume into the driver / executor, and place those jars in the
>> classpath / move the files to the working directory. This is all stuff
>> that spark-submit already does without needing extra help.
>>
>> So I spent some time hacking stuff and removing the init container
>> code, and launching the driver inside kubernetes using spark-submit
>> (similar to how standalone and mesos cluster mode works):
>>
>> https://github.com/vanzin/spark/commit/k8s-no-init
>>
>> I'd like to point out the output of "git show --stat" for that diff:
>>  29 files changed, 130 insertions(+), 1560 deletions(-)
>>
>> You get massive code reuse by simply using spark-submit. The remote
>> dependencies are downloaded in the driver, and the driver does the job
>> of serving them to executors.
>>
>> So I guess my question is: is there any advantage in using an init
>> container?
>>
>> The current init container code can download stuff in parallel, but
>> that's an easy improvement to make in spark-submit and that would
>> benefit everybody. You can argue that executors downloading from
>> external servers would be faster than downloading from the driver, but
>> I'm not sure I'd agree - it can go both ways.
>>
>> Also the same idea could probably be applied to starting executors;
>> Mesos starts executors using "spark-class" already, so doing that
>> would both improve code sharing and potentially simplify some code in
>> the k8s backend.
>>
>> --
>> Marcelo
>>


Re: Kubernetes: why use init containers?

2018-01-09 Thread Anirudh Ramanathan
We were running a change in our fork which was similar to this at one point
early on. My biggest concerns off the top of my head with this change would
be localization performance with large numbers of executors, and what we
lose in terms of separation of concerns. Init containers are a standard
construct in k8s for resource localization. Also how this approach affects
the HDFS work would be interesting.

+matt +kimoon
Still thinking about the potential trade offs here. Adding Matt and Kimoon
who would remember more about our reasoning at the time.


On Jan 9, 2018 5:22 PM, "Marcelo Vanzin"  wrote:

> Hello,
>
> Me again. I was playing some more with the kubernetes backend and the
> whole init container thing seemed unnecessary to me.
>
> Currently it's used to download remote jars and files, mount the
> volume into the driver / executor, and place those jars in the
> classpath / move the files to the working directory. This is all stuff
> that spark-submit already does without needing extra help.
>
> So I spent some time hacking stuff and removing the init container
> code, and launching the driver inside kubernetes using spark-submit
> (similar to how standalone and mesos cluster mode works):
>
> https://github.com/vanzin/spark/commit/k8s-no-init
>
> I'd like to point out the output of "git show --stat" for that diff:
>  29 files changed, 130 insertions(+), 1560 deletions(-)
>
> You get massive code reuse by simply using spark-submit. The remote
> dependencies are downloaded in the driver, and the driver does the job
> of service them to executors.
>
> So I guess my question is: is there any advantage in using an init
> container?
>
> The current init container code can download stuff in parallel, but
> that's an easy improvement to make in spark-submit and that would
> benefit everybody. You can argue that executors downloading from
> external servers would be faster than downloading from the driver, but
> I'm not sure I'd agree - it can go both ways.
>
> Also the same idea could probably be applied to starting executors;
> Mesos starts executors using "spark-class" already, so doing that
> would both improve code sharing and potentially simplify some code in
> the k8s backend.
>
> --
> Marcelo
>


Kubernetes: why use init containers?

2018-01-09 Thread Marcelo Vanzin
Hello,

Me again. I was playing some more with the kubernetes backend and the
whole init container thing seemed unnecessary to me.

Currently it's used to download remote jars and files, mount the
volume into the driver / executor, and place those jars in the
classpath / move the files to the working directory. This is all stuff
that spark-submit already does without needing extra help.
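Concretely, the behavior being pointed at is spark-submit's ordinary handling of remote URIs - a sketch under assumed values (the API server address, class name, and all paths are illustrative):

```shell
# spark-submit already resolves remote URIs (http://, hdfs://, etc.),
# downloads them, puts jars on the classpath, and moves files into the
# working directory - the same work the init container reimplements.
spark-submit \
  --master k8s://https://example-apiserver:6443 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --jars hdfs:///deps/extra.jar \
  --files http://example.com/conf/app.conf \
  hdfs:///apps/myapp.jar
```

In the linked patch, the driver performs these downloads itself and then serves the artifacts to executors, instead of a separate init container doing it in every pod.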

So I spent some time hacking stuff and removing the init container
code, and launching the driver inside kubernetes using spark-submit
(similar to how standalone and mesos cluster mode works):

https://github.com/vanzin/spark/commit/k8s-no-init

I'd like to point out the output of "git show --stat" for that diff:
 29 files changed, 130 insertions(+), 1560 deletions(-)

You get massive code reuse by simply using spark-submit. The remote
dependencies are downloaded in the driver, and the driver does the job
of serving them to executors.

So I guess my question is: is there any advantage in using an init container?

The current init container code can download stuff in parallel, but
that's an easy improvement to make in spark-submit and that would
benefit everybody. You can argue that executors downloading from
external servers would be faster than downloading from the driver, but
I'm not sure I'd agree - it can go both ways.

Also the same idea could probably be applied to starting executors;
Mesos starts executors using "spark-class" already, so doing that
would both improve code sharing and potentially simplify some code in
the k8s backend.

-- 
Marcelo




Re: Palantir release under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts
published to Palantir's bintray at
https://palantir.bintray.com/releases/org/apache/spark/. If you're seeing
any of them in Maven Central, please flag it, as that's a mistake!

Andrew

On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen  wrote:

> Just to follow up -- those are actually in a Palantir repo, not Central.
> Deploying to Central would be uncourteous, but this approach is legitimate
> and how it has to work for vendors to release distros of Spark etc.
>
>
> On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu  wrote:
>
>> Hi, all
>>
>> Out of curiosity, I just found a bunch of Palantir releases under
>> org.apache.spark in Maven Central
>> (https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11).
>>
>> Is it on purpose?
>>
>> Best,
>>
>> Nan
>>
>>
>>


Re: Palantir release under org.apache.spark?

2018-01-09 Thread Sean Owen
Just to follow up -- those are actually in a Palantir repo, not Central.
Deploying to Central would be uncourteous, but this approach is legitimate
and how it has to work for vendors to release distros of Spark etc.

On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu  wrote:

> Hi, all
>
> Out of curiosity, I just found a bunch of Palantir releases under
> org.apache.spark in Maven Central
> (https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11).
>
> Is it on purpose?
>
> Best,
>
> Nan
>
>
>


Re: Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
nvm

On Tue, Jan 9, 2018 at 9:42 AM, Nan Zhu  wrote:

> Hi, all
>
> Out of curiosity, I just found a bunch of Palantir releases under
> org.apache.spark in Maven Central
> (https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11).
>
> Is it on purpose?
>
> Best,
>
> Nan
>
>
>


Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
Hi, all

Out of curiosity, I just found a bunch of Palantir releases under
org.apache.spark in Maven Central
(https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11).

Is it on purpose?

Best,

Nan


Re: Integration testing and Scheduler Backends

2018-01-09 Thread Reynold Xin
If we can actually get our act together and have integration tests in
Jenkins (perhaps not run on every commit, but weekly or as pre-release
smoke tests), that'd be great. Then we'd rely less on contributors
testing manually.


On Tue, Jan 9, 2018 at 8:09 AM, Timothy Chen  wrote:

> (2) would be ideal, but given the velocity of the main branch, what Mesos
> ended up doing was simply keeping a separate repo, since it would take
> too long to merge back into main.
>
> We ended up running it pre-release (or when a major PR was merged) rather
> than on every PR; I'd also suggest asking users to run it.
>
> We did have conversations with Reynold about potentially having the
> ability to run the CI on every [Mesos]-tagged PR, but we never got
> there.
>
> Tim
>
> On Mon, Jan 8, 2018 at 10:16 PM, Anirudh Ramanathan
>  wrote:
> > This is with regard to the Kubernetes Scheduler Backend and scaling the
> > process to accept contributions. Given we're moving past upstreaming
> changes
> > from our fork, and into getting new patches, I wanted to start this
> > discussion sooner than later. This is more of a post-2.3 question - not
> > something we're looking to solve right away.
> >
> > While unit tests are handy, they're not nearly as good at giving us
> > confidence as a successful run of our integration tests against
> > single/multi-node k8s clusters. Currently, we have integration testing set
> > up at https://github.com/apache-spark-on-k8s/spark-integration, and it's
> > running continuously against apache/spark:master in pepperdata-jenkins (on
> > minikube) & k8s-testgrid (in GKE clusters). Now, the question is: how do we
> > make integration tests part of the PR author's workflow?
> >
> > 1. Keep the integration tests in the separate repo and require, as a
> > policy, that contributors run them and add new tests prior to their PRs
> > being accepted. Given minikube is easy to set up and can run on a single
> > node, this would certainly be possible. Friction, however, stems from
> > contributors potentially having to modify the integration test code hosted
> > in that separate repository when adding/changing functionality in the
> > scheduler backend. Also, it's certainly going to lead to at least brief
> > inconsistencies between the two repositories.
> >
> > 2. Alternatively, we check in the integration tests alongside the actual
> > scheduler backend code. This would work really well and is what we did in
> > our fork. It would have to be a separate package which would take certain
> > parameters (like the cluster endpoint) and run integration test code
> > against a local or remote cluster. It would include at least some code
> > dealing with accessing the cluster, reading results from K8s containers,
> > test fixtures, etc.
> >
> > I see value in adopting (2), given it's a clearer path for contributors
> > and lets us keep the two pieces consistent, but it seems uncommon
> > elsewhere. How do the other backends, i.e. YARN, Mesos, and Standalone,
> > deal with accepting patches and ensuring that they do not break existing
> > clusters? Is there automation employed for this thus far? Would love to
> > get opinions on (1) vs. (2).
> >
> > Thanks,
> > Anirudh
> >
> >
>


DataFrame to Dataset[String]

2018-01-09 Thread Lalwani, Jayesh
SPARK-15463 (https://issues.apache.org/jira/browse/SPARK-15463) was implemented
in 2.2.0, and it allows you to take a Dataset[String] of raw CSV/JSON and
convert it into a DataFrame. Should we have a way to go the other way too,
i.e., a way to convert a DataFrame to a Dataset[String]?
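
For JSON, note that both directions already exist in the public API:
spark.read.json gained a Dataset[String] overload in 2.2 (SPARK-15463), and
DataFrame.toJSON returns a Dataset[String]. A minimal sketch (local-mode
session for illustration only):

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

object JsonRoundTrip {
  // Local-mode session, for illustration only
  val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("json-roundtrip")
    .getOrCreate()

  // SPARK-15463 (2.2.0): parse raw JSON lines held in a Dataset[String]
  def parse(raw: Dataset[String]): DataFrame = spark.read.json(raw)

  // The reverse direction for JSON: serialize each row to a JSON string
  def serialize(df: DataFrame): Dataset[String] = df.toJSON
}
```

For CSV, spark.read.csv(Dataset[String]) covers the parsing direction, but
there is no toCSV counterpart, so the reverse there would still need something
like df.map or a write to storage.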


