Re: Kubernetes: why use init containers?

2018-01-09 Thread Anirudh Ramanathan
Marcelo, I can see that we might be misunderstanding what this change implies for performance and some of the deeper implementation details here. We have a community meeting tomorrow (at 10am PT), and we'll be sure to explore this idea in detail, and understand the implications and then get back

Re: Kubernetes: why use init containers?

2018-01-09 Thread Marcelo Vanzin
One thing I forgot in my previous e-mail is that if a resource is remote I'm pretty sure (but haven't double checked the code) that executors will download it directly from the remote server, and not from the driver. So there, distributed download without an init container. On Tue, Jan 9, 2018 at

Re: Kubernetes: why use init containers?

2018-01-09 Thread Anirudh Ramanathan
Marcelo, to address the points you raised: > k8s uses docker images. Users can create docker images with all the dependencies their app needs, and submit the app using that image. The entire reason why we support additional methods of localizing dependencies than baking everything into docker

Re: Kubernetes: why use init containers?

2018-01-09 Thread Matt Cheah
A few reasons to prefer init-containers come to mind: Firstly, if we used spark-submit from within the driver container, the executors wouldn’t receive the jars on their class loader until after the executor starts because the executor has to launch first before localizing resources. It is

Re: Kubernetes: why use init containers?

2018-01-09 Thread Yinan Li
The init-container is required for use with the resource staging server (https://github.com/apache-spark-on-k8s/userdocs/blob/master/src/jekyll/running-on-kubernetes.md#resource-staging-server). The resource staging server (RSS) is a spark-on-k8s component running in a Kubernetes cluster for

Re: Kubernetes: why use init containers?

2018-01-09 Thread Marcelo Vanzin
On Tue, Jan 9, 2018 at 6:25 PM, Nicholas Chammas wrote: > You can argue that executors downloading from > external servers would be faster than downloading from the driver, but > I’m not sure I’d agree - it can go both ways. > > On a tangentially related note, one of

Re: Kubernetes: why use init containers?

2018-01-09 Thread Nicholas Chammas
I’d like to point out the output of “git show --stat” for that diff: 29 files changed, 130 insertions(+), 1560 deletions(-) +1 for that and generally for the idea of leveraging spark-submit. You can argue that executors downloading from external servers would be faster than downloading from the

Re: Kubernetes: why use init containers?

2018-01-09 Thread Anirudh Ramanathan
We were running a change in our fork which was similar to this at one point early on. My biggest concerns off the top of my head with this change would be localization performance with large numbers of executors, and what we lose in terms of separation of concerns. Init containers are a standard

Kubernetes: why use init containers?

2018-01-09 Thread Marcelo Vanzin
Hello, me again. I was playing some more with the Kubernetes backend and the whole init container thing seemed unnecessary to me. Currently it's used to download remote jars and files, mount the volume into the driver / executor, and place those jars in the classpath / move the files to the

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Andrew Ash
That source repo is at https://github.com/palantir/spark/ with artifacts published to Palantir's bintray at https://palantir.bintray.com/releases/org/apache/spark/ If you're seeing any of them in Maven Central please flag, as that's a mistake! Andrew On Tue, Jan 9, 2018 at 10:10 AM, Sean Owen

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Sean Owen
Just to follow up -- those are actually in a Palantir repo, not Central. Deploying to Central would be uncourteous, but this approach is legitimate and how it has to work for vendors to release distros of Spark etc. On Tue, Jan 9, 2018 at 11:43 AM Nan Zhu wrote: > Hi,

Re: Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
nvm On Tue, Jan 9, 2018 at 9:42 AM, Nan Zhu wrote: > Hi, all > > Out of curiosity, I just found a bunch of Palantir releases under > org.apache.spark in Maven Central (https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11)? > > Is it on purpose? > > Best, >

Palantir release under org.apache.spark?

2018-01-09 Thread Nan Zhu
Hi, all Out of curiosity, I just found a bunch of Palantir releases under org.apache.spark in Maven Central (https://mvnrepository.com/artifact/org.apache.spark/spark-core_2.11)? Is it on purpose? Best, Nan

Re: Integration testing and Scheduler Backends

2018-01-09 Thread Reynold Xin
If we can actually get our acts together and have integration tests in Jenkins (perhaps not run on every commit but can be run weekly or pre-release smoke tests), that'd be great. Then it relies less on contributors manually testing. On Tue, Jan 9, 2018 at 8:09 AM, Timothy Chen

DataFrame to DataSet[String]

2018-01-09 Thread Lalwani, Jayesh
SPARK-15463 (https://issues.apache.org/jira/browse/SPARK-15463) was implemented in 2.2.0 and it allows you to take a Dataset[String] with raw CSV/JSON and convert it into a DataFrame. Should we have a way to go the other way too? Provide a way to convert DataFrame to Dataset[String]