Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-28 Thread Kimoon Kim
Thanks for starting this discussion.

When I was troubleshooting Spark on K8s, I often faced a need to turn on
debug messages on the driver and executor pods of my jobs, which would be
possible if I somehow put the right log4j.properties file inside the pods.
I know I can build custom Docker images, but that seems like too much. (So,
being lazy, I usually just gave up.)

If there is an alternative mechanism, like using a ConfigMap, I would
prefer that for this log4j need. Maybe we should document the possible
alternatives to building Docker images for certain use cases and guide
people toward the right mechanisms?
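
As a rough illustration of the ConfigMap route, here is a minimal sketch (Python, using the official kubernetes client) that publishes a debug-level log4j.properties as a ConfigMap. The ConfigMap name, the namespace, and especially how the driver/executor pods would mount it and point log4j at it are assumptions on my part - the back-end does not define such a mechanism today.

# Sketch only: creates a ConfigMap holding a debug-level log4j.properties.
# How the Spark driver/executor pods would mount it (volume name, mount path)
# is assumed here and not something the current back-end wires up for you.
from kubernetes import client, config

LOG4J_PROPERTIES = """
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
"""

config.load_kube_config()  # or load_incluster_config() when run inside a pod
core = client.CoreV1Api()
core.create_namespaced_config_map(
    namespace="default",  # assumption: the namespace the job runs in
    body=client.V1ConfigMap(
        metadata=client.V1ObjectMeta(name="spark-log4j"),  # hypothetical name
        data={"log4j.properties": LOG4J_PROPERTIES},
    ),
)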

Thanks,
Kimoon


On Wed, Mar 21, 2018 at 10:54 PM, Felix Cheung wrote:

> I like being able to customize the docker image itself - but I realize
> this thread is more about “API” for the stock image.
>
> Environment is nice. Probably we need a way to set custom spark config (as
> a file??)
>


Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Matt Cheah
Re: Hadoop versioning – it seems reasonable enough for us to be publishing an 
image per Hadoop version. We should essentially have image configuration parity 
with what we publish as distributions on the Spark website.

 

Sometimes jars need to be swapped out entirely instead of being strictly 
additive. An example is a user wanting to build an application that depends on 
a different version of an existing dependency. Instead of adding multiple jars 
with different versions to the classpath, they would like to put their own jars 
that their application has perhaps resolved via Maven. (They could use the 
userClassPathFirst constructs, but in practice that doesn’t always work, 
particularly for jars that have to be present at JVM boot time.) So having an 
extra image version that is “empty” without any jars is reasonable. In this 
case, we’d want to define the API for where the image’s jars have to live – 
perhaps in a fixed directory like /opt/spark/jars, or specified by some 
environment variable that the entrypoint knows to look up. I like the idea of 
having that location defined by an environment variable, since it allows for 
more flexibility – but the tradeoff between those two options seems negligible.
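
As a sketch of the environment-variable option, something like the following is roughly what an entrypoint could do; SPARK_JARS_DIR is a hypothetical variable name, not something the current images or back-end define.

# Hypothetical sketch: resolve the jar directory from an env var, falling back
# to a fixed location, and build the launch classpath from whatever jars are there.
import glob
import os

jars_dir = os.environ.get("SPARK_JARS_DIR", "/opt/spark/jars")  # hypothetical variable
jars = sorted(glob.glob(os.path.join(jars_dir, "*.jar")))
classpath = os.pathsep.join(jars)
print(classpath)  # a real entrypoint would pass this to the JVM it launches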

 

From: "Lalwani, Jayesh" 
Date: Thursday, March 22, 2018 at 10:19 AM
To: Rob Vesse , "dev@spark.apache.org" 

Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

 

I would like to add that many people run Spark behind corporate proxies. It’s 
very common to add HTTP proxy settings to extraJavaOptions. Being able to provide 
custom extraJavaOptions should be supported.

Also, Hadoop FS 2.7.3 is pretty limited with respect to S3 buckets. You cannot use 
temporary AWS tokens. You cannot assume roles. You cannot use KMS-encrypted buckets. 
All of this comes out of the box on EMR because EMR is built with its own 
customized Hadoop FS. For standalone installations, it’s pretty common to 
“customize” your Spark installation using Hadoop 2.8.3 or higher. I don’t know 
if a Spark container with Hadoop 2.8.3 will be a standard container. If it 
isn’t, I see a lot of people creating a customized container with Hadoop FS 
2.8.3.



Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Lalwani, Jayesh
I would like to add that many people run Spark behind corporate proxies. It’s 
very common to add HTTP proxy settings to extraJavaOptions. Being able to provide 
custom extraJavaOptions should be supported.
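
For illustration, this is roughly what the proxy configuration looks like from the submission side (PySpark); the proxy host and port are placeholders, and how a stock image should accept such options - config, env var, or a mounted file - is exactly the open question.

# Sketch: passing corporate proxy settings to driver and executors via
# extraJavaOptions. proxy.example.com:8080 is a placeholder, not a real proxy.
from pyspark.sql import SparkSession

proxy_opts = ("-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 "
              "-Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080")

spark = (SparkSession.builder
         .appName("behind-a-proxy")
         .config("spark.driver.extraJavaOptions", proxy_opts)
         .config("spark.executor.extraJavaOptions", proxy_opts)
         .getOrCreate())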

Also, Hadoop FS 2.7.3 is pretty limited with respect to S3 buckets. You cannot use 
temporary AWS tokens. You cannot assume roles. You cannot use KMS-encrypted buckets. 
All of this comes out of the box on EMR because EMR is built with its own 
customized Hadoop FS. For standalone installations, it’s pretty common to 
“customize” your Spark installation using Hadoop 2.8.3 or higher. I don’t know 
if a Spark container with Hadoop 2.8.3 will be a standard container. If it 
isn’t, I see a lot of people creating a customized container with Hadoop FS 
2.8.3.
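
To make the Hadoop-version point concrete, here is a hedged sketch of the kind of S3 access that only works once the image carries Hadoop 2.8+ on the classpath (the TemporaryAWSCredentialsProvider); the bucket and credentials below are placeholders.

# Sketch: session-token (temporary credential) access to S3 via the s3a connector.
# Requires Hadoop 2.8+ inside the image; the values below are placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.hadoop.fs.s3a.aws.credentials.provider",
                 "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
         .config("spark.hadoop.fs.s3a.access.key", "ASIA...")     # placeholder
         .config("spark.hadoop.fs.s3a.secret.key", "<secret>")    # placeholder
         .config("spark.hadoop.fs.s3a.session.token", "<token>")  # placeholder
         .getOrCreate())

df = spark.read.json("s3a://some-bucket/some-prefix/")  # hypothetical bucket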

From: Rob Vesse 
Date: Thursday, March 22, 2018 at 6:11 AM
To: "dev@spark.apache.org" 
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

The difficulty with a custom Spark config is that you need to be careful that 
the Spark config the user provides does not conflict with the auto-generated 
portions of the Spark config necessary to make Spark on K8S work.  So part of 
any “API” definition might need to be what Spark config is considered “managed” 
by the Kubernetes scheduler backend.

For more controlled environments - i.e. security-conscious ones - allowing end users 
to provide custom images may be a non-starter, so the more we can do at the 
“API” level without customising the containers the better.  A practical example 
of this is managing Python dependencies: one option we’re considering is having 
a base image with Anaconda included and then simply projecting a Conda 
environment spec into the containers (via volume mounts) and then having the 
container recreate that Conda environment on startup.  That won’t work for all 
possible environments, e.g. those that use non-standard Conda channels, but it 
would provide a lot of capability without customising the images.

Rob


Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-22 Thread Rob Vesse
The difficulty with a custom Spark config is that you need to be careful that 
the Spark config the user provides does not conflict with the auto-generated 
portions of the Spark config necessary to make Spark on K8S work.  So part of 
any “API” definition might need to be what Spark config is considered “managed” 
by the Kubernetes scheduler backend.
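
A rough sketch of what “managed” config could mean in practice - values the scheduler backend generates always win, and user attempts to set them are ignored. The particular keys listed are illustrative only, not the actual list the backend owns.

# Illustrative sketch of "managed" config: backend-generated values always
# override user-supplied values. The key names below are examples only.
MANAGED_KEYS = {
    "spark.kubernetes.driver.pod.name",
    "spark.kubernetes.container.image",
    "spark.executor.instances",
}

def merge_conf(user_conf, backend_conf):
    dropped = MANAGED_KEYS & set(user_conf)
    if dropped:
        print("ignoring user-supplied managed settings: %s" % sorted(dropped))
    merged = {k: v for k, v in user_conf.items() if k not in MANAGED_KEYS}
    merged.update(backend_conf)  # backend-generated values win
    return merged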

 

For more controlled environments - i.e. security-conscious ones - allowing end users 
to provide custom images may be a non-starter, so the more we can do at the 
“API” level without customising the containers the better.  A practical example 
of this is managing Python dependencies: one option we’re considering is having 
a base image with Anaconda included and then simply projecting a Conda 
environment spec into the containers (via volume mounts) and then having the 
container recreate that Conda environment on startup.  That won’t work for all 
possible environments, e.g. those that use non-standard Conda channels, but it 
would provide a lot of capability without customising the images.
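
A minimal sketch of the Conda-projection idea, assuming the spec is volume-mounted at a fixed path and the image already ships conda; both the path and the environment name are assumptions.

# Sketch of a container start-up step: recreate a Conda environment from a
# volume-mounted spec. Paths and names are assumptions, not part of the images.
import subprocess

ENV_SPEC = "/mnt/conda/environment.yml"  # hypothetical mount point for the spec
ENV_NAME = "job-env"                     # hypothetical environment name

subprocess.run(
    ["conda", "env", "create", "--name", ENV_NAME, "--file", ENV_SPEC],
    check=True,
)
# The entrypoint would then activate this environment (or prefix PATH with its
# bin directory) before launching the Python worker.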

 

Rob

 

From: Felix Cheung 
Date: Thursday, 22 March 2018 at 06:21
To: Holden Karau , Erik Erlandson 
Cc: dev 
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

 

I like being able to customize the docker image itself - but I realize this 
thread is more about “API” for the stock image.

 

Environment is nice. Probably we need a way to set custom spark config (as a 
file??)

 

 


Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Felix Cheung
I like being able to customize the docker image itself - but I realize this 
thread is more about “API” for the stock image.

Environment is nice. Probably we need a way to set custom spark config (as a 
file??)
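
One way the “config as a file” idea could look, sketched in PySpark: parse a properties file projected into the container and apply it when building the session. The file path and the whitespace-separated key/value format (mirroring spark-defaults.conf) are assumptions, not an existing mechanism of the stock images.

# Sketch: apply a projected properties file (hypothetical path) to the session.
from pyspark.sql import SparkSession

CONF_FILE = "/opt/spark/conf-extra/user.conf"  # hypothetical mounted path

def load_props(path):
    props = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # spark-defaults.conf style: "key value"
            if len(parts) == 2:
                props[parts[0]] = parts[1].strip()
    return props

builder = SparkSession.builder
for key, value in load_props(CONF_FILE).items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()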



From: Holden Karau 
Sent: Wednesday, March 21, 2018 10:44:20 PM
To: Erik Erlandson
Cc: dev
Subject: Re: Toward an "API" for spark images used by the Kubernetes back-end

I’m glad this discussion is happening on dev@ :)

Personally I like customizing with shell env variables when rolling my own 
image, but we definitely need to document the expectations/usage of the variables 
before we can really call it an API.

On the related question, I suspect two of the more “common” likely 
customizations are adding additional jars needed to bootstrap fetching from a DFS 
and similarly complicated Python dependencies (although given that the Python 
support isn’t merged yet, it’s hard to say what exactly this would look like).

I could also see some vendors wanting to add some bootstrap/setup scripts to 
fetch keys or other things.

What other ways do folks foresee customizing their Spark docker containers?



Re: Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Holden Karau
I’m glad this discussion is happening on dev@ :)

Personally I like customizing with shell env variables when rolling my
own image, but we definitely need to document the expectations/usage of the
variables before we can really call it an API.

On the related question, I suspect two of the more “common” likely
customizations are adding additional jars needed to bootstrap fetching from a
DFS and similarly complicated Python dependencies (although given that the
Python support isn’t merged yet, it’s hard to say what exactly this would
look like).

I could also see some vendors wanting to add some bootstrap/setup scripts
to fetch keys or other things.

What other ways do folks foresee customizing their Spark docker containers?

On Wed, Mar 21, 2018 at 5:04 PM Erik Erlandson  wrote:

> During the review of the recent PR to remove use of the init_container
> from kube pods as created by the Kubernetes back-end, the topic of
> documenting the "API" for these container images also came up. What
> information does the back-end provide to these containers? In what form?
> What assumptions does the back-end make about the structure of these
> containers?  This information is important in a scenario where a user wants
> to create custom images, particularly if these are not based on the
> reference dockerfiles.
>
> A related topic is deciding what such an API should look like.  For
> example, early incarnations were based more purely on environment
> variables, which could have advantages in terms of an API that is easy to
> describe in a document.  If we document the current API, should we annotate
> it as Experimental?  If not, does that effectively freeze the API?
>
> We are interested in community input about possible customization use
> cases and opinions on possible API designs!
> Cheers,
> Erik
>
-- 
Twitter: https://twitter.com/holdenkarau


Toward an "API" for spark images used by the Kubernetes back-end

2018-03-21 Thread Erik Erlandson
During the review of the recent PR to remove use of the init_container from
kube pods as created by the Kubernetes back-end, the topic of documenting
the "API" for these container images also came up. What information does
the back-end provide to these containers? In what form? What assumptions
does the back-end make about the structure of these containers?  This
information is important in a scenario where a user wants to create custom
images, particularly if these are not based on the reference dockerfiles.

A related topic is deciding what such an API should look like.  For
example, early incarnations were based more purely on environment
variables, which could have advantages in terms of an API that is easy to
describe in a document.  If we document the current API, should we annotate
it as Experimental?  If not, does that effectively freeze the API?

We are interested in community input about possible customization use cases
and opinions on possible API designs!
Cheers,
Erik