Thanks for starting this discussion.
When I was troubleshooting Spark on K8s, I often needed to turn on
debug logging on the driver and executor pods of my jobs, which would be
possible if I could somehow put the right log4j.properties file inside the pods.
I know I can build custom Docker
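For reference, one way to do this without baking a new image is to ship the file at submit time and point the JVMs at it; a minimal sketch (the local path and trailing arguments are placeholders):

```shell
# Ship a custom log4j.properties to driver and executors at submit time.
# Files passed via --files are placed in each container's working
# directory, so a relative path works in the JVM option.
spark-submit \
  --files /local/path/log4j.properties \
  --conf spark.driver.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
  --conf spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
  ...
```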
Re: Hadoop versioning – it seems reasonable enough for us to be publishing an
image per Hadoop version. We should essentially have image configuration parity
with what we publish as distributions on the Spark website.
Sometimes jars need to be swapped out entirely instead of being strictly
I would like to add that many people run Spark behind corporate proxies. It’s
very common to add HTTP proxy settings to extraJavaOptions, so being able to
provide custom extraJavaOptions should be supported.
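For context, the proxy case usually looks something like this (the host and port are made up; these are the standard JVM proxy system properties):

```shell
# Pass JVM proxy properties to both driver and executors.
# proxy.example.com:8080 is a placeholder for a real corporate proxy.
spark-submit \
  --conf "spark.driver.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" \
  --conf "spark.executor.extraJavaOptions=-Dhttp.proxyHost=proxy.example.com -Dhttp.proxyPort=8080 -Dhttps.proxyHost=proxy.example.com -Dhttps.proxyPort=8080" \
  ...
```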
Also, Hadoop FS 2.7.3 is pretty limited wrt S3 buckets. You cannot use
temporary AWS tokens. You
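For anyone hitting this: with a newer hadoop-aws (2.8+) on the classpath, temporary session credentials can be configured roughly as below (all credential values are obviously placeholders):

```shell
# Requires hadoop-aws 2.8+; the s3a connector in Hadoop 2.7.3 has no
# TemporaryAWSCredentialsProvider. Credential values are placeholders.
spark-submit \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider \
  --conf spark.hadoop.fs.s3a.access.key=AKIA... \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET... \
  --conf spark.hadoop.fs.s3a.session.token=TOKEN... \
  ...
```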
The difficulty with a custom Spark config is that you need to be careful that
the Spark config the user provides does not conflict with the auto-generated
portions of the Spark config necessary to make Spark on K8S work. So part of
any “API” definition might need to be what Spark config is
I like being able to customize the docker image itself - but I realize this
thread is more about “API” for the stock image.
Environment variables are nice. We probably also need a way to set custom
Spark config (as a file??)
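One possible shape for the "config as a file" idea, assuming the image keeps Spark's usual convention of reading spark-defaults.conf from $SPARK_CONF_DIR (the base image tag and paths below are guesses, not the current image layout):

```shell
# Sketch: bake a spark-defaults.conf into a derived image. Spark picks
# up $SPARK_CONF_DIR/spark-defaults.conf at startup; /opt/spark/conf
# and the spark-base tag are hypothetical here.
cat > spark-defaults.conf <<'EOF'
spark.driver.extraJavaOptions  -Dlog4j.configuration=file:/opt/spark/conf/log4j.properties
EOF

cat > Dockerfile <<'EOF'
FROM spark-base:latest
COPY spark-defaults.conf /opt/spark/conf/spark-defaults.conf
EOF

# then: docker build -t my-spark:custom .
```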
From: Holden Karau
Sent:
I’m glad this discussion is happening on dev@ :)
Personally I like customizing with shell env variables when rolling my
own image, but we definitely need to document the expectations/usage of
the variables before we can really call it an API.
On the related question, I suspect two of the
During the review of the recent PR to remove use of the init_container from
kube pods as created by the Kubernetes back-end, the topic of documenting
the "API" for these container images also came up. What information does
the back-end provide to these containers? In what form? What assumptions