Is this something that Apache infra could help us with?

On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <[email protected]>
wrote:

> Summary:
>
> For IO ITs that use data stores that need custom docker images in order to
> run, we can't currently use them in a kubernetes cluster (which is where we
> host our data stores). I have a couple of options for how to solve this and
> am looking for feedback from folks involved in creating IO ITs, as well as
> opinions on kubernetes.
>
>
> Details:
>
> We've discussed in the past that we'll want to allow developers to submit
> just a dockerfile, and then we'll use that when creating the data store on
> kubernetes. This is the case for ElasticsearchIO and I assume more data
> stores in the future will want to do this. It's also looking like it'll be
> necessary to use custom docker images for the HadoopInputFormatIO's
> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a good
> image you can use out of the box.
>
> In either case, in order to retrieve a docker image, kubernetes needs a
> container registry - it will read the docker images from there. A simple
> private container registry doesn't work because kubernetes config files are
> static - this means that if local devs try to use the kubernetes files,
> they point at the private container registry and they wouldn't be able to
> retrieve the images since they don't have access. They'd have to manually
> edit the files, which in theory is an option, but I don't consider that to
> be acceptable since it feels pretty unfriendly (it is simple, so if we
> really don't like the below options we can revisit it.)
>
> Quick summary of the options
>
> =======================
>
> We can:
>
> * Start using something like k8 helm - this adds more dependencies, adds a
> small amount of complexity (this is my recommendation, but only by a
> little)
>
> * Start pushing images to docker hub - this means they'll be publicly
> visible and raises the bar for maintenance of those images
>
> * Host our own public container registry - this means running our own
> public service with costs, etc.
>
> Below are detailed discussions of these options. You can skip to the "My
> thoughts on this" section if you're not interested in the details.
>
>
> 1. Templated kubernetes images
>
> =========================
>
> Kubernetes (k8) does not currently have built-in support for parameterizing
> scripts - there's an issue open for this [1], but it doesn't seem to be
> very active.
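
To make the templating gap above concrete, here is a minimal sketch of what "parameterizing" a kubernetes config means: the registry address becomes a placeholder that is filled in at deploy time instead of being hard-coded in a static file. It uses Python's standard `string.Template` purely for illustration, not helm's own template syntax, and the pod name, image name, and `image_repo` placeholder are hypothetical.

```python
from string import Template

# Hypothetical manifest template. The pod name, image name, and the
# ${image_repo} placeholder are illustrative and not taken from the Beam repo.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: cassandra-it
spec:
  containers:
  - name: cassandra
    image: ${image_repo}/cassandra-it:latest
""")


def render_manifest(image_repo: str) -> str:
    """Return the manifest text with the registry address filled in."""
    # substitute() raises KeyError if a placeholder is left unfilled,
    # so a half-rendered manifest is never produced silently.
    return MANIFEST_TEMPLATE.substitute(image_repo=image_repo)


if __name__ == "__main__":
    # Each developer (or the CI server) passes their own registry address.
    print(render_manifest("1.2.3.4:5000"))
```

With something like this, the checked-in file stays the same for everyone and only the value passed at render time differs between the CI server and a local dev setup.
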
>
> There are tools like Kubernetes helm that allow users to specify parameters
> when running their kubernetes scripts. They also enable a lot more (they're
> probably closer to a package manager like apt-get) - see this
> description[3] for an overview.
>
> I'm open to other options besides helm, but it seems to be the officially
> supported one.
>
> How the world would look using helm:
>
> * When developing an IO IT, someone (either the developer or one of us)
> would need to create a chart (the name for the helm script). It's
> basically another set of config files, but in theory it is as simple as a
> couple of metadata files plus a templatized version of a regular k8 script.
> This should be trivial compared to the task of creating a k8 script.
>
> * When creating an instance of a data store, the developer (or the beam CI
> server) would first build the docker image for the data store and push to
> their container registry, then run a command like `helm install -f
> mydb.yaml --set imageRepo=1.2.3.4`
>
> * When done running tests, developing, etc., the developer/beam CI server
> would run `helm delete -f mydb.yaml`
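
The build, push, and install steps in the list above could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name, the image name, and the chart directory argument are hypothetical; the `imageRepo` value name is taken from the example command above; and the exact `helm install` arguments depend on the helm version in use. The script only builds and prints the command lines; it does not run them.

```python
import shlex


def it_datastore_commands(registry, image_name, dockerfile_dir,
                          values_file, chart_dir):
    """Build the command lines for bringing up a data store for an IO IT.

    All arguments are hypothetical illustration values; nothing here is
    executed, the commands are only returned as argument lists.
    """
    image = f"{registry}/{image_name}"
    return [
        # Build the data store image from the checked-in dockerfile.
        ["docker", "build", "-t", image, dockerfile_dir],
        # Push it to the developer's (or the CI server's) own registry.
        ["docker", "push", image],
        # Install the chart, pointing it at that registry. Passing a chart
        # directory is an assumption; the command in the text omits it.
        ["helm", "install", chart_dir, "-f", values_file,
         "--set", f"imageRepo={registry}"],
    ]


if __name__ == "__main__":
    for cmd in it_datastore_commands("1.2.3.4:5000", "cassandra-it",
                                     "./docker", "mydb.yaml", "./chart"):
        print(shlex.join(cmd))
```
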
>
> Upsides:
>
> * Something like helm is pretty interesting - we talked about it as an
> upside and something we wanted to do when we talked about using kubernetes
>
> * We pick up a set of working kubernetes scripts this way. The full list is
> at [2], but some ones that stood out: mongodb, memcached, mysql, postgres,
> redis, elasticsearch (incubating), kafka (incubating), zookeeper
> (incubating) - this could speed development
>
> Downsides:
>
> * Adds an additional dependency to run our ITs (helm or another k8
> templating tool)
>
> * Requires people to build their own images and run a container registry if
> they don't already have one. It will not surprise you that there's a docker
> image for running the registry [0], so it's not crazy. :) I *think* this
> will probably just be a simple one/two line command once we have it
> scripted.
>
> * Helm in particular is kind of heavyweight for what we really need - it
> requires running a service in the k8 cluster and adds additional
> complexity.
>
> * Adds to the complexity of creating a new kubernetes script. Until I've
> tried it, I can't really speak to the complexity, but taking a look at the
> instructions [4], it doesn't seem too bad.
>
>
>
>
> 2. Push images to docker hub
>
> =======================
>
> This requires that users push images that we want to use to docker hub, and
> then our IO ITs will rely on that. I think the developer of the dockerfile
> should be responsible for the image - having the beam project responsible
> for a publicly available artifact (like the docker images) outside of our
> core deliverables doesn't seem like the right move.
>
> We would still retain a copy of the source dockerfiles and could regenerate
> the images at any time, so I'm not concerned about a scenario where docker
> hub went away (it would be pretty simple to switch to another repo - just
> change some config files.)
>
> For someone running the k8 scripts (ie, running the IO ITs), this is pretty
> easy - they just run the k8 script like they do today.
>
> For someone creating the k8 scripts (ie, creating the IO ITs), this is more
> complex - either they or we have to push the image to docker hub, make sure
> it's up to date, etc.
>
>
> Upsides:
>
> * No additional complexity for IO IT runners.
>
> Downsides:
>
> * Higher bar for creating the image in the first place - someone has to
> maintain the publicly available docker hub image.
>
> * It seems weird to have a custom docker image up on docker hub - maybe
> that's common, but if we need specific changes to images for our needs, I'd
> prefer it be private.
>
>
> 3. Run our own *public* container registry
>
> ==============================================
>
> We would run a beam-specific container registry service - it would be used
> by the apache beam CI servers, but it would also be available for use by
> anyone running beam IO ITs on their local dev setup.
>
> From an IO IT creator's perspective, this would look pretty similar to how
> things are now - they just check in a dockerfile. For someone running the
> k8 scripts, they similarly don't need to think about it.
>
> Upsides:
>
> * We're not adding any additional complexity for the end developer
>
> Downsides:
>
> * Have to keep docker registry software up to date
>
> * The service is a single point of failure for any beam devs running IO ITs
>
> * It can incur costs, etc. As an open source project, it doesn't seem great
> for us to be running a public service.
>
>
>
> My thoughts on this
>
> ===============
>
> In spite of the additional complexity, I think using k8 helm is probably
> the best option. The general goal behind the IO ITs has been to keep
> ourselves self-contained: avoid having centralized infrastructure for those
> running the ITs. Helm is a good match for those criteria. I will admit that
> I find the additional dependencies/complexity to be worrisome. However, I
> really like the idea of picking up additional data store configs for free -
> if we were doing this in 5 years, we'd say "we should just use the
> ecosystem of helm charts" and go from there.
>
> I do think that pushing images to docker hub is a viable option, and if the
> community is more excited to do that/wants to push the images there, I'd
> support it. I can see how folks would be hesitant. I would like for the
> developer of the docker file to do
>
> Of the 3 options, I would strongly push back against running a public
> container registry - I would not want to administer it, and I don't think
> we as a project want to be paying for the costs associated with it.
>
> Next steps
>
> =========
>
> Let me know what you think! This is definitely a topic where understanding
> what the community of IO devs wants is helpful. As we discuss, I'll
> probably spend a little time exploring helm since I want to play around
> with it and understand if there are other drawbacks. I ran into this
> question while working on getting the HIFIO cassandra cluster running, so I
> might prototype with that.
>
> I'll create a JIRA issue for this in the next day or so.
>
> Stephen
>
>
>
> [0] docker registry container - https://hub.docker.com/_/registry/
>
> [1] kubernetes issue open for supporting templates -
> https://github.com/kubernetes/kubernetes/issues/23896
>
> [2] set of available charts - https://github.com/kubernetes/charts
>
> [3] kubernetes helm introduction -
> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
>
> [4] kubernetes charts instructions -
> https://github.com/kubernetes/helm/blob/master/docs/charts.md
>
