Re: IO ITs: Hosting Docker images

Ted Yu Sat, 08 Apr 2017 05:31:38 -0700

+1


> On Apr 7, 2017, at 10:46 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
> 
> Hi Stephen,
> 
> I think we should go with 1 and 4:
> 
> 1. Try to use existing images that provide what we need. If we don't find
> an existing image, we can always ask another community to provide one, and
> help them do so.
> 4. If we don't find a suitable image, then while we wait for one we can
> store the image in our own "IT dockerhub".
> 
> Regards
> JB
> 
>> On 04/08/2017 01:03 AM, Stephen Sisk wrote:
>> Wanted to see if anyone else had opinions on this/provide a quick update.
>> 
>> I think for both elasticsearch and HIFIO that we can find existing,
>> supported images that could serve those purposes - HIFIO is looking like
>> it'll be able to do so for cassandra, which was proving tricky.
>> 
>> So to summarize my current proposed solutions: (ordered by my preference)
>> 1. (new) Strongly urge people to find existing docker images that meet our
>> image criteria - regularly updated/security checked
>> 2. Start using helm
>> 3. Push our docker images to docker hub
>> 4. Host our own public container registry
>> 
>> S
>> 
>>> On Tue, Apr 4, 2017 at 10:16 AM Stephen Sisk <s...@google.com> wrote:
>>> 
>>> I'd like to hear what direction folks want to go in, and from there look
>>> at the options. I think for some of these options (like running our own
>>> public registry), they may be able to help, and it's something we should
>>> look at, but I don't assume they have time to work on this type of issue.
>>> 
>>> S
>>> 
>>> On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid>
>>> wrote:
>>> 
>>> Is this something that Apache infra could help us with?
>>> 
>>> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <s...@google.com.invalid>
>>> wrote:
>>> 
>>>> Summary:
>>>> 
>>>> For IO ITs that use data stores that need custom docker images in order
>>>> to run, we can't currently use them in a kubernetes cluster (which is
>>>> where we host our data stores). I have a couple of options for how to
>>>> solve this and am looking for feedback from folks involved in creating
>>>> IO ITs/opinions on kubernetes.
>>>> 
>>>> 
>>>> Details:
>>>> 
>>>> We've discussed in the past that we'll want to allow developers to submit
>>>> just a dockerfile, and then we'll use that when creating the data store
>>>> on kubernetes. This is the case for ElasticsearchIO and I assume more data
>>>> stores in the future will want to do this. It's also looking like it'll
>>>> be necessary to use custom docker images for the HadoopInputFormatIO's
>>>> cassandra ITs - to run a cassandra cluster, there doesn't seem to be a
>>>> good image you can use out of the box.
>>>> 
>>>> In either case, in order to retrieve a docker image, kubernetes needs a
>>>> container registry - it will read the docker images from there. A simple
>>>> private container registry doesn't work because kubernetes config files
>>>> are static - this means that if local devs try to use the kubernetes
>>>> files, they point at the private container registry and they wouldn't be
>>>> able to retrieve the images since they don't have access. They'd have to
>>>> manually edit the files, which in theory is an option, but I don't
>>>> consider that to be acceptable since it feels pretty unfriendly (it is
>>>> simple, so if we really don't like the below options we can revisit it).
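
To make the "manually edit the files" option above concrete, here is a minimal Python sketch of what that edit amounts to: rewriting the registry part of each `image:` line in a kubernetes file. It is only an illustration, not code from the Beam repository. The function names and the registry hosts are hypothetical, and it assumes plain unquoted `image:` values. It relies on the usual Docker convention that the first path component is a registry host only if it contains a "." or ":" or is "localhost".

```python
import re

# Matches lines such as "  - image: host:5000/repo/name:tag" or "    image: name:tag".
IMAGE_LINE = re.compile(r"^(\s*(?:-\s*)?image:\s*)(\S+)\s*$")


def split_registry(image):
    """Return (registry_host_or_None, remainder) for an image reference."""
    first, sep, rest = image.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first, rest
    return None, image


def retarget(image, registry):
    """Point an image reference at a different registry (hypothetical helper)."""
    _, remainder = split_registry(image)
    return f"{registry}/{remainder}"


def retarget_manifest(text, registry):
    """Rewrite every unquoted `image:` line in a manifest to use `registry`."""
    out = []
    for line in text.splitlines():
        match = IMAGE_LINE.match(line)
        if match:
            line = match.group(1) + retarget(match.group(2), registry)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    # The private registry host below is made up for illustration.
    manifest = (
        "spec:\n"
        "  containers:\n"
        "  - image: registry.private.example:5000/beam/cassandra:3.9\n"
        "    name: cassandra\n"
    )
    print(retarget_manifest(manifest, "localhost:5000"))
```

Every local developer would need to run something like this (or edit by hand) before using the checked-in files, which is the unfriendliness described above.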
>>>> 
>>>> Quick summary of the options
>>>> 
>>>> =======================
>>>> 
>>>> We can:
>>>> 
>>>> * Start using something like k8 helm - this adds more dependencies, adds
>>>> a small amount of complexity (this is my recommendation, but only by a
>>>> little)
>>>> 
>>>> * Start pushing images to docker hub - this means they'll be publicly
>>>> visible and raises the bar for maintenance of those images
>>>> 
>>>> * Host our own public container registry - this means running our own
>>>> public service with costs, etc.
>>>> 
>>>> Below are detailed discussions of these options. You can skip to the "My
>>>> thoughts on this" section if you're not interested in the details.
>>>> 
>>>> 
>>>> 1. Templated kubernetes images
>>>> 
>>>> =========================
>>>> 
>>>> Kubernetes (k8) does not currently have built-in support for
>>>> parameterizing scripts - there's an issue open for this [1], but it
>>>> doesn't seem to be very active.
>>>> 
>>>> There are tools like Kubernetes helm that allow users to specify
>>>> parameters when running their kubernetes scripts. They also enable a lot
>>>> more (they're probably closer to a package manager like apt-get) - see
>>>> this description [3] for an overview.
>>>> 
>>>> I'm open to other options besides helm, but it seems to be the officially
>>>> supported one.
>>>> 
>>>> How the world would look using helm:
>>>> 
>>>> * When developing an IO IT, someone (either the developer or one of us),
>>>> would need to create a chart (the name for the helm script) - it's
>>>> basically another set of config files but in theory is as simple as a
>>>> couple metadata files plus a templatized version of a regular k8 script.
>>>> This should be trivial compared to the task of creating a k8 script.
>>>> 
>>>> * When creating an instance of a data store, the developer (or the beam
>>>> CI server) would first build the docker image for the data store and push
>>>> to their container registry, then run a command like `helm install -f
>>>> mydb.yaml --set imageRepo=1.2.3.4`
>>>> 
>>>> * When done running tests/developing/etc., the developer/beam CI server
>>>> would run `helm delete -f mydb.yaml`
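
The parameterization idea in the list above can be illustrated without helm. The sketch below uses Python's standard `string.Template` to fill an image repository into an otherwise static kubernetes file. It is not how helm works internally (helm charts use Go templates), and the manifest, the `render` helper, and the `mydb` name are hypothetical. The `imageRepo` parameter simply mirrors the `--set imageRepo=...` value in the command above.

```python
from string import Template

# A hypothetical, minimal pod definition with placeholders for the parts
# that differ between the CI server and a local developer's setup.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  containers:
  - name: $name
    image: $imageRepo/$name:$tag
""")


def render(name, image_repo, tag="latest"):
    """Fill in the template; raises KeyError if a placeholder is left unset."""
    return MANIFEST_TEMPLATE.substitute(name=name, imageRepo=image_repo, tag=tag)


if __name__ == "__main__":
    # Same checked-in template, different registry per user.
    print(render("mydb", "1.2.3.4"))
```

The point is that the checked-in file stays the same for everyone, and only the value supplied at run time changes.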
>>>> 
>>>> Upsides:
>>>> 
>>>> * Something like helm is pretty interesting - we talked about it as an
>>>> upside and something we wanted to do when we talked about using
>>>> kubernetes
>>>> 
>>>> * We pick up a set of working kubernetes scripts this way. The full list
>>>> is at [2], but some ones that stood out: mongodb, memcached, mysql,
>>>> postgres, redis, elasticsearch (incubating), kafka (incubating),
>>>> zookeeper (incubating) - this could speed development
>>>> 
>>>> Downsides:
>>>> 
>>>> * Adds an additional dependency to run our ITs (helm or another k8
>>>> templating tool)
>>>> 
>>>> * Requires people to build their own images and run a container registry
>>>> if they don't already have one. It will not surprise you that there's a
>>>> docker image for running the registry [0] - so it's not crazy. :) I
>>>> *think* this will probably just be a simple one/two line command once we
>>>> have it scripted.
>>>> 
>>>> * Helm in particular is kind of heavyweight for what we really need - it
>>>> requires running a service in the k8 cluster and adds additional
>>>> complexity.
>>>> 
>>>> * Adds to the complexity of creating a new kubernetes script. Until I've
>>>> tried it, I can't really speak to the complexity, but taking a look at
>>>> the instructions [4], it doesn't seem too bad.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 2. Push images to docker hub
>>>> 
>>>> =======================
>>>> 
>>>> This requires that users push images that we want to use to docker hub,
>>>> and then our IO ITs will rely on that. I think the developer of the
>>>> dockerfile should be responsible for the image - having the beam project
>>>> responsible for a publicly available artifact (like the docker images)
>>>> outside of our core deliverables doesn't seem like the right move.
>>>> 
>>>> We would still retain a copy of the source dockerfiles and could
>>>> regenerate the images at any time, so I'm not concerned about a scenario
>>>> where docker hub went away (it would be pretty simple to switch to
>>>> another repo - just change some config files).
>>>> 
>>>> For someone running the k8 scripts (i.e., running the IO ITs), this is
>>>> pretty easy - they just run the k8 script like they do today.
>>>> 
>>>> For someone creating the k8 scripts (i.e., creating the IO ITs), this is
>>>> more complex - either they or we have to push this to docker hub and make
>>>> sure it's up to date, etc.
>>>> 
>>>> 
>>>> Upsides:
>>>> 
>>>> * No additional complexity for IO IT runners.
>>>> 
>>>> Downsides:
>>>> 
>>>> * Higher bar for creating the image in the first place - someone has to
>>>> maintain the publicly available docker hub image.
>>>> 
>>>> * It seems weird to have a custom docker image up on docker hub - maybe
>>>> that's common, but if we need specific changes to images for our needs,
>>>> I'd prefer it be private.
>>>> 
>>>> 
>>>> 3. Run our own *public* container registry
>>>> 
>>>> ==============================================
>>>> 
>>>> We would run a beam-specific container registry service - it would be
>>>> used by the apache beam CI servers, but it would also be available for
>>>> use by anyone running beam IO ITs on their local dev setup.
>>>> 
>>>> From an IO IT creator's perspective, this would look pretty similar to how
>>>> things are now - they just check in a dockerfile. For someone running the
>>>> k8 scripts, they similarly don't need to think about it.
>>>> 
>>>> Upsides:
>>>> 
>>>> * We're not adding any additional complexity for the end developer
>>>> 
>>>> Downsides:
>>>> 
>>>> * Have to keep docker registry software up to date
>>>> 
>>>> * The service is a single point of failure for any beam devs running IO ITs
>>>> 
>>>> * It can incur costs, etc. As an open source project, it doesn't seem
>>>> great for us to be running a public service.
>>>> 
>>>> 
>>>> 
>>>> My thoughts on this
>>>> 
>>>> ===============
>>>> 
>>>> In spite of the additional complexity, I think using k8 helm is probably
>>>> the best option. The general goal behind the IO ITs has been to keep
>>>> ourselves self-contained: avoid having centralized infrastructure for
>>>> those running the ITs. Helm is a good match for those criteria. I will
>>>> admit that I find the additional dependencies/complexity to be worrisome.
>>>> However, I really like the idea of picking up additional data store
>>>> configs for free - if we were doing this in 5 years, we'd say "we should
>>>> just use the ecosystem of helm charts" and go from there.
>>>> 
>>>> I do think that pushing images to docker hub is a viable option, and if
>>>> the community is more excited to do that/wants to push the images there,
>>>> I'd support it. I can see how folks would be hesitant. I would like for
>>>> the developer of the docker file to do
>>>> 
>>>> Of the 3 options, I would strongly push back against running a public
>>>> container registry - I would not want to administer it, and I don't think
>>>> we as a project want to be paying for the costs associated with it.
>>>> 
>>>> Next steps
>>>> 
>>>> =========
>>>> 
>>>> Let me know what you think! This is definitely a topic where
>>>> understanding what the community of IO devs wants is helpful. As we
>>>> discuss, I'll probably spend a little time exploring helm since I want to
>>>> play around with it and understand if there are other drawbacks. I ran
>>>> into this question while working on getting the HIFIO cassandra cluster
>>>> running, so I might prototype with that.
>>>> 
>>>> I'll create a JIRA for this in the next day or so.
>>>> 
>>>> Stephen
>>>> 
>>>> 
>>>> 
>>>> [0] docker registry container - https://hub.docker.com/_/registry/
>>>> 
>>>> [1] kubernetes issue open for supporting templates -
>>>> https://github.com/kubernetes/kubernetes/issues/23896
>>>> 
>>>> [2] set of available charts - https://github.com/kubernetes/charts
>>>> 
>>>> [3] kubernetes helm introduction -
>>>> https://deis.com/blog/2015/introducing-helm-for-kubernetes/
>>>> [4] kubernetes charts instructions -
>>>> https://github.com/kubernetes/helm/blob/master/docs/charts.md
> 
> -- 
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
