[ https://issues.apache.org/jira/browse/HADOOP-14898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Elek, Marton updated HADOOP-14898:
----------------------------------
    Attachment: HADOOP-14898.003.tgz

Third version of the base image. It includes support for Ozone SCM creation (can be turned on with an env variable).

> Create official Docker images for development and testing features
> -------------------------------------------------------------------
>
>                 Key: HADOOP-14898
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14898
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>         Attachments: HADOOP-14898.001.tar.gz, HADOOP-14898.002.tar.gz, HADOOP-14898.003.tgz
>
>
> This is the original mail from the mailing list:
> {code}
> TL;DR: I propose to create official Hadoop images and upload them to the Docker Hub.
>
> GOAL/SCOPE: I would like to improve the existing documentation with easy-to-use, Docker-based recipes to start Hadoop clusters with various configurations.
> The images could also be used to test experimental features. For example, Ozone could be tested easily with this compose file and configuration:
> https://gist.github.com/elek/1676a97b98f4ba561c9f51fce2ab2ea6
> Or the configuration could even be included in the compose file:
> https://github.com/elek/hadoop/blob/docker-2.8.0/example/docker-compose.yaml
> I would like to create separate example compose files for federation, HA, metrics usage, etc. to make it easier to try out and understand the features.
>
> CONTEXT: There is an existing Jira:
> https://issues.apache.org/jira/browse/HADOOP-13397
> But it's about a tool to generate production-quality Docker images (multiple types, in a flexible way). If there are no objections, I will create a separate issue to create simplified Docker images for rapid prototyping and investigating new features, and register the branch on the Docker Hub to create the images automatically.
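A compose-based recipe of the kind referenced above might look roughly like the following sketch; the image name, commands, and port are illustrative assumptions, not the contents of the linked gist:

```yaml
# Hypothetical docker-compose.yaml for a minimal HDFS cluster.
# Image name, commands, and ports are assumptions for illustration.
version: "2"
services:
  namenode:
    image: apache/hadoop:2.7.3      # assumed image name on the Docker Hub
    command: hdfs namenode
    ports:
      - "50070:50070"               # default NameNode web UI port in Hadoop 2.x
  datanode:
    image: apache/hadoop:2.7.3
    command: hdfs datanode
```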
>
> MY BACKGROUND: I have been working with Docker-based Hadoop/Spark clusters for quite a while and have run them successfully in different environments (Kubernetes, Docker Swarm, Nomad-based scheduling, etc.). My work is available here:
> https://github.com/flokkr
> but those images handle more complex use cases (e.g. instrumenting Java processes with btrace, or reading/reloading configuration from Consul).
> And IMHO it's better if the official Hadoop documentation suggests using official Apache Docker images rather than external ones (which could change).
> {code}
>
> The following list enumerates the key decision points regarding Docker image creation.
>
> A. Automated Docker Hub build / Jenkins build
>
> Docker images could be built on the Docker Hub (a branch pattern would have to be defined for a GitHub repository, together with the location of the Dockerfiles), or they could be built on a CI server and pushed.
> The second option is more flexible (it is easier to create a matrix build, for example).
> The first has the advantage that we get an additional flag on the Docker Hub indicating that the build is automated (and built from source by the Docker Hub).
> The decision is easy, as the ASF supports the first approach (see
> https://issues.apache.org/jira/browse/INFRA-12781?focusedCommentId=15824096&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15824096)
>
> B. Source: binary distribution or source build
>
> The second question is about creating the Docker image. One option is to build the software on the fly during the creation of the Docker image; the other is to use the binary releases.
> I suggest the second approach, as:
> 1. In that case hadoop:2.7.3 could contain exactly the same Hadoop distribution as the downloadable one.
> 2. We don't need to add development tools to the image, so the image can be smaller (which is important, as the goal of this image is getting started as fast as possible).
> 3.
The Docker definition will be simpler (and easier to maintain).
> This approach is also commonly used in other projects (I checked Apache Zeppelin and Apache Nutch).
>
> C. Branch usage
>
> Another question is the location of the Dockerfile. It could live on the official source-code branches (branch-2, trunk, etc.), or we could create separate branches for the Docker Hub (e.g. docker/2.7, docker/2.8, docker/3.0).
> With the first approach it's easier to find the Docker images, but it's less flexible. For example, if we had a Dockerfile on the source-code branches, it would have to be used for every release (e.g. the Dockerfile from the tag release-3.0.0 would be used for the 3.0 Hadoop Docker image). In that case the release process becomes much harder: in case of a Dockerfile error (which could be tested on the Docker Hub only after tagging), a new release would have to be created after fixing the Dockerfile.
> Another problem is that with tags it's not possible to improve the Dockerfiles. I can imagine that we would like to improve, for example, the hadoop:2.7 images (e.g. adding smarter startup scripts) while using exactly the same Hadoop 2.7 distribution.
> Finally, with the tag-based approach we can't create images for older releases (2.8.1, for example).
> So I suggest creating separate branches for the Dockerfiles.
>
> D. Versions
>
> We could create a separate branch for every version (2.7.1/2.7.2/2.7.3) or just for the main version (2.8/2.7). As these Docker images are not for production but for prototyping, I suggest using (at least as a first step) just 2.7/2.8 and updating the images during the bugfix releases.
>
> E. Number of images
>
> There are two options here, too: create a separate image for every component (namenode, datanode, etc.), or just one image, with the command defined manually everywhere. The second seems more complex (to use), but I think the maintenance is easier, and it's more visible what should be started.
>
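A Dockerfile based on the binary distribution (option B above) could be as small as the following sketch; the base image, download URL, and paths are assumptions for illustration:

```dockerfile
# Sketch of option B: package the official binary release instead of building from source.
# Base image, download URL, and final CMD are assumptions, not a committed design.
FROM openjdk:8-jre-alpine
ENV HADOOP_VERSION 2.7.3
RUN wget -q "https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz" \
 && tar -xzf "hadoop-${HADOOP_VERSION}.tar.gz" -C /opt \
 && rm "hadoop-${HADOOP_VERSION}.tar.gz"
ENV HADOOP_HOME /opt/hadoop-$HADOOP_VERSION
ENV PATH $PATH:$HADOOP_HOME/bin
CMD ["hadoop", "version"]
```

Because no build toolchain is installed, the resulting image stays close to the size of the binary release itself, and its contents match the downloadable distribution exactly.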
> F. Snapshots
>
> According to the spirit of the release policy
> https://www.apache.org/dev/release-distribution.html#unreleased
> we should distribute only final releases to the Docker Hub, not snapshots.
> But we could also create an empty hadoop-runner image, which contains the starter scripts but not Hadoop itself. It would be used for local development, where the newly built distribution could be mapped into the image with Docker volumes.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
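The development workflow described in point F could look roughly like this; the hadoop-runner image name, the Maven profile, and the output path are illustrative assumptions:

```bash
# Hypothetical point-F workflow: an (assumed) apache/hadoop-runner image contains
# only the startup scripts; the locally built distribution is volume-mounted into it.
mvn clean package -DskipTests -Pdist -Dtar      # build the Hadoop distribution locally

docker run -it \
  -v "$(pwd)/hadoop-dist/target/hadoop-3.0.0-SNAPSHOT:/opt/hadoop" \
  apache/hadoop-runner                          # assumed image name
```

This keeps unreleased bits off the Docker Hub entirely: only the scripts are published, while the snapshot build never leaves the developer's machine.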