[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574455#comment-16574455 ]

ASF GitHub Bot commented on FLINK-10063:

asfgit closed pull request #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/flink-jepsen/docker/Dockerfile-db b/flink-jepsen/docker/Dockerfile-db
index 1555329af3f..cb60efce2e5 100644
--- a/flink-jepsen/docker/Dockerfile-db
+++ b/flink-jepsen/docker/Dockerfile-db
@@ -21,7 +21,7 @@ FROM debian:jessie
 RUN echo "deb http://http.debian.net/debian jessie-backports main" >> /etc/apt/sources.list && \
     apt-get update && \
     apt-get install -y -t jessie-backports openjdk-8-jdk && \
-    apt-get install -y apt-utils bzip2 curl faketime iproute iptables iputils-ping less libzip2 logrotate man man-db net-tools ntpdate psmisc python rsyslog sudo sysvinit sysvinit-core sysvinit-utils tar unzip vim wget
+    apt-get install -y apt-utils bzip2 curl faketime iproute iptables iputils-ping less libzip2 logrotate man man-db net-tools ntpdate psmisc python rsyslog runit sudo sysvinit sysvinit-core sysvinit-utils tar unzip vim wget

 RUN apt-get update && \
     apt-get -y install openssh-server && \
@@ -35,5 +35,12 @@ RUN mkdir -p /root/.ssh/ && \
     chmod 600 /root/.ssh/authorized_keys && \
     cat /root/id_rsa.pub >> /root/.ssh/authorized_keys

+COPY sshd-run /etc/sv/service/sshd/run
+RUN chmod +x /etc/sv/service/sshd/run && \
+    ln -sf /etc/sv/service/sshd /etc/service
+
 EXPOSE 22
-CMD exec /usr/sbin/sshd -D
+
+# Start runit process supervisor which will bring up sshd.
+# In our tests we can use runit to supervise more processes, e.g., Mesos.
+CMD runsvdir -P /etc/service /dev/null > /dev/null
diff --git a/flink-jepsen/src/jepsen/flink/db.clj b/flink-jepsen/src/jepsen/flink/db.clj
index 9a725d7149a..becc551e2cf 100644
--- a/flink-jepsen/src/jepsen/flink/db.clj
+++ b/flink-jepsen/src/jepsen/flink/db.clj
@@ -97,7 +97,7 @@
   (if (cu/exists? log-dir) (cu/ls-full log-dir) []))

 (defn flink-db
-  [test]
+  []
   (reify db/DB
     (setup! [_ test node]
       (c/su
@@ -131,7 +131,7 @@
   []
   (let [zk (zk/db deb-zookeeper-package)
         hadoop (hadoop/db hadoop-dist-url)
-        flink (flink-db test)]
+        flink (flink-db)]
     (combined-db [hadoop zk flink])))

 (defn exec-flink!
@@ -192,7 +192,7 @@
   (let [zk (zk/db deb-zookeeper-package)
         hadoop (hadoop/db hadoop-dist-url)
         mesos (mesos/db deb-mesos-package deb-marathon-package)
-        flink (flink-db test)]
+        flink (flink-db)]
     (combined-db [hadoop zk mesos flink])))

 (defn submit-job-with-retry!
@@ -209,24 +209,25 @@
   (let [r (fu/retry (fn []
             (http/post
               (str (mesos/marathon-base-url test) "/v2/apps")
-              {:form-params {:id "flink"
-                             :cmd (str "HADOOP_CLASSPATH=`" hadoop/install-dir "/bin/hadoop classpath` "
-                                       "HADOOP_CONF_DIR=" hadoop/hadoop-conf-dir " "
-                                       install-dir "/bin/mesos-appmaster.sh "
-                                       "-Dmesos.master=" (zookeeper-uri
-                                                           test
-                                                           mesos/zk-namespace) " "
-                                       "-Djobmanager.rpc.address=$(hostname -f) "
-                                       "-Djobmanager.heap.mb=2048 "
-                                       "-Djobmanager.rpc.port=6123 "
-                                       "-Djobmanager.web.port=8081 "
-                                       "-Dmesos.resourcemanager.tasks.mem=2048 "
-                                       "-Dtaskmanager.heap.mb=2048 "
-                                       "-Dtaskmanager.numberOfTaskSlots=2 "
-                                       "-Dmesos.resourcemanager.tasks.cpus=1 "
-                                       "-Drest.bind-address=$(hostname -f) ")
-                             :cpus 1.0
-                             :mem 2048}
+              {:form-params {:id "flink"
+                             :cmd (str "HADOOP_CLASSPATH=`" hadoop/install-dir "/bin/hadoop classpath` "
+
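For readers unfamiliar with runit, the service registration performed by the Dockerfile hunk (copy a run script, make it executable, link the service into the directory that `runsvdir` scans) can be sketched in plain shell. This is a minimal sketch and not code from the PR: `register_service` and the temporary directories are hypothetical names chosen so the example runs without root, and the run-script body is an assumed placeholder, since the contents of `sshd-run` are not part of the diff shown here.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR): mirrors the COPY / chmod / ln -sf steps
# of the Dockerfile hunk, but against caller-supplied directories.
register_service() {
    sv_root=$1     # stands in for /etc/sv/service
    scan_dir=$2    # stands in for /etc/service, the directory runsvdir scans
    name=$3
    run_body=$4    # command for the run script; assumed, not taken from the diff

    mkdir -p "$sv_root/$name" "$scan_dir" || return 1
    printf '#!/bin/sh\n%s\n' "$run_body" > "$sv_root/$name/run" || return 1
    chmod +x "$sv_root/$name/run" || return 1
    ln -sf "$sv_root/$name" "$scan_dir/$name"
}

# Demonstrate in a throwaway directory instead of /etc.
demo_root=$(mktemp -d)
register_service "$demo_root/sv" "$demo_root/service" sshd 'exec /usr/sbin/sshd -D'
ls -l "$demo_root/service"
```

The symlink is what makes the service visible to the supervisor; the service definition itself stays under the `sv` directory.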
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16574452#comment-16574452 ]

ASF GitHub Bot commented on FLINK-10063:

tillrohrmann commented on issue #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496#issuecomment-411672800

Merging this PR.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Jepsen: Automatically restart Mesos Processes
> ---------------------------------------------
>
> Key: FLINK-10063
> URL: https://issues.apache.org/jira/browse/FLINK-10063
> Project: Flink
> Issue Type: Bug
> Components: Tests
> Affects Versions: 1.6.0
> Reporter: Gary Yao
> Assignee: Gary Yao
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.6.1, 1.7.0
>
> Use a process supervisor to automatically restart Mesos processes. This is
> needed because Mesos uses a "fail-fast" approach to error handling, e.g., the
> Mesos master will exit when it discovers it has been partitioned away from
> the Zookeeper quorum. Currently, some of the tests cannot pass because the
> Mesos processes exit.
> *Acceptance Criteria*
> * Running tests with {{--deployment-mode mesos-session}} should not fail due
> to reasons related to the Mesos setup.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573300#comment-16573300 ]

ASF GitHub Bot commented on FLINK-10063:

GJL commented on a change in pull request #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496#discussion_r208608118

## File path: flink-jepsen/docker/Dockerfile-db ##
@@ -35,5 +35,12 @@ RUN mkdir -p /root/.ssh/ && \
     chmod 600 /root/.ssh/authorized_keys && \
     cat /root/id_rsa.pub >> /root/.ssh/authorized_keys

+COPY sshd-run /etc/sv/service/sshd/run
+RUN chmod +x /etc/sv/service/sshd/run && \
+    ln -sf /etc/sv/service/sshd /etc/service
+
 EXPOSE 22
-CMD exec /usr/sbin/sshd -D
+
+# Start runit process supervisor which will bring up sshd.
+# In our tests we can use runit to supervise more processes, e.g., Mesos.
+CMD runsvdir -P /etc/service /dev/null > /dev/null

Review comment:
Yes, it is needed. The log argument only redirects standard error:
> If the log argument is given to runsvdir, all output to standard error is redirected to this log
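The distinction drawn in this reply, that diverting standard error leaves standard output untouched, can be checked with any command that writes to both streams. The following is a minimal sketch in plain shell; `emit_both` is a hypothetical stand-in for a process that prints to both streams and is not part of runit or of this PR.

```shell
#!/bin/sh
# Hypothetical stand-in for a process that writes to both output streams.
emit_both() {
    echo "to stdout"
    echo "to stderr" >&2
}

# Diverting only standard error (analogous to what the quoted documentation
# says the log argument does) still lets standard output through.
only_stderr_diverted=$(emit_both 2>/dev/null)

# Redirecting standard output as well silences the command completely.
both_diverted=$(emit_both > /dev/null 2>/dev/null)

echo "stderr diverted only:  '$only_stderr_diverted'"
echo "stdout redirected too: '$both_diverted'"
```

Under this reading, the trailing `> /dev/null` in the `CMD` line covers the stream that the log argument does not.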
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572756#comment-16572756 ]

ASF GitHub Bot commented on FLINK-10063:

GJL commented on a change in pull request #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496#discussion_r208471471

## File path: flink-jepsen/src/jepsen/flink/mesos.clj ##
@@ -24,11 +24,35 @@
             [jepsen.os.debian :as debian]
             [jepsen.flink.zookeeper :refer [zookeeper-uri]]))

+;;; runit process supervisor (http://smarden.org/runit/)
+;;;
+;;; We use runit to supervise Mesos processes because Mesos uses a "fail-fast" approach to
+;;; error handling, e.g., the Mesos master will exit when it discovers it has been partitioned away
+;;; from the Zookeeper quorum.
+
+(def runit-version "2.1.2-3")
+
+(defn create-supervised-service!
+  "Registers a service with the process supervisor and starts it."
+  [service-name cmd]
+  (let [service-dir (str "/etc/sv/" service-name)
+        run-script (str service-dir "/run")]
+    (c/su
+      (c/exec :mkdir :-p service-dir)
+      (c/exec :echo (clojure.string/join "\n" ["#!/bin/sh" cmd]) :> run-script)

Review comment:
I'll fix this.
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569829#comment-16569829 ]

ASF GitHub Bot commented on FLINK-10063:

cewood commented on a change in pull request #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496#discussion_r207793002

## File path: flink-jepsen/src/jepsen/flink/mesos.clj ##
@@ -24,11 +24,35 @@
             [jepsen.os.debian :as debian]
             [jepsen.flink.zookeeper :refer [zookeeper-uri]]))

+;;; runit process supervisor (http://smarden.org/runit/)
+;;;
+;;; We use runit to supervise Mesos processes because Mesos uses a "fail-fast" approach to
+;;; error handling, e.g., the Mesos master will exit when it discovers it has been partitioned away
+;;; from the Zookeeper quorum.
+
+(def runit-version "2.1.2-3")
+
+(defn create-supervised-service!
+  "Registers a service with the process supervisor and starts it."
+  [service-name cmd]
+  (let [service-dir (str "/etc/sv/" service-name)
+        run-script (str service-dir "/run")]
+    (c/su
+      (c/exec :mkdir :-p service-dir)
+      (c/exec :echo (clojure.string/join "\n" ["#!/bin/sh" cmd]) :> run-script)

Review comment:
It's generally considered best practice for runit units to include an `exec 2>&1` line, and to prefix your command with `exec ...`. So I'd suggest updating this line accordingly; `["#!/bin/sh" "exec 2>&1" (str "exec " cmd)]`
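The run script that this suggestion would produce can be illustrated with a short shell sketch. The helper name `write_run_script` and the sample command are hypothetical and chosen only for illustration; the PR itself generates the script from Clojure. In the generated script, `exec 2>&1` merges standard error into standard output, and the final `exec` replaces the shell with the service process, so the supervisor manages the service itself rather than an intermediate shell.

```shell
#!/bin/sh
# Hypothetical helper: writes a runit-style run script for the given command,
# following the three-line layout proposed in the review comment.
write_run_script() {
    run_script=$1
    cmd=$2
    printf '%s\n' '#!/bin/sh' 'exec 2>&1' "exec $cmd" > "$run_script" &&
        chmod +x "$run_script"
}

script_path=$(mktemp)
# 'sleep 1' is a placeholder command, not an actual Mesos invocation.
write_run_script "$script_path" 'sleep 1'
cat "$script_path"
```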
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569828#comment-16569828 ]

ASF GitHub Bot commented on FLINK-10063:

cewood commented on a change in pull request #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496#discussion_r207790888

## File path: flink-jepsen/docker/Dockerfile-db ##
@@ -35,5 +35,12 @@ RUN mkdir -p /root/.ssh/ && \
     chmod 600 /root/.ssh/authorized_keys && \
     cat /root/id_rsa.pub >> /root/.ssh/authorized_keys

+COPY sshd-run /etc/sv/service/sshd/run
+RUN chmod +x /etc/sv/service/sshd/run && \
+    ln -sf /etc/sv/service/sshd /etc/service
+
 EXPOSE 22
-CMD exec /usr/sbin/sshd -D
+
+# Start runit process supervisor which will bring up sshd.
+# In our tests we can use runit to supervise more processes, e.g., Mesos.
+CMD runsvdir -P /etc/service /dev/null > /dev/null

Review comment:
Is the extra `> /dev/null` actually required? I would have expected that the log argument to `/dev/null` alone would have sufficed, since it also redirects standard error according to the docs.
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569830#comment-16569830 ]

ASF GitHub Bot commented on FLINK-10063:

cewood commented on issue #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496#issuecomment-410615164

And nice work on this, it's super tedious doing all this setup and tear down stuff, nice job :100:
[jira] [Commented] (FLINK-10063) Jepsen: Automatically restart Mesos Processes
[ https://issues.apache.org/jira/browse/FLINK-10063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569458#comment-16569458 ]

ASF GitHub Bot commented on FLINK-10063:

GJL opened a new pull request #6496: [FLINK-10063][tests] Use runit to supervise mesos processes.
URL: https://github.com/apache/flink/pull/6496

## What is the purpose of the change

*Use a process supervisor to automatically restart Mesos processes. This is needed because Mesos uses a "fail-fast" approach to error handling, e.g., the Mesos master will exit when it discovers it has been partitioned away from the Zookeeper quorum. Currently, some of the tests cannot pass because the Mesos processes exit.*

cc: @igalshilman @cewood @tillrohrmann

## Brief change log

- *Use runit to supervise Mesos processes.*
- *Make docker setup work.*

## Verifying this change

This change added tests and can be verified as follows:

- *Ran Mesos tests on docker.*

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (**yes** (in test code) / no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
- The serializers: (yes / **no** / don't know)
- The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
- The S3 file system connector: (yes / **no** / don't know)

## Documentation

- Does this pull request introduce a new feature? (yes / **no**)
- If yes, how is the feature documented? (**not applicable** / docs / JavaDocs / not documented)