[jira] [Updated] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine
[ https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Su updated SUBMARINE-457:
-------------------------------
    Target Version: 0.6.0  (was: 0.5.0)

> Run TF MNIST example using Docker Container failed in mini-submarine
> --------------------------------------------------------------------
>
>                 Key: SUBMARINE-457
>                 URL: https://issues.apache.org/jira/browse/SUBMARINE-457
>             Project: Apache Submarine
>          Issue Type: Bug
>          Components: Mini Submarine
>    Affects Versions: 0.4.0
>            Reporter: Ryan Lo
>            Assignee: Ryan Lo
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> I tried to run mnist_distributed.py in a Docker container, and the launch failed.
> The following is my command. The Docker image tf-1.13.1-cpu-base:0.0.1
> was built in advance in mini-submarine.
> {code:java}
> java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
>  --framework tensorflow \
>  --docker_image tf-1.13.1-cpu-base:0.0.1 \
>  --input_path "" \
>  --num_ps 1 \
>  --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
>  --ps_resources memory=1G,vcores=1 \
>  --num_workers 2 \
>  --worker_resources memory=1G,vcores=1 \
>  --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
>  --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>  --env HADOOP_HOME=/hadoop-current \
>  --env HADOOP_YARN_HOME=/hadoop-current \
>  --env HADOOP_COMMON_HOME=hadoop-current \
>  --env HADOOP_HDFS_HOME=/hadoop-current \
>  --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
>  --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
> {code}
> The following is a partial NodeManager log.
> {code:java}
> 2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1585136148243_0006_01_01 transitioned from SCHEDULED to RUNNING
> 2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1585136148243_0006_01_01
> 2020-03-25 13:48:32,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: setting hostname in container to: ctr-1585136148243-0006-01-01
> 2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Docker inspect output for container_1585136148243_0006_01_01:
> ,ctr-1585136148243-0006-01-012020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1585136148243_0006_01_01's ip = , and hostname = ctr-1585136148243-0006-01-01
> 2020-03-25 13:48:34,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1585136148243_0006_01_01 since CPU usage is not yet available.
> 2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
> Docker container exit code was not zero: 255
> Unable to read from docker logs(ferror, feof): 0 1Stdout: main : command provided 4
> main : run as user is yarn
> main : requested yarn user is yarn
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Launching docker container...
> Inspecting docker container...
> Writing to cgroup task files...
> Writing pid file...
> Writing to tmp file /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_01/container_1585136148243_0006_01_01.pid.tmp
> container_1585136148243_0006_01_01
> Waiting for docker container to finish...
> Removing docker container post-exit...
> {code}
> The following is the AM stdout.log.
> {code:java}
>
> LogType:amstdout.log
> LogLastModifiedTime:Wed Mar 25 13:02:27 + 2020
> LogLength:6468
> LogContents:
> [WARN ] 2020-03-25 13:02:25,503
[jira] [Updated] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine
[ https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wanqiang Ji updated SUBMARINE-457:
----------------------------------
    Target Version: 0.5.0  (was: 0.4.0)
[jira] [Updated] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine
[ https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SUBMARINE-457:
-------------------------------------
    Labels: pull-request-available  (was: )