[jira] [Updated] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine

2020-09-21 Thread Kevin Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Su updated SUBMARINE-457:
---
Target Version: 0.6.0  (was: 0.5.0)

> Run TF MNIST example using Docker Container failed in mini-submarine 
> -
>
> Key: SUBMARINE-457
> URL: https://issues.apache.org/jira/browse/SUBMARINE-457
> Project: Apache Submarine
>  Issue Type: Bug
>  Components: Mini Submarine
>Affects Versions: 0.4.0
>Reporter: Ryan Lo
>Assignee: Ryan Lo
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I tried to run mnist_distributed.py in a Docker container, and the launch failed.
> The following is my command. The Docker image tf-1.13.1-cpu-base:0.0.1 
> was built in advance in mini-submarine.
> {code:java}
> java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
>  --framework tensorflow \
>  --docker_image tf-1.13.1-cpu-base:0.0.1 \
>  --input_path "" \
>  --num_ps 1 \
>  --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
>  --ps_resources memory=1G,vcores=1 \
>  --num_workers 2 \
>  --worker_resources memory=1G,vcores=1 \
>  --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
>  --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
>  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
>  --env HADOOP_HOME=/hadoop-current \
>  --env HADOOP_YARN_HOME=/hadoop-current \
>  --env HADOOP_COMMON_HOME=hadoop-current \
>  --env HADOOP_HDFS_HOME=/hadoop-current \
>  --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
>  --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
> {code}
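
In the command as quoted, every Hadoop `--env` value is an absolute path except `HADOOP_COMMON_HOME=hadoop-current`, which has no leading slash. The report does not say whether this is related to the failure. The following is only a minimal sketch of how such entries could be flagged before submitting a job; the helper name `relative_path_envs` and the `_HOME`/`_DIR` naming rule are assumptions made for illustration and are not part of Submarine.

```python
import shlex
from typing import Dict


def relative_path_envs(command: str) -> Dict[str, str]:
    """Return --env entries named *_HOME or *_DIR whose value is not an absolute path.

    Hypothetical helper for illustration only; not a Submarine API.
    """
    tokens = shlex.split(command)
    flagged = {}
    # Walk adjacent token pairs so each "--env" is matched with its KEY=VALUE.
    for flag, value in zip(tokens, tokens[1:]):
        if flag != "--env" or "=" not in value:
            continue
        name, _, path = value.partition("=")
        if name.endswith(("_HOME", "_DIR")) and not path.startswith("/"):
            flagged[name] = path
    return flagged


# Three of the --env flags from the command in the report.
sample = (
    "--env HADOOP_HOME=/hadoop-current "
    "--env HADOOP_COMMON_HOME=hadoop-current "
    "--env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop"
)
print(relative_path_envs(sample))  # {'HADOOP_COMMON_HOME': 'hadoop-current'}
```
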
> The following is partial NodeManager log.
> {code:java}
> 2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1585136148243_0006_01_01 transitioned from SCHEDULED to RUNNING
> 2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1585136148243_0006_01_01
> 2020-03-25 13:48:32,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: setting hostname in container to: ctr-1585136148243-0006-01-01
> 2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Docker inspect output for container_1585136148243_0006_01_01: ,ctr-1585136148243-0006-01-01
> 2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1585136148243_0006_01_01's ip = , and hostname = ctr-1585136148243-0006-01-01
> 2020-03-25 13:48:34,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1585136148243_0006_01_01 since CPU usage is not yet available.
> 2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
> Docker container exit code was not zero: 255
> Unable to read from docker logs(ferror, feof): 0 1
> Stdout: main : command provided 4
> main : run as user is yarn
> main : requested yarn user is yarn
> Creating script paths...
> Creating local dirs...
> Getting exit code file...
> Changing effective user to root...
> Launching docker container...
> Inspecting docker container...
> Writing to cgroup task files...
> Writing pid file...
> Writing to tmp file /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_01/container_1585136148243_0006_01_01.pid.tmp
> container_1585136148243_0006_01_01
> Waiting for docker container to finish...
> Removing docker container post-exit...
> {code}
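
The NodeManager excerpt above reports the failure only as exit code 255, once from the privileged executor and once from the Docker container. As a rough sketch of how those codes could be pulled out of a longer log, assuming the two message formats shown in the excerpt, something like the following might help; `find_exit_codes` is a hypothetical name, not a Hadoop or Submarine function.

```python
import re

# Message formats copied from the NodeManager excerpt in the report.
EXIT_CODE_PATTERNS = (
    re.compile(r"Shell execution returned exit code: (\d+)"),
    re.compile(r"Docker container exit code was not zero: (\d+)"),
)


def find_exit_codes(log_text):
    """Return every exit code reported in the log, in order of appearance."""
    codes = []
    for line in log_text.splitlines():
        for pattern in EXIT_CODE_PATTERNS:
            match = pattern.search(line)
            if match:
                codes.append(int(match.group(1)))
    return codes


sample_log = """\
2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
Docker container exit code was not zero: 255
Launching docker container...
"""
print(find_exit_codes(sample_log))  # [255, 255]
```
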
> The following is AM stdout.log.
> {code:java}
> 
> LogType:amstdout.log
> LogLastModifiedTime:Wed Mar 25 13:02:27 + 2020
> LogLength:6468
> LogContents:
> [WARN ] 2020-03-25 13:02:25,503 
> 

[jira] [Updated] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine

2020-06-25 Thread Wanqiang Ji (Jira)


 [ 
https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wanqiang Ji updated SUBMARINE-457:
--
Target Version: 0.5.0  (was: 0.4.0)


[jira] [Updated] (SUBMARINE-457) Run TF MNIST example using Docker Container failed in mini-submarine

2020-04-14 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SUBMARINE-457:
-
Labels: pull-request-available  (was: )
