[ https://issues.apache.org/jira/browse/SUBMARINE-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SUBMARINE-457:
-------------------------------------
    Labels: pull-request-available  (was: )

Run TF MNIST example using Docker Container failed in mini-submarine
---------------------------------------------------------------------

                 Key: SUBMARINE-457
                 URL: https://issues.apache.org/jira/browse/SUBMARINE-457
             Project: Apache Submarine
          Issue Type: Bug
          Components: Mini Submarine
    Affects Versions: 0.4.0
            Reporter: Ryan Lo
            Assignee: Ryan Lo
            Priority: Major
              Labels: pull-request-available

I tried to run mnist_distributed.py in a Docker container, and the launch failed. The following is my command; the Docker image tf-1.13.1-cpu-base:0.0.1 was built in advance in mini-submarine.

{code:java}
java -cp $(hadoop classpath --glob):/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar \
  org.apache.submarine.client.cli.Cli job run --name tf-job-001 \
  --framework tensorflow \
  --docker_image tf-1.13.1-cpu-base:0.0.1 \
  --input_path "" \
  --num_ps 1 \
  --ps_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
  --ps_resources memory=1G,vcores=1 \
  --num_workers 2 \
  --worker_resources memory=1G,vcores=1 \
  --worker_launch_cmd "export CLASSPATH=\$(/hadoop-current/bin/hadoop classpath --glob) && python mnist_distributed.py --steps 2 --data_dir /tmp/data --working_dir /tmp/mode" \
  --env JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
  --env DOCKER_HADOOP_HDFS_HOME=/hadoop-current \
  --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 \
  --env HADOOP_HOME=/hadoop-current \
  --env HADOOP_YARN_HOME=/hadoop-current \
  --env HADOOP_COMMON_HOME=hadoop-current \
  --env HADOOP_HDFS_HOME=/hadoop-current \
  --env HADOOP_CONF_DIR=/hadoop-current/etc/hadoop \
  --conf tony.containers.resources=/opt/submarine-current/submarine-all-0.3.0-SNAPSHOT-hadoop-3.2.jar,/home/yarn/submarine/mnist_distributed.py
{code}

The following is a partial NodeManager log.

{code:java}
2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1585136148243_0006_01_000001 transitioned from SCHEDULED to RUNNING
2020-03-25 13:48:32,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Starting resource-monitoring for container_1585136148243_0006_01_000001
2020-03-25 13:48:32,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: setting hostname in container to: ctr-1585136148243-0006-01-000001
2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.runtime.DockerLinuxContainerRuntime: Docker inspect output for container_1585136148243_0006_01_000001: ,ctr-1585136148243-0006-01-000001
2020-03-25 13:48:34,605 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: container_1585136148243_0006_01_000001's ip = , and hostname = ctr-1585136148243-0006-01-000001
2020-03-25 13:48:34,613 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Skipping monitoring container container_1585136148243_0006_01_000001 since CPU usage is not yet available.
2020-03-25 13:48:36,234 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.linux.privileged.PrivilegedOperationExecutor: Shell execution returned exit code: 255. Privileged Execution Operation Stderr:
Docker container exit code was not zero: 255
Unable to read from docker logs(ferror, feof): 0 1
Stdout: main : command provided 4
main : run as user is yarn
main : requested yarn user is yarn
Creating script paths...
Creating local dirs...
Getting exit code file...
Changing effective user to root...
Launching docker container...
Inspecting docker container...
Writing to cgroup task files...
Writing pid file...
Writing to tmp file /var/lib/hadoop-yarn/cache/yarn/nm-local-dir/nmPrivate/application_1585136148243_0006/container_1585136148243_0006_01_000001/container_1585136148243_0006_01_000001.pid.tmp
container_1585136148243_0006_01_000001
Waiting for docker container to finish...
Removing docker container post-exit...
{code}

The following is the AM stdout.log.

{code:java}
========================================================================
LogType:amstdout.log
LogLastModifiedTime:Wed Mar 25 13:02:27 +0000 2020
LogLength:6468
LogContents:
[WARN ] 2020-03-25 13:02:25,503 method:org.apache.hadoop.util.NativeCodeLoader.<clinit>(NativeCodeLoader.java:60)
Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[ERROR] 2020-03-25 13:02:25,613 method:com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:217)
Failed to create FileSystem object
org.apache.hadoop.security.KerberosAuthException: failure to login: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
    at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
    at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
    at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1847)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
    at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
Caused by: javax.security.auth.login.LoginException: java.lang.NullPointerException: invalid null input: name
    at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:71)
    at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:133)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    at org.apache.hadoop.security.UserGroupInformation.createLoginUser(UserGroupInformation.java:710)
    at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:660)
    at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:571)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3487)
    at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:3477)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3319)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:227)
    at com.linkedin.tony.ApplicationMaster.init(ApplicationMaster.java:215)
    at com.linkedin.tony.ApplicationMaster.run(ApplicationMaster.java:305)
    at com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:293)
    at javax.security.auth.login.LoginContext.invoke(LoginContext.java:856)
    at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
    at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
    at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
    at org.apache.hadoop.security.UserGroupInformation$HadoopLoginContext.login(UserGroupInformation.java:1926)
    at org.apache.hadoop.security.UserGroupInformation.doSubjectLogin(UserGroupInformation.java:1837)
    ... 11 more
[INFO ] 2020-03-25 13:02:25,618 method:com.linkedin.tony.ApplicationMaster.main(ApplicationMaster.java:298)
Application Master failed. Exiting
End of LogType:amstdout.log
*****************************************************************************
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
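[Editor's note] The `java.lang.NullPointerException: invalid null input: name` thrown from `UnixPrincipal.<init>` typically means the OS-level login inside the container could not resolve the process UID to a user name, i.e. the UID that YARN launches the Docker container with has no entry in the image's /etc/passwd. This is only a hypothesis about the reported failure, not a confirmed diagnosis. A minimal Python sketch of the UID-to-name lookup that Hadoop's UnixLoginModule performs (the function name here is illustrative, not Submarine or Hadoop code):

```python
import os
import pwd


def login_name_for_uid(uid):
    """Loosely mimic the passwd lookup behind Hadoop's UnixLoginModule.

    If the UID has no /etc/passwd entry (a common state when a Docker
    image lacks the 'yarn' user and YARN starts the container with the
    host UID), the lookup fails; on the Java side that surfaces as
    UnixPrincipal's 'invalid null input: name' NullPointerException.
    """
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None  # no passwd entry for this UID -> null user name


# UID 0 is present in essentially every image's passwd database,
# while an arbitrary very high UID usually is not.
print(login_name_for_uid(0))
print(login_name_for_uid(2147483640))
```

If the lookup for the container's UID returns nothing, adding the corresponding user to the Docker image (or enabling YARN's Docker user remapping) would be the direction to investigate.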