Hi Safder,

It seems you are not subscribed to the twill-dev mailing list. Hitting reply on a mail sent to dev@ replies to dev@ only. You can subscribe by sending an email to [email protected].

As for the problem you are seeing: the NodeManager is killing the container because its virtual memory usage is too high (not heap memory). You can fix that by setting a higher ratio for the "yarn.nodemanager.vmem-pmem-ratio" property (say 5.1), or simply turn the check off if that is acceptable (set "yarn.nodemanager.vmem-check-enabled" to false). Both settings go in yarn-site.xml.

Terence

On Fri, Mar 21, 2014 at 11:48 AM, safder <[email protected]> wrote:
> Hi Terrence,
>
> Don't know exactly why I don't get your emails.
>
> After looking at your reply, I was going over all the logs. I don't see much.
> The first attempt for application Master just fails with exit code 143
>
> Here are some logs for it.
>
>
> 014-03-21 14:35:24,768 WARN monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(435)) - Container
> [pid=5409,containerID=container_1395355306546_0023_01_000001] is running
> beyond virtual memory limits. Current usage: 173.9 MB of 512 MB physical
> memory used; 1.1 GB of 1.0 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1395355306546_0023_01_000001 :
> |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
> |- 5419 5409 5409 5409 (java) 533 27 1123053568 44208 java
> -Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023
> -Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m
> org.apache.twill.launcher.TwillLauncher appMaster.jar
> org.apache.twill.internal.appmaster.ApplicationMasterMain false
> |- 5409 25677 5409 5409 (bash) 0 0 108650496 307 /bin/bash -c java
> -Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023
> -Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m
> org.apache.twill.launcher.TwillLauncher appMaster.jar
> org.apache.twill.internal.appmaster.ApplicationMasterMain false
> 1>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stdout
>
> 2>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stderr
>
> 2014-03-21 14:35:24,768 INFO monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:run(445)) - Removed ProcessTree with root 5409
> 2014-03-21 14:35:24,768 INFO container.Container
> (ContainerImpl.java:handle(871)) - Container
> container_1395355306546_0023_01_000001 transitioned from RUNNING to KILLING
> 2014-03-21 14:35:24,769 INFO launcher.ContainerLaunch
> (ContainerLaunch.java:cleanupContainer(341)) - Cleaning up container
> container_1395355306546_0023_01_000001
> 2014-03-21 14:35:24,773 WARN nodemanager.DefaultContainerExecutor
> (DefaultContainerExecutor.java:launchContainer(207)) - Exit code from
> container container_1395355306546_0023_01_000001 is : 143
> 2014-03-21 14:35:24,796 INFO container.Container
> (ContainerImpl.java:handle(871)) - Container
> container_1395355306546_0023_01_000001 transitioned from KILLING to
> CONTAINER_CLEANEDUP_AFTER_KILL
>
>
>
> I don't see much in either the application logs or the logs generated on the
> client. Could you also direct me if I should be looking at any particular log?
>
> I am looking at
> 1: resourcemanger logs (says first attempt exited with exit code 143 as above)
> 2: nodemanger logs (says first attempt exited with exit code 143 as above)
> 3: twill client logs (just goes from starting to killed)
> 4: yarn application logs (yarn logs -applicationId) (same as above. Says
> requesting 1 container then goes to stopping)
> 4:35:23.324 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
> Starting kafka server
> 14:35:23.684 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
> Kafka server started
> 14:35:23.694 [ServiceDelegate] INFO o.a.t.internal.ZKServiceDecorator -
> Running: 3b4a1312-d7dd-42d1-9733-0c8a87155a52
> 14:35:23.695 [main] INFO o.apache.twill.internal.ServiceMain - Service
> org.apache.twill.internal.appmaster.ApplicationMasterService@4e4d2176 started.
> 14:35:23.752 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
> Request 1 container with capability <memory:512, vCores:1>
> 14:35:24.777 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
> Stop application master with spec:
> {"name":"DistributedShell","runnables":{"DistributedShell":{"name":"DistributedShell","runnable":{"classname":"net.skytree.yarn.app.DistributedShell","name":"DistributedShell","arguments":{"cmds":"pwd;ls
>
> -al"}},"resources":{"cores":1,"memorySize":512,"instances":1,"uplink":-1,"downlink":-1,"hosts":[],"racks":[]},"files":[]}},"orders":[{"names":["DistributedShell"],"type":"STARTED"}],"handler":{"classname":"org.apache.twill.internal.LogOnlyEventHandler","configs":{}}}
> 14:35:24.778 [Thread-4] INFO o.a.t.internal.ZKServiceDecorator - Stopping:
> 3b4a1312-d7dd-42d1-9733-0c8a87155a52
> 14:35:24.782 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
> Stopping application master tracker server
> Cleanup directory tmp/twill.launcher-1395426920855-0
>
>
>
> On Mar 20, 2014, at 6:06 PM, safder <[email protected]> wrote:
>
>> Hi Guys,
>>
>> Needed help with Twill. I am trying to run a simple Distributed Shell
>> application on a single node cluster. When I run it, in the standard out
>> logs I get a ton of kafka related errors. I tee'ed the logs, but each run
>> was making 25MBs of it. The only main exception I see is this
>>
>>
>> 20:57:42.382 [YarnTwillRunnerService
>> STARTING-SendThread(localhost.localdomain:2181)] DEBUG
>> org.apache.zookeeper.ClientCnxn - Reading
>> reply sessionid:0x144e1a859d40052, packet::
>> clientPath:/MY_BASE_APP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>> serverPath:/MY_BASE_A
>> PP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state finished:false header:: 15,4
>> replyHeader:: 15,652,0 request:: '/MY_BASE_APP/c47fd263-
>> a5c1-48ef-8c76-a91cf8009431/state,T response::
>> #7b227374617465223a2253544f5050494e47227d,s{627,652,1395363459875,1395363462375,3,0,0
>> ,0,20,0,627}
>> 20:57:42.639 [Kafka-Consumer-log-0] INFO
>> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on
>> TopicPartition{to
>> pic=log, partition=0}.
>> java.net.ConnectException: Connection refused
>> at sun.nio.ch.Net.connect0(Native Method) ~[na:1.7.0_45]
>> at sun.nio.ch.Net.connect(Net.java:465) ~[na:1.7.0_45]
>> at sun.nio.ch.Net.connect(Net.java:457) ~[na:1.7.0_45]
>> at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:666)
>> ~[na:1.7.0_45]
>> at kafka.network.BlockingChannel.connect(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.consumer.SimpleConsumer.connect(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.consumer.SimpleConsumer.reconnect(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.consumer.SimpleConsumer.liftedTree1$1(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at
>> kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(Unknown
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Unknown
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.metrics.KafkaTimer.time(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.metrics.KafkaTimer.time(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.consumer.SimpleConsumer.fetch(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at kafka.javaapi.consumer.SimpleConsumer.fetch(Unknown Source)
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>> at
>> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.fetchMessages(SimpleKafkaConsumer.java:419)
>> ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>> at
>> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.run(SimpleKafkaConsumer.java:355)
>> ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>> 20:57:42.642 [Kafka-Consumer-log-0] INFO
>> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on
>> TopicPartition{topic=log, partition=0}.
>> java.net.ConnectException: Connection refused
>>
>>
>> I also attached the application logs on the yarn end. That is showing a
>> different exception.
>>
>> [main] ERROR o.apache.twill.internal.ServiceMain - Exception when starting
>> service
>> org.apache.twill.internal.appmaster.ApplicationMasterService@1d16eaf2.
>> java.util.concurrent.ExecutionException:
>> java.util.concurrent.ExecutionException:
>> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
>> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>> at
>> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:294)
>> ~[guava-13.0.1.jar:na]
>> at
>> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:281)
>> ~[guava-13.0.1.jar:na]
>> at
>> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>> ~[guava-13.0.1.jar:na]
>> at org.apache.twill.internal.ServiceMain.doMain(ServiceMain.java:80)
>> ~[twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>> at
>> org.apache.twill.internal.appmaster.ApplicationMasterMain.main(ApplicationMasterMain.java:69)
>> [twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> ~[na:1.7.0_45]
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> ~[na:1.7.0_45]
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> ~[na:1.7.0_45]
>> at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_45]
>> at
>> org.apache.twill.launcher.TwillLauncher.main(TwillLauncher.java:86)
>> [launcher.71cb0f5e-fc14-43e7-8149-71e57defd89f.jar:na]
>> java.util.concurrent.ExecutionException:
>> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
>> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>>
>>
>>
>> Please help!
>>
>> Safder
>>
>>
>> <yarn.log>
>
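
For reference, the two yarn-site.xml settings suggested in the reply at the top of this thread could be written roughly as below. This is only a sketch, not a tested configuration: normally just one of the two properties is needed, 5.1 is simply the example value from the reply, and the NodeManager most likely has to be restarted to pick up the change. With the stock ratio of 2.1, a 512 MB container is allowed roughly 1 GB of virtual memory, which matches the limit reported in the quoted NodeManager log.

```xml
<!-- yarn-site.xml (sketch; normally only one of the two properties is needed) -->
<configuration>
  <!-- Option 1: allow more virtual memory per unit of physical memory (default is 2.1) -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5.1</value>
  </property>

  <!-- Option 2: disable the virtual memory check entirely -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
```

Raising the ratio keeps the check in place as a safety net, while disabling it removes the limit altogether, so the first option is usually the more conservative choice.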
