Thank you Terence!

That was very helpful!

This is good stuff… I'm hoping I will be able to contribute to Twill later on.

Safder

On Mar 21, 2014, at 1:18 PM, Terence Yim <[email protected]> wrote:

> Hi Safder,
> 
> It seems you are not subscribed to the twill-dev mailing list.
> Hitting reply on a mail sent to dev@ replies to dev@ only. You
> can subscribe by sending an email to
> [email protected].
> 
> The problem you see is caused by the NodeManager killing the
> container because its virtual memory usage is too high (not its heap
> memory). You can fix that by setting a higher ratio for the
> "yarn.nodemanager.vmem-pmem-ratio" property (say 5.1), or simply
> turn the check off if desired (set
> "yarn.nodemanager.vmem-check-enabled" to false). Both settings go
> in yarn-site.xml.
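
The two settings described above could look roughly like the fragment below. This is a minimal sketch, not a tested configuration: the property names and the example value 5.1 come from the reply, while the surrounding layout and the choice to show both options side by side are assumptions made for illustration. For context, the stock default for the ratio is 2.1, which is consistent with the roughly 1.0 GB virtual memory limit reported for a 512 MB container in the NodeManager log.

```xml
<!-- yarn-site.xml (fragment): sketch of the two options; normally only one is needed -->
<configuration>
  <!-- Option 1: allow more virtual memory per unit of physical memory.
       5.1 is the example value from the reply, not a tuned recommendation. -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5.1</value>
  </property>

  <!-- Option 2: disable the virtual memory check entirely. -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
```

The NodeManager typically needs a restart before changes to yarn-site.xml take effect.
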
> 
> Terence
> 
> On Fri, Mar 21, 2014 at 11:48 AM, safder <[email protected]> wrote:
>> Hi Terence,
>> 
>> Don't know exactly why I don't get your emails.
>> 
>> After looking at your reply, I went over all the logs. I don't see 
>> much. The first attempt for the Application Master just fails with exit 
>> code 143.
>> 
>> Here are some logs for it.
>> 
>> 
>> 2014-03-21 14:35:24,768 WARN  monitor.ContainersMonitorImpl 
>> (ContainersMonitorImpl.java:run(435)) - Container 
>> [pid=5409,containerID=container_1395355306546_0023_01_000001] is running 
>> beyond virtual memory limits. Current usage: 173.9 MB of 512 MB physical 
>> memory used; 1.1 GB of 1.0 GB virtual memory used. Killing container.
>> Dump of the process-tree for container_1395355306546_0023_01_000001 :
>>        |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>>        |- 5419 5409 5409 5409 (java) 533 27 1123053568 44208 java 
>> -Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023 
>> -Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m 
>> org.apache.twill.launcher.TwillLauncher appMaster.jar 
>> org.apache.twill.internal.appmaster.ApplicationMasterMain false
>>        |- 5409 25677 5409 5409 (bash) 0 0 108650496 307 /bin/bash -c java 
>> -Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023 
>> -Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m  
>> org.apache.twill.launcher.TwillLauncher appMaster.jar 
>> org.apache.twill.internal.appmaster.ApplicationMasterMain false 
>> 1>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stdout
>>  
>> 2>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stderr
>> 
>> 2014-03-21 14:35:24,768 INFO  monitor.ContainersMonitorImpl 
>> (ContainersMonitorImpl.java:run(445)) - Removed ProcessTree with root 5409
>> 2014-03-21 14:35:24,768 INFO  container.Container 
>> (ContainerImpl.java:handle(871)) - Container 
>> container_1395355306546_0023_01_000001 transitioned from RUNNING to KILLING
>> 2014-03-21 14:35:24,769 INFO  launcher.ContainerLaunch 
>> (ContainerLaunch.java:cleanupContainer(341)) - Cleaning up container 
>> container_1395355306546_0023_01_000001
>> 2014-03-21 14:35:24,773 WARN  nodemanager.DefaultContainerExecutor 
>> (DefaultContainerExecutor.java:launchContainer(207)) - Exit code from 
>> container container_1395355306546_0023_01_000001 is : 143
>> 2014-03-21 14:35:24,796 INFO  container.Container 
>> (ContainerImpl.java:handle(871)) - Container 
>> container_1395355306546_0023_01_000001 transitioned from KILLING to 
>> CONTAINER_CLEANEDUP_AFTER_KILL
>> 
>> 
>> 
>> I don't see much in either the application logs or the logs generated on the 
>> client. Could you also tell me whether I should be looking at any particular 
>> log?
>> 
>> I am looking at:
>> 1: resourcemanager logs (say the first attempt exited with exit code 143, as 
>> above)
>> 2: nodemanager logs (say the first attempt exited with exit code 143, as above)
>> 3: twill client logs (just go from starting to killed)
>> 4: yarn application logs (yarn logs -applicationId) (same as above; they say 
>> requesting 1 container, then go to stopping)
>> 
>> 14:35:23.324 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
>> Starting kafka server
>> 14:35:23.684 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
>> Kafka server started
>> 14:35:23.694 [ServiceDelegate] INFO  o.a.t.internal.ZKServiceDecorator - 
>> Running: 3b4a1312-d7dd-42d1-9733-0c8a87155a52
>> 14:35:23.695 [main] INFO  o.apache.twill.internal.ServiceMain - Service 
>> org.apache.twill.internal.appmaster.ApplicationMasterService@4e4d2176 
>> started.
>> 14:35:23.752 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
>> Request 1 container with capability <memory:512, vCores:1>
>> 14:35:24.777 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
>> Stop application master with spec: 
>> {"name":"DistributedShell","runnables":{"DistributedShell":{"name":"DistributedShell","runnable":{"classname":"net.skytree.yarn.app.DistributedShell","name":"DistributedShell","arguments":{"cmds":"pwd;ls -al"}},"resources":{"cores":1,"memorySize":512,"instances":1,"uplink":-1,"downlink":-1,"hosts":[],"racks":[]},"files":[]}},"orders":[{"names":["DistributedShell"],"type":"STARTED"}],"handler":{"classname":"org.apache.twill.internal.LogOnlyEventHandler","configs":{}}}
>> 14:35:24.778 [Thread-4] INFO  o.a.t.internal.ZKServiceDecorator - Stopping: 
>> 3b4a1312-d7dd-42d1-9733-0c8a87155a52
>> 14:35:24.782 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
>> Stopping application master tracker server
>> Cleanup directory tmp/twill.launcher-1395426920855-0
>> 
>> 
>> 
>> On Mar 20, 2014, at 6:06 PM, safder <[email protected]> wrote:
>> 
>>> Hi Guys,
>>> 
>>> I need help with Twill. I am trying to run a simple Distributed Shell 
>>> application on a single-node cluster. When I run it, I get a ton of 
>>> Kafka-related errors in the standard output logs. I tee'ed the logs, but 
>>> each run produced 25 MB of them. The only significant exception I see is this:
>>> 
>>> 
>>> 20:57:42.382 [YarnTwillRunnerService 
>>> STARTING-SendThread(localhost.localdomain:2181)] DEBUG 
>>> org.apache.zookeeper.ClientCnxn - Reading
>>> reply sessionid:0x144e1a859d40052, packet:: 
>>> clientPath:/MY_BASE_APP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state 
>>> serverPath:/MY_BASE_APP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state 
>>> finished:false header:: 15,4  replyHeader:: 15,652,0  
>>> request:: '/MY_BASE_APP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state,T  
>>> response:: 
>>> #7b227374617465223a2253544f5050494e47227d,s{627,652,1395363459875,1395363462375,3,0,0,0,20,0,627}
>>> 20:57:42.639 [Kafka-Consumer-log-0] INFO  
>>> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on 
>>> TopicPartition{topic=log, partition=0}.
>>> java.net.ConnectException: Connection refused
>>>       at sun.nio.ch.Net.connect0(Native Method) ~[na:1.7.0_45]
>>>       at sun.nio.ch.Net.connect(Net.java:465) ~[na:1.7.0_45]
>>>       at sun.nio.ch.Net.connect(Net.java:457) ~[na:1.7.0_45]
>>>       at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:666) 
>>> ~[na:1.7.0_45]
>>>       at kafka.network.BlockingChannel.connect(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.consumer.SimpleConsumer.connect(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.consumer.SimpleConsumer.reconnect(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.consumer.SimpleConsumer.liftedTree1$1(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at 
>>> kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(Unknown
>>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at 
>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Unknown
>>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at 
>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
>>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at 
>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
>>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.metrics.KafkaTimer.time(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at 
>>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown 
>>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown 
>>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.metrics.KafkaTimer.time(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.consumer.SimpleConsumer.fetch(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at kafka.javaapi.consumer.SimpleConsumer.fetch(Unknown Source) 
>>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>>       at 
>>> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.fetchMessages(SimpleKafkaConsumer.java:419)
>>>  ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>>       at 
>>> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.run(SimpleKafkaConsumer.java:355)
>>>  ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>> 20:57:42.642 [Kafka-Consumer-log-0] INFO  
>>> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on 
>>> TopicPartition{topic=log, partition=0}.
>>> java.net.ConnectException: Connection refused
>>> 
>>> 
>>> I have also attached the application logs from the YARN side. They show a 
>>> different exception.
>>> 
>>> [main] ERROR o.apache.twill.internal.ServiceMain - Exception when starting 
>>> service 
>>> org.apache.twill.internal.appmaster.ApplicationMasterService@1d16eaf2.
>>> java.util.concurrent.ExecutionException: 
>>> java.util.concurrent.ExecutionException: 
>>> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
>>> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>>>       at 
>>> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:294)
>>>  ~[guava-13.0.1.jar:na]
>>>       at 
>>> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:281)
>>>  ~[guava-13.0.1.jar:na]
>>>       at 
>>> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>>>  ~[guava-13.0.1.jar:na]
>>>       at org.apache.twill.internal.ServiceMain.doMain(ServiceMain.java:80) 
>>> ~[twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>>       at 
>>> org.apache.twill.internal.appmaster.ApplicationMasterMain.main(ApplicationMasterMain.java:69)
>>>  [twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
>>> ~[na:1.7.0_45]
>>>       at 
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>>  ~[na:1.7.0_45]
>>>       at 
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>  ~[na:1.7.0_45]
>>>       at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_45]
>>>       at 
>>> org.apache.twill.launcher.TwillLauncher.main(TwillLauncher.java:86) 
>>> [launcher.71cb0f5e-fc14-43e7-8149-71e57defd89f.jar:na]
>>> java.util.concurrent.ExecutionException: 
>>> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
>>> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>>> 
>>> 
>>> 
>>> Please help!
>>> 
>>> Safder
>>> 
>>> 
>>> <yarn.log>
>> 
