Hi Safder,

It seems you are not subscribed to the twill-dev mailing list. Hitting
reply on a mail sent to dev@ replies to dev@ only, so you will not
receive the responses directly. You can subscribe by sending an email to
[email protected].

The problem you are seeing is caused by the NodeManager killing the
container because its virtual memory usage is too high (not heap
memory). You can fix that by setting a higher ratio for the
"yarn.nodemanager.vmem-pmem-ratio" property (say 5.1), or simply
turn the check off if desired (set
"yarn.nodemanager.vmem-check-enabled" to false). Both settings go in
yarn-site.xml.
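
As a rough sketch, the yarn-site.xml entries could look like the
following. The values are illustrative and only one of the two
properties is needed. This assumes the stock property names above and
that the NodeManager is restarted afterwards so it picks up the change.

```xml
<!-- yarn-site.xml: illustrative values; use either one of the two properties -->
<configuration>
  <!-- Option 1: raise the allowed virtual-to-physical memory ratio -->
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>5.1</value>
  </property>

  <!-- Option 2: disable the virtual memory check entirely -->
  <property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
  </property>
</configuration>
```

For reference, the default ratio is 2.1, so a 512 MB container gets
roughly 512 x 2.1 = 1075 MB of virtual memory, which is the "1.0 GB"
limit reported in your log. With a ratio of 5.1 the same container
would be allowed about 2.5 GB.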

Terence

On Fri, Mar 21, 2014 at 11:48 AM, safder <[email protected]> wrote:
> Hi Terence,
>
> Don't know exactly why I don't get your emails.
>
> After looking at your reply, I was going over all the logs. I don't see much. 
> The first attempt for application Master just fails with exit code 143
>
> Here are some logs for it.
>
>
> 2014-03-21 14:35:24,768 WARN  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:run(435)) - Container 
> [pid=5409,containerID=container_1395355306546_0023_01_000001] is running 
> beyond virtual memory limits. Current usage: 173.9 MB of 512 MB physical 
> memory used; 1.1 GB of 1.0 GB virtual memory used. Killing container.
> Dump of the process-tree for container_1395355306546_0023_01_000001 :
>         |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) 
> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
>         |- 5419 5409 5409 5409 (java) 533 27 1123053568 44208 java 
> -Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023 
> -Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m 
> org.apache.twill.launcher.TwillLauncher appMaster.jar 
> org.apache.twill.internal.appmaster.ApplicationMasterMain false
>         |- 5409 25677 5409 5409 (bash) 0 0 108650496 307 /bin/bash -c java 
> -Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023 
> -Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m  
> org.apache.twill.launcher.TwillLauncher appMaster.jar 
> org.apache.twill.internal.appmaster.ApplicationMasterMain false 
> 1>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stdout
>  
> 2>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stderr
>
> 2014-03-21 14:35:24,768 INFO  monitor.ContainersMonitorImpl 
> (ContainersMonitorImpl.java:run(445)) - Removed ProcessTree with root 5409
> 2014-03-21 14:35:24,768 INFO  container.Container 
> (ContainerImpl.java:handle(871)) - Container 
> container_1395355306546_0023_01_000001 transitioned from RUNNING to KILLING
> 2014-03-21 14:35:24,769 INFO  launcher.ContainerLaunch 
> (ContainerLaunch.java:cleanupContainer(341)) - Cleaning up container 
> container_1395355306546_0023_01_000001
> 2014-03-21 14:35:24,773 WARN  nodemanager.DefaultContainerExecutor 
> (DefaultContainerExecutor.java:launchContainer(207)) - Exit code from 
> container container_1395355306546_0023_01_000001 is : 143
> 2014-03-21 14:35:24,796 INFO  container.Container 
> (ContainerImpl.java:handle(871)) - Container 
> container_1395355306546_0023_01_000001 transitioned from KILLING to 
> CONTAINER_CLEANEDUP_AFTER_KILL
>
>
>
> I don't see much in either the application logs or the logs generated on the 
> client. Could you also direct me if I should be looking at any particular log?
>
> I am looking at
> 1: resourcemanager logs (says first attempt exited with exit code 143 as above)
> 2: nodemanager logs (says first attempt exited with exit code 143 as above)
> 3: twill client logs (just goes from starting to killed)
> 4: yarn application logs (yarn logs -applicationId) (same as above. Says 
> requesting 1 container then goes to stopping)
> 14:35:23.324 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
> Starting kafka server
> 14:35:23.684 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
> Kafka server started
> 14:35:23.694 [ServiceDelegate] INFO  o.a.t.internal.ZKServiceDecorator - 
> Running: 3b4a1312-d7dd-42d1-9733-0c8a87155a52
> 14:35:23.695 [main] INFO  o.apache.twill.internal.ServiceMain - Service 
> org.apache.twill.internal.appmaster.ApplicationMasterService@4e4d2176 started.
> 14:35:23.752 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
> Request 1 container with capability <memory:512, vCores:1>
> 14:35:24.777 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
> Stop application master with spec: 
> {"name":"DistributedShell","runnables":{"DistributedShell":{"name":"DistributedShell","runnable":{"classname":"net.skytree.yarn.app.DistributedShell","name":"DistributedShell","arguments":{"cmds":"pwd;ls
>  
> -al"}},"resources":{"cores":1,"memorySize":512,"instances":1,"uplink":-1,"downlink":-1,"hosts":[],"racks":[]},"files":[]}},"orders":[{"names":["DistributedShell"],"type":"STARTED"}],"handler":{"classname":"org.apache.twill.internal.LogOnlyEventHandler","configs":{}}}
> 14:35:24.778 [Thread-4] INFO  o.a.t.internal.ZKServiceDecorator - Stopping: 
> 3b4a1312-d7dd-42d1-9733-0c8a87155a52
> 14:35:24.782 [ServiceDelegate] INFO  o.a.t.i.a.ApplicationMasterService - 
> Stopping application master tracker server
> Cleanup directory tmp/twill.launcher-1395426920855-0
>
>
>
> On Mar 20, 2014, at 6:06 PM, safder <[email protected]> wrote:
>
>> Hi Guys,
>>
>> Needed help with Twill. I am trying to run a simple Distributed Shell 
>> application on a single node cluster. When I run it, in the standard out 
>> logs I get a ton of kafka related errors. I tee'ed the logs, but each run 
>> was making 25MBs of it. The only main exception I see is this
>>
>>
>> 20:57:42.382 [YarnTwillRunnerService 
>> STARTING-SendThread(localhost.localdomain:2181)] DEBUG 
>> org.apache.zookeeper.ClientCnxn - Reading
>> reply sessionid:0x144e1a859d40052, packet:: 
>> clientPath:/MY_BASE_APP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state 
>> serverPath:/MY_BASE_A
>> PP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state finished:false header:: 15,4  
>> replyHeader:: 15,652,0  request:: '/MY_BASE_APP/c47fd263-
>> a5c1-48ef-8c76-a91cf8009431/state,T  response:: 
>> #7b227374617465223a2253544f5050494e47227d,s{627,652,1395363459875,1395363462375,3,0,0
>> ,0,20,0,627}
>> 20:57:42.639 [Kafka-Consumer-log-0] INFO  
>> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on 
>> TopicPartition{to
>> pic=log, partition=0}.
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.Net.connect0(Native Method) ~[na:1.7.0_45]
>>        at sun.nio.ch.Net.connect(Net.java:465) ~[na:1.7.0_45]
>>        at sun.nio.ch.Net.connect(Net.java:457) ~[na:1.7.0_45]
>>        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:666) 
>> ~[na:1.7.0_45]
>>        at kafka.network.BlockingChannel.connect(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.consumer.SimpleConsumer.connect(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.consumer.SimpleConsumer.reconnect(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.consumer.SimpleConsumer.liftedTree1$1(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at 
>> kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(Unknown
>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Unknown
>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
>>  Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.metrics.KafkaTimer.time(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at 
>> kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown 
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown 
>> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.metrics.KafkaTimer.time(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.consumer.SimpleConsumer.fetch(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at kafka.javaapi.consumer.SimpleConsumer.fetch(Unknown Source) 
>> ~[kafka_2.10-0.8.0.jar:0.8.0]
>>        at 
>> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.fetchMessages(SimpleKafkaConsumer.java:419)
>>  ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>        at 
>> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.run(SimpleKafkaConsumer.java:355)
>>  ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>> 20:57:42.642 [Kafka-Consumer-log-0] INFO  
>> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on 
>> TopicPartition{topic=log, partition=0}.
>> java.net.ConnectException: Connection refused
>>
>>
>> I also attached the application logs on the yarn end. That is showing a 
>> different exception.
>>
>> [main] ERROR o.apache.twill.internal.ServiceMain - Exception when starting 
>> service 
>> org.apache.twill.internal.appmaster.ApplicationMasterService@1d16eaf2.
>> java.util.concurrent.ExecutionException: 
>> java.util.concurrent.ExecutionException: 
>> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
>> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>>        at 
>> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:294)
>>  ~[guava-13.0.1.jar:na]
>>        at 
>> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:281)
>>  ~[guava-13.0.1.jar:na]
>>        at 
>> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
>>  ~[guava-13.0.1.jar:na]
>>        at org.apache.twill.internal.ServiceMain.doMain(ServiceMain.java:80) 
>> ~[twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>        at 
>> org.apache.twill.internal.appmaster.ApplicationMasterMain.main(ApplicationMasterMain.java:69)
>>  [twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
>> ~[na:1.7.0_45]
>>        at 
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>>  ~[na:1.7.0_45]
>>        at 
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>  ~[na:1.7.0_45]
>>        at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_45]
>>        at 
>> org.apache.twill.launcher.TwillLauncher.main(TwillLauncher.java:86) 
>> [launcher.71cb0f5e-fc14-43e7-8149-71e57defd89f.jar:na]
>> java.util.concurrent.ExecutionException: 
>> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = 
>> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>>
>>
>>
>> Please help!
>>
>> Safder
>>
>>
>> <yarn.log>
>
