Hi Terrence,
I’m not sure why I’m not receiving your emails.
After reading your reply, I went through all the logs, but I don’t see much.
The first attempt of the application master simply fails with exit code 143.
Here are some logs for it:
2014-03-21 14:35:24,768 WARN monitor.ContainersMonitorImpl
(ContainersMonitorImpl.java:run(435)) - Container
[pid=5409,containerID=container_1395355306546_0023_01_000001] is running beyond
virtual memory limits. Current usage: 173.9 MB of 512 MB physical memory used;
1.1 GB of 1.0 GB virtual memory used. Killing container.
Dump of the process-tree for container_1395355306546_0023_01_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 5419 5409 5409 5409 (java) 533 27 1123053568 44208 java
-Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023
-Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m
org.apache.twill.launcher.TwillLauncher appMaster.jar
org.apache.twill.internal.appmaster.ApplicationMasterMain false
|- 5409 25677 5409 5409 (bash) 0 0 108650496 307 /bin/bash -c java
-Djava.io.tmpdir=tmp -Dyarn.appId=application_1395355306546_0023
-Dtwill.app=DistributedShell -cp launcher.jar:/etc/hadoop/ -Xmx362m
org.apache.twill.launcher.TwillLauncher appMaster.jar
org.apache.twill.internal.appmaster.ApplicationMasterMain false
1>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stdout
2>/gird/hadoop/hdfs/yarn/logs/application_1395355306546_0023/container_1395355306546_0023_01_000001/stderr
2014-03-21 14:35:24,768 INFO monitor.ContainersMonitorImpl
(ContainersMonitorImpl.java:run(445)) - Removed ProcessTree with root 5409
2014-03-21 14:35:24,768 INFO container.Container
(ContainerImpl.java:handle(871)) - Container
container_1395355306546_0023_01_000001 transitioned from RUNNING to KILLING
2014-03-21 14:35:24,769 INFO launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainer(341)) - Cleaning up container
container_1395355306546_0023_01_000001
2014-03-21 14:35:24,773 WARN nodemanager.DefaultContainerExecutor
(DefaultContainerExecutor.java:launchContainer(207)) - Exit code from container
container_1395355306546_0023_01_000001 is : 143
2014-03-21 14:35:24,796 INFO container.Container
(ContainerImpl.java:handle(871)) - Container
container_1395355306546_0023_01_000001 transitioned from KILLING to
CONTAINER_CLEANEDUP_AFTER_KILL
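For what it is worth, the numbers in the warning above can be sanity-checked with a short script. This is only a rough sketch in Python: the byte counts are copied from the process-tree dump, the ratio of 2.1 is an assumption (I believe it is the default for yarn.nodemanager.vmem-pmem-ratio, but I have not confirmed what this cluster uses), and the helper names are made up for illustration.

```python
import signal

MB = 1024 * 1024

# VMEM_USAGE(BYTES) values copied from the process-tree dump above.
VMEM_BYTES = {
    "java (pid 5419)": 1123053568,
    "bash (pid 5409)": 108650496,
}

# Physical memory limit reported in the warning ("... of 512 MB physical memory").
PHYSICAL_LIMIT_MB = 512

# ASSUMPTION: yarn.nodemanager.vmem-pmem-ratio is left at 2.1 on this cluster.
VMEM_PMEM_RATIO = 2.1


def vmem_report(vmem_bytes, physical_limit_mb, ratio):
    """Return (used_mb, limit_mb, over_limit) for a container's process tree."""
    used_mb = sum(vmem_bytes.values()) / MB
    limit_mb = physical_limit_mb * ratio
    return used_mb, limit_mb, used_mb > limit_mb


def describe_exit_code(code):
    """Exit codes above 128 conventionally mean 'killed by signal (code - 128)'."""
    if code > 128:
        return signal.Signals(code - 128).name
    return None


used_mb, limit_mb, over_limit = vmem_report(
    VMEM_BYTES, PHYSICAL_LIMIT_MB, VMEM_PMEM_RATIO
)
print(f"virtual memory used: {used_mb:.1f} MB, limit: {limit_mb:.1f} MB")
print(f"over limit: {over_limit}")
print(f"exit code 143 -> {describe_exit_code(143)}")
```

With these inputs the two processes add up to roughly 1174.6 MB of virtual memory against a limit of 1075.2 MB, which would be consistent with the "1.1 GB of 1.0 GB" message, and 143 maps to SIGTERM (128 + 15), so the container appears to have been killed by the NodeManager rather than crashing on its own.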
I don’t see much in either the application logs or the logs generated on the
client. Could you point me to any particular log I should be looking at?
So far I am looking at:
1: ResourceManager logs (say the first attempt exited with exit code 143, as above)
2: NodeManager logs (say the first attempt exited with exit code 143, as above)
3: Twill client logs (just go from starting to killed)
4: YARN application logs (yarn logs -applicationId) (same as above: they say
requesting 1 container, then go to stopping)
14:35:23.324 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
Starting kafka server
14:35:23.684 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService - Kafka
server started
14:35:23.694 [ServiceDelegate] INFO o.a.t.internal.ZKServiceDecorator -
Running: 3b4a1312-d7dd-42d1-9733-0c8a87155a52
14:35:23.695 [main] INFO o.apache.twill.internal.ServiceMain - Service
org.apache.twill.internal.appmaster.ApplicationMasterService@4e4d2176 started.
14:35:23.752 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
Request 1 container with capability <memory:512, vCores:1>
14:35:24.777 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService - Stop
application master with spec:
{"name":"DistributedShell","runnables":{"DistributedShell":{"name":"DistributedShell","runnable":{"classname":"net.skytree.yarn.app.DistributedShell","name":"DistributedShell","arguments":{"cmds":"pwd;ls
-al"}},"resources":{"cores":1,"memorySize":512,"instances":1,"uplink":-1,"downlink":-1,"hosts":[],"racks":[]},"files":[]}},"orders":[{"names":["DistributedShell"],"type":"STARTED"}],"handler":{"classname":"org.apache.twill.internal.LogOnlyEventHandler","configs":{}}}
14:35:24.778 [Thread-4] INFO o.a.t.internal.ZKServiceDecorator - Stopping:
3b4a1312-d7dd-42d1-9733-0c8a87155a52
14:35:24.782 [ServiceDelegate] INFO o.a.t.i.a.ApplicationMasterService -
Stopping application master tracker server
Cleanup directory tmp/twill.launcher-1395426920855-0
On Mar 20, 2014, at 6:06 PM, safder <[email protected]> wrote:
> Hi Guys,
>
> I need help with Twill. I am trying to run a simple Distributed Shell
> application on a single-node cluster. When I run it, the standard output logs
> contain a ton of Kafka-related errors. I tee’ed the logs, but each run
> produced 25 MB of them. The only main exception I see is this:
>
>
> 20:57:42.382 [YarnTwillRunnerService
> STARTING-SendThread(localhost.localdomain:2181)] DEBUG
> org.apache.zookeeper.ClientCnxn - Reading
> reply sessionid:0x144e1a859d40052, packet::
> clientPath:/MY_BASE_APP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state
> serverPath:/MY_BASE_A
> PP/c47fd263-a5c1-48ef-8c76-a91cf8009431/state finished:false header:: 15,4
> replyHeader:: 15,652,0 request:: '/MY_BASE_APP/c47fd263-
> a5c1-48ef-8c76-a91cf8009431/state,T response::
> #7b227374617465223a2253544f5050494e47227d,s{627,652,1395363459875,1395363462375,3,0,0
> ,0,20,0,627}
> 20:57:42.639 [Kafka-Consumer-log-0] INFO
> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on
> TopicPartition{topic=log, partition=0}.
> java.net.ConnectException: Connection refused
> at sun.nio.ch.Net.connect0(Native Method) ~[na:1.7.0_45]
> at sun.nio.ch.Net.connect(Net.java:465) ~[na:1.7.0_45]
> at sun.nio.ch.Net.connect(Net.java:457) ~[na:1.7.0_45]
> at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:666)
> ~[na:1.7.0_45]
> at kafka.network.BlockingChannel.connect(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer.connect(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer.reconnect(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer.liftedTree1$1(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at
> kafka.consumer.SimpleConsumer.kafka$consumer$SimpleConsumer$$sendRequest(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at
> kafka.consumer.SimpleConsumer$$anonfun$fetch$1$$anonfun$apply$mcV$sp$1.apply(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.metrics.KafkaTimer.time(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply$mcV$sp(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer$$anonfun$fetch$1.apply(Unknown
> Source) ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.metrics.KafkaTimer.time(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.consumer.SimpleConsumer.fetch(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at kafka.javaapi.consumer.SimpleConsumer.fetch(Unknown Source)
> ~[kafka_2.10-0.8.0.jar:0.8.0]
> at
> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.fetchMessages(SimpleKafkaConsumer.java:419)
> ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
> at
> org.apache.twill.internal.kafka.client.SimpleKafkaConsumer$ConsumerThread.run(SimpleKafkaConsumer.java:355)
> ~[twill-core-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
> 20:57:42.642 [Kafka-Consumer-log-0] INFO
> o.a.t.i.k.client.SimpleKafkaConsumer - Exception when fetching message on
> TopicPartition{topic=log, partition=0}.
> java.net.ConnectException: Connection refused
>
>
> I have also attached the application logs from the YARN side. They show a
> different exception.
>
> [main] ERROR o.apache.twill.internal.ServiceMain - Exception when starting
> service org.apache.twill.internal.appmaster.ApplicationMasterService@1d16eaf2.
> java.util.concurrent.ExecutionException:
> java.util.concurrent.ExecutionException:
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
> at
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:294)
> ~[guava-13.0.1.jar:na]
> at
> com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:281)
> ~[guava-13.0.1.jar:na]
> at
> com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
> ~[guava-13.0.1.jar:na]
> at org.apache.twill.internal.ServiceMain.doMain(ServiceMain.java:80)
> ~[twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
> at
> org.apache.twill.internal.appmaster.ApplicationMasterMain.main(ApplicationMasterMain.java:69)
> [twill-yarn-0.2.0-incubating-SNAPSHOT.jar:0.2.0-incubating-SNAPSHOT]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[na:1.7.0_45]
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> ~[na:1.7.0_45]
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:1.7.0_45]
> at java.lang.reflect.Method.invoke(Method.java:606) ~[na:1.7.0_45]
> at org.apache.twill.launcher.TwillLauncher.main(TwillLauncher.java:86)
> [launcher.71cb0f5e-fc14-43e7-8149-71e57defd89f.jar:na]
> java.util.concurrent.ExecutionException:
> org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode =
> NodeExists for /c47fd263-a5c1-48ef-8c76-a91cf8009431/state
>
>
>
> Please help!
>
> Safder
>
>
> <yarn.log>