Thanks for digging into this. Would you mind opening a JIRA so we can
discuss it further? Much appreciated.

Arun

On Mon, Sep 8, 2014 at 7:15 PM, Anfernee Xu <anfernee...@gmail.com> wrote:

> It turned out not to be a configuration issue. A worker thread that
> submits jobs to YARN was blocked; see the thread dump below:
>
> "pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked, native_blocked
>     -- Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>     at __lll_lock_wait+36(:0)@0x340260d594
>     at tsSleep+399(threadsystem.c:83)@0x2b2356e5da80
>     at jrockit/vm/Threads.sleep(I)V(Native Method)
>     at jrockit/vm/Locks.waitForThinRelease(Locks.java:955)[optimized]
>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1083)[optimized]
>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
>     at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
>     at org/apache/hadoop/ipc/Client$Connection.access$2500(Client.java:314)[inlined]
>     at org/apache/hadoop/ipc/Client.getConnection(Client.java:1393)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown Source)
>     at org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown Source)[optimized]
>     at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]
>     at org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x100522fb8[biased lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> The lock was held by
>
> "pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping, native_waiting
>     at pthread_cond_timedwait@@GLIBC_2.3.2+288(:0)@0x340260b1c0
>     at eventTimedWaitNoTransitionImpl+46(event.c:93)@0x2b2356cc741f
>     at syncWaitForSignalNoTransition+133(synchronization.c:51)@0x2b2356e5a096
>     at syncWaitForSignal+189(synchronization.c:85)@0x2b2356e5a1ae
>     at vmtSleep+165(signaling.c:197)@0x2b2356e35ef6
>     at JVM_Sleep+188(jvmthreads.c:119)@0x2b2356d6bb7d
>     at java/lang/Thread.sleep(J)V(Native Method)
>     at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
>     at org/apache/hadoop/ipc/Client$Connection.setupConnection(Client.java:566)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60 [recursive]
>     at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
>     ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
>     at org/apache/hadoop/ipc/Client$Connection.access$2600(Client.java:314)[inlined]
>     at org/apache/hadoop/ipc/Client.getConnection(Client.java:1399)[optimized]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1318)[inlined]
>     at org/apache/hadoop/ipc/Client.call(Client.java:1300)[inlined]
>     at org/apache/hadoop/ipc/ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)[optimized]
>     at $Proxy21.getJobReport(Lcom/google/protobuf/RpcController;Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportRequestProto;)Lorg/apache/hadoop/mapreduce/v2/proto/MRServiceProtos$GetJobReportResponseProto;(Unknown Source)
>     at org/apache/hadoop/mapreduce/v2/api/impl/pb/client/MRClientProtocolPBClientImpl.getJobReport(MRClientProtocolPBClientImpl.java:133)[optimized]
>     at sun/reflect/GeneratedMethodAccessor79.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object;(Unknown Source)[optimized]
>     at sun/reflect/DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)[optimized]
>     at java/lang/reflect/Method.invoke(Method.java:597)[inlined]
>     at org/apache/hadoop/mapred/ClientServiceDelegate.invoke(ClientServiceDelegate.java:317)[inlined]
>     at org/apache/hadoop/mapred/ClientServiceDelegate.getJobStatus(ClientServiceDelegate.java:416)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapred/ClientServiceDelegate@0x1012c34f8[biased lock]
>     at org/apache/hadoop/mapred/YarnRunner.getJobStatus(TIEYarnRunner.java:522)[optimized]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:314)[inlined]
>     at org/apache/hadoop/mapreduce/Job$1.run(Job.java:311)[inlined]
>     at jrockit/vm/AccessController.doPrivileged(AccessController.java:254)[inlined]
>     at javax/security/auth/Subject.doAs(Subject.java:396)[inlined]
>     at org/apache/hadoop/security/UserGroupInformation.doAs(UserGroupInformation.java:1491)[inlined]
>     at org/apache/hadoop/mapreduce/Job.updateStatus(Job.java:311)[optimized]
>     ^-- Holding lock: org/apache/hadoop/mapreduce/Job@0x1016e05a8[biased lock]
>     at org/apache/hadoop/mapreduce/Job.isComplete(Job.java:599)
>     at org/apache/hadoop/mapreduce/Job.waitForCompletion(Job.java:1294)
>
> You can see that the thread holding the lock is sleeping, and the calling
> method is Connection.handleConnectionFailure(). I checked our log file and
> found that the connection failures occur because the historyserver is not
> available. In my case I did not start the historyserver at all, because it
> is not needed (I disabled log aggregation). So my question is: why is the
> job client still trying to talk to the historyserver even though log
> aggregation is disabled?
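
For anyone reading a dump like this, the matching of blocked threads to lock holders can be automated. Below is a minimal sketch, assuming a JRockit-style dump whose frames are not line-wrapped. The sample text is an abbreviated excerpt of the two threads above (most frames omitted), and the helper name find_lock_holders is hypothetical, not part of any Hadoop or JRockit tooling.

```python
import re

# Abbreviated, unwrapped excerpt of the two threads in the dump above.
# Most frames are omitted; only the lines relevant to the lock are kept.
DUMP = '''\
"pool-1-thread-160" id=194 idx=0x30c tid=886 prio=5 alive, blocked, native_blocked
    -- Blocked trying to get lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
    at org/apache/hadoop/ipc/Client$Connection.addCall(Client.java:400)[inlined]
    ^-- Holding lock: org/apache/hadoop/mapred/ClientServiceDelegate@0x10087d788[biased lock]

"pool-1-thread-10" id=44 idx=0xb4 tid=736 prio=5 alive, sleeping, native_waiting
    at java/lang/Thread.sleep(J)V(Native Method)
    at org/apache/hadoop/ipc/Client$Connection.handleConnectionFailure(Client.java:778)[optimized]
    at org/apache/hadoop/ipc/Client$Connection.setupIOstreams(Client.java:642)[optimized]
    ^-- Holding lock: org/apache/hadoop/ipc/Client$Connection@0x1059d0c60[thin lock]
'''

# A thread header starts with the quoted thread name.
THREAD_RE = re.compile(r'^"([^"]+)"')
# A lock line names the class and address, e.g. Foo@0x1234, before the [kind] suffix.
LOCK_RE = re.compile(r'(Blocked trying to get lock|Holding lock):\s*(\S+@0x[0-9a-f]+)')


def find_lock_holders(dump):
    """Map each blocked thread to (lock it waits for, threads holding that lock)."""
    waiting = {}   # thread name -> lock it is blocked on
    holders = {}   # lock -> names of threads holding it
    current = None
    for line in dump.splitlines():
        header = THREAD_RE.match(line)
        if header:
            current = header.group(1)
            continue
        found = LOCK_RE.search(line)
        if found and current is not None:
            kind, lock = found.groups()
            if kind.startswith("Blocked"):
                waiting[current] = lock
            elif current not in holders.setdefault(lock, []):
                holders[lock].append(current)
    return {t: (lock, holders.get(lock, [])) for t, lock in waiting.items()}


for thread, (lock, held_by) in find_lock_holders(DUMP).items():
    print(f"{thread} waits for {lock}, held by {held_by}")
```

On the excerpt, this pairs pool-1-thread-160 with pool-1-thread-10 through the shared Client$Connection@0x1059d0c60 address, which is the same conclusion reached by reading the dump by hand.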
>
> Thanks
>
>
>
> On Mon, Sep 8, 2014 at 3:57 AM, Arun Murthy <a...@hortonworks.com> wrote:
>
>> How many nodes do you have in your cluster?
>>
>> Also, could you share the CapacityScheduler initialization logs for each
>> queue, such as:
>>
>> 2014-08-14 15:14:23,835 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
>> Initialized queue: unfunded: capacity=0.5, absoluteCapacity=0.5,
>> usedResources=<memory:0, vCores:0>, usedCapacity=0.0,
>> absoluteUsedCapacity=0.0, numApps=0, numContainers=0
>> 2014-08-14 15:14:23,840 INFO
>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
>> Initializing default
>> capacity = 0.5 [= (float) configuredCapacity / 100 ]
>> asboluteCapacity = 0.5 [= parentAbsoluteCapacity * capacity ]
>> maxCapacity = 1.0 [= configuredMaxCapacity ]
>> absoluteMaxCapacity = 1.0 [= 1.0 maximumCapacity undefined,
>> (parentAbsoluteMaxCapacity * maximumCapacity) / 100 otherwise ]
>> userLimit = 100 [= configuredUserLimit ]
>> userLimitFactor = 1.0 [= configuredUserLimitFactor ]
>> maxApplications = 5000 [= configuredMaximumSystemApplicationsPerQueue or
>> (int)(configuredMaximumSystemApplications * absoluteCapacity)]
>> maxApplicationsPerUser = 5000 [= (int)(maxApplications * (userLimit /
>> 100.0f) * userLimitFactor) ]
>> maxActiveApplications = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) * maxAMResourcePerQueuePercent * absoluteMaxCapacity),1)
>> ]
>> maxActiveAppsUsingAbsCap = 1 [= max((int)ceil((clusterResourceMemory /
>> minimumAllocation) *maxAMResourcePercent * absoluteCapacity),1) ]
>> maxActiveApplicationsPerUser = 1 [= max((int)(maxActiveApplications *
>> (userLimit / 100.0f) * userLimitFactor),1) ]
>> usedCapacity = 0.0 [= usedResourcesMemory / (clusterResourceMemory *
>> absoluteCapacity)]
>> absoluteUsedCapacity = 0.0 [= usedResourcesMemory / clusterResourceMemory]
>> maxAMResourcePerQueuePercent = 0.1 [= configuredMaximumAMResourcePercent ]
>> minimumAllocationFactor = 0.87506104 [= (float)(maximumAllocationMemory -
>> minimumAllocationMemory) / maximumAllocationMemory ]
>> numContainers = 0 [= currentNumContainers ]
>> state = RUNNING [= configuredState ]
>> acls = SUBMIT_APPLICATIONS: ADMINISTER_QUEUE:  [= configuredAcls ]
>> nodeLocalityDelay = 0
>>
>>
>> Then, look at values for maxActiveAppsUsingAbsCap &
>> maxActiveApplicationsPerUser. That should help debugging.
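
The two limits mentioned above can be reproduced by hand from the formulas printed in the log. The following is a rough sketch that transcribes those formulas into Python; the function names and all input numbers are hypothetical (they are not taken from the cluster in this thread), and Python floats are used where the scheduler uses Java floats, so treat it as an approximation.

```python
import math


def max_active_apps(cluster_memory_mb, minimum_allocation_mb,
                    am_resource_percent, absolute_capacity):
    """max((int)ceil((clusterResourceMemory / minimumAllocation)
                     * maxAMResourcePercent * absoluteCapacity), 1)"""
    slots = cluster_memory_mb / minimum_allocation_mb
    return max(math.ceil(slots * am_resource_percent * absolute_capacity), 1)


def max_active_apps_per_user(max_active, user_limit, user_limit_factor):
    """max((int)(maxActiveApplications * (userLimit / 100.0f)
                 * userLimitFactor), 1)"""
    return max(int(max_active * (user_limit / 100.0) * user_limit_factor), 1)


# Hypothetical small cluster: 8 GB total, 1 GB minimum allocation,
# 10% AM resource percent, queue with absoluteCapacity 0.5.
# 8 * 0.1 * 0.5 = 0.4, which is rounded up to 1 active application.
small = max_active_apps(8192, 1024, 0.1, 0.5)

# Hypothetical larger cluster: 200 GB total, queue with absoluteCapacity 0.12.
# 200 * 0.1 * 0.12 = 2.4, which is rounded up to 3 active applications.
larger = max_active_apps(204800, 1024, 0.1, 0.12)

print(small, larger)
print(max_active_apps_per_user(larger, 100, 1.0))  # full user limit
print(max_active_apps_per_user(larger, 1, 1.0))    # user limit of 1 percent
```

With a small cluster or a small queue share, the product falls below 1 and the limit is clamped to a single active application, so comparing these computed values against the logged ones is a quick way to see which input is the constraint.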
>>
>> thanks,
>> Arun
>>
>>
>> On Sun, Sep 7, 2014 at 9:37 AM, Anfernee Xu <anfernee...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I'm running my cluster on Hadoop 2.2.0 with the CapacityScheduler. All
>>> my jobs are uberized and run across 2 queues: one queue takes the
>>> majority of the capacity (90%), the other takes 10%. What I found is that
>>> in the small queue only one job runs at any given time. I tried tweaking
>>> the properties below, but no luck so far. Could you shed some light on this?
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>     <description>
>>>       Maximum percent of resources in the cluster which can be used to run
>>>       application masters i.e. controls number of concurrent running
>>>       applications.
>>>     </description>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.queues</name>
>>>     <value>default,small</value>
>>>     <description>
>>>       The queues at this level (root is the root queue).
>>>     </description>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.small.maximum-am-resource-percent</name>
>>>     <value>1.0</value>
>>>   </property>
>>>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.small.user-limit</name>
>>>     <value>1</value>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.capacity</name>
>>>     <value>88</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.capacity</name>
>>>     <value>12</value>
>>>     <description>Default queue target capacity.</description>
>>>   </property>
>>>
>>>  <property>
>>>     <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
>>>     <value>88</value>
>>>     <description>
>>>       The maximum capacity of the default queue.
>>>     </description>
>>>   </property>
>>>
>>>   <property>
>>>     <name>yarn.scheduler.capacity.root.small.maximum-capacity</name>
>>>     <value>12</value>
>>>     <description>Maximum queue capacity.</description>
>>>   </property>
>>>
>>>
>>> Thanks
>>>
>>> --
>>> --Anfernee
>>>
>>
>>
>>
>> --
>>
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>>
>> CONFIDENTIALITY NOTICE
>> NOTICE: This message is intended for the use of the individual or entity
>> to which it is addressed and may contain information that is confidential,
>> privileged and exempt from disclosure under applicable law. If the reader
>> of this message is not the intended recipient, you are hereby notified that
>> any printing, copying, dissemination, distribution, disclosure or
>> forwarding of this communication is strictly prohibited. If you have
>> received this communication in error, please contact the sender immediately
>> and delete it from your system. Thank You.
>
>
>
>
> --
> --Anfernee
>



-- 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/

