Hey Stephen,

I was on vacation last week; I'm looking over the logs this week.  I've got
a few ideas for a first pass at a fix, but it may take me a while as I get back into work.
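
For what it's worth, the kind of guard I have in mind looks roughly like the
sketch below.  This is only a sketch against the stock Hadoop 2.7.2 scheduler
API, not the actual patch - the class name and log message are placeholders,
and it assumes AbstractYarnScheduler#getSchedulerNode() is safe to call from
handle():

import org.apache.hadoop.yarn.api.records.NodeId;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeResourceUpdateSchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEvent;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.SchedulerEventType;
import org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch only: drop NODE_RESOURCE_UPDATE events for nodes the scheduler no
// longer tracks (e.g. a node that just transitioned to LOST) instead of
// letting AbstractYarnScheduler.updateNodeResource hit the NPE.
public class GuardedFairScheduler extends FairScheduler {

  private static final Logger LOGGER =
      LoggerFactory.getLogger(GuardedFairScheduler.class);

  @Override
  public void handle(SchedulerEvent event) {
    if (event.getType() == SchedulerEventType.NODE_RESOURCE_UPDATE) {
      NodeResourceUpdateSchedulerEvent update =
          (NodeResourceUpdateSchedulerEvent) event;
      NodeId nodeId = update.getRMNode().getNodeID();
      // getSchedulerNode() returns null once the node has been removed from
      // the scheduler's node map - exactly the window visible in your logs.
      if (getSchedulerNode(nodeId) == null) {
        LOGGER.warn("Ignoring NODE_RESOURCE_UPDATE for removed node " + nodeId);
        return;
      }
    }
    super.handle(event);
  }
}

The real fix probably belongs in MyriadFairScheduler (or wherever the resource
update gets dispatched from), but the null check is the shape of it.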

Darin

On Fri, Jul 1, 2016 at 2:43 AM, Stephen Gran <stephen.g...@piksel.com>
wrote:

> Hi,
>
> It's not a problem at all.  Anything I can do to help.
>
> I've attached the log file for the relevant time period.  This is hadoop
> 2.7.2 - you have a good memory :)
>
> Cheers,
>
> On 30/06/16 22:56, Darin Johnson wrote:
> > Hey Stephen,
> >
> > Looks like this might be slightly different from what I was originally
> > expecting.  Sorry to keep asking for more info, but it will help me
> > recreate the issue.  Could you possibly get me more of the
> > ResourceManager logs?  In particular, I'm trying to figure out where
> > upgradeNodeCapacity is getting called from, and any transitions of
> > slave2.  Also, what version of hadoop are you running?  I think I
> > recall it being 2.7.2, but I should verify.
> >
> > Thanks for taking the time to work with me on this.
> >
> > Darin
> >
> > On Thu, Jun 30, 2016 at 5:10 PM, Stephen Gran <stephen.g...@piksel.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Yes - the imaginatively named slave2 was a zero-sized NM at that point -
> >> I am looking at how small a pool of reserved resources I can get away
> >> with, using FGS for burst activity.
> >>
> >>
> >> Here are all the logs related to that host:port combination around that
> >> time:
> >>
> >> 2016-06-30 19:47:43,756 INFO
> >> org.apache.hadoop.yarn.util.AbstractLivelinessMonitor:
> >> Expired:slave2:24679 Timed out after 2 secs
> >> 2016-06-30 19:47:43,771 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> >> Deactivating Node slave2:24679 as it is now LOST
> >> 2016-06-30 19:47:43,771 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> >> slave2:24679 Node Transitioned from RUNNING to LOST
> >> 2016-06-30 19:47:43,909 INFO
> >> org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task
> >> yarn_Container: [ContainerId: container_1467314892573_0009_01_000005,
> >> NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource:
> >> <memory:2048, vCores:1>, Priority: 20, Token: Token { kind:
> >> ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0
> >> cpu and 1 mem.
> >> 2016-06-30 19:47:43,909 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
> >> Released container container_1467314892573_0009_01_000005 of capacity
> >> <memory:2048, vCores:1> on host slave2:24679, which currently has 1
> >> containers, <memory:2048, vCores:1> used and <memory:2048, vCores:1>
> >> available, release resources=true
> >> 2016-06-30 19:47:43,909 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> >> Application attempt appattempt_1467314892573_0009_000001 released
> >> container container_1467314892573_0009_01_000005 on node: host:
> >> slave2:24679 #containers=1 available=<memory:2048, vCores:1>
> >> used=<memory:2048, vCores:1> with event: KILL
> >> 2016-06-30 19:47:43,909 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
> >> Node not found resyncing slave2:24679
> >> 2016-06-30 19:47:43,952 INFO
> >> org.apache.myriad.scheduler.fgs.YarnNodeCapacityManager: Removed task
> >> yarn_Container: [ContainerId: container_1467314892573_0009_01_000006,
> >> NodeId: slave2:24679, NodeHttpAddress: slave2:23177, Resource:
> >> <memory:2048, vCores:1>, Priority: 20, Token: Token { kind:
> >> ContainerToken, service: 10.0.5.5:24679 }, ] with exit status freeing 0
> >> cpu and 1 mem.
> >> 2016-06-30 19:47:43,952 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
> >> Released container container_1467314892573_0009_01_000006 of capacity
> >> <memory:2048, vCores:1> on host slave2:24679, which currently has 0
> >> containers, <memory:0, vCores:0> used and <memory:4096, vCores:2>
> >> available, release resources=true
> >> 2016-06-30 19:47:43,952 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> >> Application attempt appattempt_1467314892573_0009_000001 released
> >> container container_1467314892573_0009_01_000006 on node: host:
> >> slave2:24679 #containers=0 available=<memory:4096, vCores:2>
> >> used=<memory:0, vCores:0> with event: KILL
> >> 2016-06-30 19:47:43,952 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> >> Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
> >> 2016-06-30 19:47:47,573 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl:
> >> slave2:24679 Node Transitioned from NEW to RUNNING
> >> 2016-06-30 19:47:47,936 INFO
> >> org.apache.hadoop.yarn.server.resourcemanager.ResourceTrackerService:
> >> NodeManager from node slave2(cmPort: 24679 httpPort: 23177) registered
> >> with capability: <memory:0, vCores:0>, assigned nodeId slave2:24679
> >>
> >>
> >> Looks like it did go into LOST for a bit.
> >>
> >> Cheers,
> >>
> >> On 30/06/16 21:36, Darin Johnson wrote:
> >>> Stephen, thanks.  I thought I had fixed that, but perhaps a regression
> >>> was made in another merge.  I'll look into it; can you answer a few
> >>> questions?  Was the node (slave2) a zero-sized NodeManager (for FGS)?
> >>> In the NodeManager logs, had it recently become unhealthy?  I'm pretty
> >>> concerned about this and will try to get a patch soon.
> >>>
> >>> Thanks,
> >>>
> >>> Darin
> >>> On Jun 30, 2016 3:53 PM, "Stephen Gran" <stephen.g...@piksel.com>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Just playing with the 0.2.0 release (congratulations, by the way!)
> >>>>
> >>>> I have seen this twice now, although it is by no means consistent - I
> >>>> will have a dozen successful runs, and then one of these.  This exits
> >>>> the RM, which makes it rather noticeable.
> >>>>
> >>>> 2016-06-30 19:47:43,952 INFO
> >>>> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
> >>>> Removed node slave2:24679 cluster capacity: <memory:4096, vCores:4>
> >>>> 2016-06-30 19:47:43,953 FATAL
> >>>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in
> >>>> handling event type NODE_RESOURCE_UPDATE to the scheduler
> >>>> java.lang.NullPointerException
> >>>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:563)
> >>>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.updateNodeResource(FairScheduler.java:1652)
> >>>>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1222)
> >>>>         at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:102)
> >>>>         at org.apache.myriad.scheduler.yarn.MyriadFairScheduler.handle(MyriadFairScheduler.java:42)
> >>>>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:671)
> >>>>         at java.lang.Thread.run(Thread.java:745)
> >>>> 2016-06-30 19:47:43,972 INFO
> >>>> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting,
> >>>> bbye..
> >>>>
> >>>> --
> >>>> Stephen Gran
> >>>> Senior Technical Architect
> >>>>
> >>>> picture the possibilities | piksel.com
> >>>>
> >>>
> >>
> >> --
> >> Stephen Gran
> >> Senior Technical Architect
> >>
> >> picture the possibilities | piksel.com
> >>
> >
>
> --
> Stephen Gran
> Senior Technical Architect
>
> picture the possibilities | piksel.com
>
