It looks like it's not very obvious from the INFO log, so changing the RM
log level to DEBUG might help. Below is what my RM log looks like between an
app transitioning to RUNNING and its first container being allocated, and I
don't see a specific log line about a request being received. Still, being
able to start an app the first time and not being able to restart it is
strange. I would check whether the app processes are actually getting stopped
(if they are still holding resources for some reason, that could interfere
with new allocations on the same hosts). Another thing I am wondering is
whether the host names Slider is using for requests match the host names the
RM is using for its NMs. If Slider requested a host that the RM didn't know
about, that could cause containers not to be allocated.
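For the log level and host name checks, something along these lines should
work (the daemonlog command and the default RM web port 8088 are assumptions
about your setup, so adjust them for your install; Cloudera Manager can also
change the RM log level through its configuration):

  # raise ResourceManager scheduler logging to DEBUG at runtime
  hadoop daemonlog -setlevel <rm-host>:8088 \
      org.apache.hadoop.yarn.server.resourcemanager DEBUG

  # list the NM host names the RM knows about, to compare with the hosts
  # Slider is requesting (hadoop-worknode01/04.our.net in your log)
  yarn node -list

  # per-node capacity and usage, to confirm a single NM can still fit a
  # <memory:64512, vCores:1> container
  yarn node -status <node-id>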

2017-07-28 15:43:36,219 INFO rmapp.RMAppImpl: application_1501255339386_0002 State change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
2017-07-28 15:43:36,865 INFO allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1501255339386_0002_000001 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@4b10a7b0 clusterResource=<memory:8192, vCores:8> type=OFF_SWITCH requestedPartition=
2017-07-28 15:43:36,866 INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:1024, vCores:1> cluster=<memory:8192, vCores:8>
2017-07-28 15:43:36,866 INFO rmcontainer.RMContainerImpl: container_e02_1501255339386_0002_01_000002 Container Transitioned from NEW to ALLOCATED
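On your side, grepping the RM log for your application id should show whether
the requests ever arrive and whether similar allocation lines appear (the log
path below is a guess for a CDH install, so adjust it):

  grep -E '1500976506429_0025|assignedContainer|Container Transitioned' \
      /var/log/hadoop-yarn/*RESOURCEMANAGER*.log*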


On Mon, Jul 31, 2017 at 9:18 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:

> Hi Billie,
>
> do you have any ideas how I can find out why the yarn resource manager
> is denying the requests … do I need to set the yarn log to debug mode to
> see such requests? In my opinion a denied request should result in a
> warning or at least in an info … in the yarn log I can see some denied
> reservations, but never for requests from the slider AM …
>
> I will ask our IT to set the debug flag for yarn and try to find out why
> yarn says no to the placement. My feeling is that yarn does not allow
> placements on specific hosts … but then I am really lost … ;)
>
> > On 31.07.2017 at 17:59, Billie Rinaldi <billie.rina...@gmail.com> wrote:
> >
> > Hi Frank,
> >
> >> My understanding here: Our yarn cannot satisfy anti-affinity ... despite
> >> cloudera saying they made backports from even hadoop 3.0.0 ...
> > Anti-affinity is not implemented in YARN at this time; it's implemented
> > entirely in Slider using its role history, so this should not be an issue.
> >
> >> they are all using a common path on the host
> > This may be a problem. Slider is not designed to support apps that store
> > data locally; app components might be brought up on any server. (See
> > http://slider.incubator.apache.org/docs/slider_specs/application_needs.html
> > for more information on Slider's expectations about apps.) You could get
> > around this somewhat by using strict placement (policy 1), so that
> > components will be restarted on the same nodes where they were started
> > previously. But if your component also needs anti-affinity, Slider does not
> > have a way to combine anti-affinity and strict placement (first spread the
> > components out, then after that only start them on the same nodes where
> > they were started previously). I have speculated whether it would be
> > possible to configure an app for anti-affinity for the first start, then
> > change it to strict, but I have not attempted this.
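> > For illustration only, strict placement for both components would look
> > roughly like this in resources.json (untested sketch; component names and
> > option names follow the presto-yarn sample resources file):
> >
> >   cat > resources.json <<'EOF'
> >   {
> >     "schema": "http://example.org/specification/v2.0.0",
> >     "metadata": {},
> >     "global": {},
> >     "components": {
> >       "slider-appmaster": {},
> >       "COORDINATOR": {
> >         "yarn.role.priority": "1",
> >         "yarn.component.instances": "1",
> >         "yarn.component.placement.policy": "1",
> >         "yarn.memory": "64512",
> >         "yarn.vcores": "1"
> >       },
> >       "WORKER": {
> >         "yarn.role.priority": "2",
> >         "yarn.component.instances": "1",
> >         "yarn.component.placement.policy": "1",
> >         "yarn.memory": "64512",
> >         "yarn.vcores": "1"
> >       }
> >     }
> >   }
> >   EOF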
> >
> >> To me this looks nice, but I never see a placement request at our yarn
> >> resource manager node. It looks like slider decides on its own that it
> >> cannot place the requests.
> > This is strange. You should see the container request in the RM. Slider
> > does not decide on its own; the "requests unsatisfiable" message only means
> > that a request has been sent to the RM and the RM has not allocated a
> > container.
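> > As a cross-check (assuming the default RM web UI port 8088; substitute your
> > application id), you can ask the RM directly what it has allocated for the
> > app, independent of what the Slider AM reports:
> >
> >   yarn application -status application_1500976506429_0025
> >   curl -s 'http://<rm-host>:8088/ws/v1/cluster/apps/application_1500976506429_0025'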
> >
> >
> >> On Fri, Jul 28, 2017 at 7:43 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:
> >
> >> Greetings to all ... and sorry for the long post ...
> >>
> >> I am trying to deploy presto on our hadoop cluster via slider. Slider
> >> really looks very promising for this task but somehow I am experiencing
> >> some glitches.
> >>
> >> I am using the following components:
> >>
> >>   - java version "1.8.0_131" (oracle jdk)
> >>   - Cloudera CDH 5.11.1
> >>   - slider-0.90.0
> >>   - presto-yarn https://github.com/prestodb/presto-yarn
> >>
> >> I can do:
> >>
> >>   1. slider package install --name PRESTO --package ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
> >>   2. slider create presto --template ./appConfig.json --resources ./resources.json
> >>
> >> The last statement sometimes succeeds in bringing up presto and everything
> >> looks fine: the exports are right and I can access the presto coordinator
> >> at the exported host_port.
> >>
> >> Then, when I stop the slider cluster and start it again, the slider AM
> >> comes up and says that the placement requests are unsatisfiable by the
> >> cluster.
> >>
> >> I experimented with changing the placement policy
> >> "yarn.component.placement.policy".
> >>
> >> Setting: "yarn.component.placement.policy": "1"
> >>
> >> Result: This works only when there is no history file, and even then only
> >> sometimes, when the requested containers end up on different hosts.
> >> Sometimes slider can place the containers on different hosts and
> >> everything is fine; sometimes not, and then it fails with an unstable
> >> application, since it tries to start two or more presto components on one
> >> host ...
> >>
> >> My understanding here: presto would need anti-affinity, even between the
> >> different component types, i.e. the roles COORDINATOR and WORKER are never
> >> allowed to run on the same host ... they are all using a common path on
> >> the host, so only one presto component can run on one host ... but yarn is
> >> not able to guarantee that ...
> >>
> >> Setting: "yarn.component.placement.policy": "4"
> >>
> >> Result: The slider AM starts up and says:
> >>
> >>   Diagnostics: 2 anti-affinity components have with requests unsatisfiable by cluster
> >>
> >> My understanding here: Our yarn cannot satisfy anti-affinity ... despite
> >> cloudera saying they made backports from even hadoop 3.0.0 ... I don't
> >> know how to check that ...
> >>
> >> Then I had an idea and went back to
> >>
> >> Setting: "yarn.component.placement.policy": "1"
> >>
> >> This fails at first. Then I edited the history.json file to place each
> >> component on a different host and hoped that would fix it. But even here
> >> the slider AM says that the requests are unsatisfiable by the cluster.
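> >> (By history.json I mean the role history file that Slider keeps in its
> >> cluster directory in HDFS; on our setup roughly
> >>   hdfs dfs -ls /user/<user>/.slider/cluster/presto/history/
> >> shows it, though the exact path may differ.)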
> >>
> >> I switched on debug for slider and yarn and found the following in the
> >> slider.log of the slider AM:
> >>
> >> 2017-07-26 18:20:57,560 [AmExecutor-006] INFO  appmaster.SliderAppMaster - Registered service under /users/pentaho/services/org-apache-slider/presto; absolute path /registry/users/pentaho/services/org-apache-slider/presto
> >> 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ActionRegisterServiceInstance@35b17c06 name='ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
> >> 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Executing org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
> >> 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG appmaster.SliderAppMaster - in executeNodeReview(flexCluster)
> >> 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in reviewRequestAndReleaseNodes()
> >> 2017-07-26 18:20:57,566 [AmExecutor-006] INFO  state.AppState - Reviewing RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> >> 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
> >> 2017-07-26 18:20:57,566 [AmExecutor-006] INFO  state.AppState - COORDINATOR: Asking for 1 more nodes(s) for a total of 1
> >> 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There are 1 node(s) to consider for COORDINATOR
> >> 2017-07-26 18:20:57,567 [AmExecutor-006] INFO  state.OutstandingRequest - Submitting request for container on hadoop-worknode04.our.net
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - Reviewing RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - WORKER: Asking for 1 more nodes(s) for a total of 1
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There are 4 node(s) to consider for WORKER
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.OutstandingRequest - Submitting request for container on hadoop-worknode01.our.net
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
> >> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> >> 2017-07-26 18:20:57,570 [AmExecutor-006] INFO  util.RackResolver - Resolved hadoop-worknode04.our.net to /default-rack
> >> 2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=1
> >> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=hadoop-worknode04.our.net numContainers=1 #asks=1
> >> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=/default-rack numContainers=1 #asks=2
> >> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=* numContainers=1 #asks=3
> >> 2017-07-26 18:20:57,574 [AmExecutor-006] INFO  util.RackResolver - Resolved hadoop-worknode01.our.net to /default-rack
> >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=2
> >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=hadoop-worknode01.our.net numContainers=1 #asks=4
> >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=/default-rack numContainers=1 #asks=5
> >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=* numContainers=1 #asks=6
> >> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
> >> 2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO  state.AppState - app state clusterNodes {slider-appmaster={container_e26_1500976506429_0025_01_000001=container_e26_1500976506429_0025_01_000001: 3
> >> state: 3
> >> role: slider-appmaster
> >> host: hadoop-worknode04.our.net
> >> hostURL: http://hadoop-worknode04.our.net:52967
> >> }}
> >>
> >> To me this looks nice, but I never see a placement request at our yarn
> >> resource manager node. It looks like slider decides on its own that it
> >> cannot place the requests. But I cannot see why it cannot place the
> >> requests or why the requests cannot be satisfied.
> >>
> >> And now I am out of ideas ... do you have any suggestions how I can find
> >> out why the slider AM does not place the requests?
> >>
> >> Any ideas are welcome.
> >>
> >> By the way, the cluster has enough resources as the scheduler page shows:
> >>
> >>   Instantaneous Fair Share:    <memory:1441792, vCores:224>
> >>
> >> Thanks for any help in advance.
>
>
