Hi Frank,

> My understanding here: Our yarn cannot satisfy anti-affinity ... despite
> cloudera saying they made backports from even hadoop 3.0.0 ...
Anti-affinity is not implemented in YARN at this time; it's implemented
entirely in Slider using its role history, so this should not be an issue.

> they are all using a common path on the host
This may be a problem. Slider is not designed to support apps that store
data locally; app components might be brought up on any server. (See
http://slider.incubator.apache.org/docs/slider_specs/application_needs.html
for more information on Slider's expectations about apps.) You could work
around this somewhat by using strict placement (policy 1), so that
components are restarted on the same nodes where they were started
previously. But if your component also needs anti-affinity, Slider has no
way to combine anti-affinity and strict placement (first spread the
components out, then afterwards only restart them on the nodes where they
originally ran). I have wondered whether it would be possible to configure
an app with anti-affinity for the first start and then change it to strict,
but I have not attempted this.
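
In case it helps, here is a minimal sketch of what I mean by policy 1 in
resources.json. The component names and memory/vcore values are copied from
your AM log; the schema line and everything else is only illustrative, so
treat it as an untested assumption rather than a verified config:

    {
      "schema": "http://example.org/specification/v2.0.0",
      "metadata": {},
      "global": {},
      "components": {
        "slider-appmaster": {},
        "COORDINATOR": {
          "yarn.component.instances": "1",
          "yarn.component.placement.policy": "1",
          "yarn.memory": "64512",
          "yarn.vcores": "1"
        },
        "WORKER": {
          "yarn.component.instances": "1",
          "yarn.component.placement.policy": "1",
          "yarn.memory": "64512",
          "yarn.vcores": "1"
        }
      }
    }

Switching the policy value to "4" on both components is what requests
anti-affinity; as described above, Slider cannot give you both behaviours
at once.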

> To me this looks nice, but I never see a placement request at our yarn
> resource manager node. It looks like slider decides on its own that it
> cannot place the requests.
This is strange. You should see the container request in the RM. Slider
does not decide on its own; the "requests unsatisfiable" message only means
that a request has been sent to the RM and the RM has not allocated a
container.
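
One way to confirm whether the asks really reach the RM is to grep the
ResourceManager log for the application id and the requested host from your
AM log below (the application id is derived from the container id there).
The log path is only an example and differs per installation; on CDH it is
usually something under /var/log/hadoop-yarn/:

    # example paths and ids only -- adjust to your cluster
    grep application_1500976506429_0025 /var/log/hadoop-yarn/*RESOURCEMANAGER*
    grep hadoop-worknode04.our.net /var/log/hadoop-yarn/*RESOURCEMANAGER*

If the requests never show up on the RM side, the allocate call is not
getting through; if they do show up, the RM is simply not granting the
containers and the scheduler/queue configuration would be the next thing to
check.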


On Fri, Jul 28, 2017 at 7:43 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:

> Greetings to all ... and sorry for the long post ...
>
> I am trying to deploy presto on our hadoop cluster via slider. Slider
> really looks very
> promising for this task but somehow I am experiencing some glitches.
>
> I am using the following components:
>
>    - java version "1.8.0_131" (oracle jdk)
>    - Cloudera CDH 5.11.1
>    - slider-0.90.0
>    - presto-yarn https://github.com/prestodb/presto-yarn
>
> I can do:
>
>    1. slider package install --name PRESTO --package
> ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
>    2. slider create presto --template ./appConfig.json --resources
> ./resources.json
>
> The last statement sometimes succeeds in bringing up presto and
> everything looks fine, exports are right and I can access the presto
> coordinator by the exported host_port.
>
> Then when I stop the slider cluster and start it again, the slider AM
> comes up, and says
> that the placement requests are unsatisfiable by the cluster.
>
> I experimented with changing the placement policy
> "yarn.component.placement.policy".
>
> Setting: "yarn.component.placement.policy": "1"
>
> Result: This works only when there is no history file, and even then only
> sometimes, when the requested containers end up on different hosts.
> Sometimes slider can place the containers on different hosts and
> everything is fine; sometimes not, and then it fails with an unstable
> application, since it tries to start two or more presto components on one
> host ...
>
> My understanding here: presto would need anti-affinity, even between the
> different component types, i.e. the roles COORDINATOR and WORKER are never
> allowed to run on the same host ... they are all using a common path on
> the host, so only one presto component can run on one host ... but yarn is
> not able to guarantee that ...
>
> Setting: "yarn.component.placement.policy": "4"
>
> Result: The slider AM starts up and says:
>
>    Diagnostics: 2 anti-affinity components have with requests
> unsatisfiable by cluster
>
> My understanding here: Our yarn cannot satisfy anti-affinity ... despite
> cloudera saying they made backports from even hadoop 3.0.0 ... I don't
> know how to check that ...
>
> Then I had an idea, went back to
>
> Setting: "yarn.component.placement.policy": "1"
>
> This fails at first. Then I edited the history.json file to place each
> component on a different host and hoped that would fix it. But even here
> the slider AM says that the requests are unsatisfiable by the cluster.
>
> I switched on debug for slider and yarn and found the following in the
> slider.log of the slider AM:
>
> 2017-07-26 18:20:57,560 [AmExecutor-006] INFO  appmaster.SliderAppMaster -
> Registered service under /users/pentaho/services/org-apache-slider/presto;
> absolute path /registry/users/pentaho/services/org-apache-slider/presto
> 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor -
> Completed org.apache.slider.server.appmaster.actions.
> ActionRegisterServiceInstance@35b17c06 name='
> ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
> 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor -
> Executing org.apache.slider.server.appmaster.actions.
> ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0,
> attrs=4, sequenceNumber=6}
> 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG appmaster.SliderAppMaster -
> in executeNodeReview(flexCluster)
> 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in
> reviewRequestAndReleaseNodes()
> 2017-07-26 18:20:57,566 [AmExecutor-006] INFO  state.AppState - Reviewing
> RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1,
> actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0,
> completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0,
> limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>,
> isAntiAffinePlacement=false, failureMessage='', 
> providerRole=ProviderRole{name='COORDINATOR',
> group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3,
> placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected
> 1, Delta: 1
> 2017-07-26 18:20:57,566 [AmExecutor-006] INFO  state.AppState -
> COORDINATOR: Asking for 1 more nodes(s) for a total of 1
> 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There
> are 1 node(s) to consider for COORDINATOR
> 2017-07-26 18:20:57,567 [AmExecutor-006] INFO  state.OutstandingRequest -
> Submitting request for container on hadoop-worknode04.our.net
> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - Container
> ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations
> scheduled: 1; updated role: RoleStatus{name='COORDINATOR',
> group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0,
> failed=0, startFailed=0, started=0, completed=0, totalRequested=1,
> preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0,
> resourceRequirements=<memory:64512, vCores:1>,
> isAntiAffinePlacement=false, failureMessage='', 
> providerRole=ProviderRole{name='COORDINATOR',
> group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3,
> placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - Reviewing
> RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0,
> requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0,
> totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0,
> limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>,
> isAntiAffinePlacement=false, failureMessage='', 
> providerRole=ProviderRole{name='WORKER',
> group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3,
> placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected
> 1, Delta: 1
> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - WORKER:
> Asking for 1 more nodes(s) for a total of 1
> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There
> are 4 node(s) to consider for WORKER
> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.OutstandingRequest -
> Submitting request for container on hadoop-worknode01.our.net
> 2017-07-26 18:20:57,569 [AmExecutor-006] INFO  state.AppState - Container
> ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
> 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations
> scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER, key=2,
> desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0,
> started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0,
> failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512,
> vCores:1>, isAntiAffinePlacement=false, failureMessage='',
> providerRole=ProviderRole{name='WORKER', group=WORKER, id=2,
> placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300,
> labelExpression='null'}, failedContainers=[]}
> 2017-07-26 18:20:57,570 [AmExecutor-006] INFO  util.RackResolver -
> Resolved hadoop-worknode04.our.net to /default-rack
> 2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added
> priority=1
> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
> addResourceRequest: applicationId= priority=1 resourceName=hadoop-
> worknode04.our.net numContainers=1 #asks=1
> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
> addResourceRequest: applicationId= priority=1 resourceName=/default-rack
> numContainers=1 #asks=2
> 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
> addResourceRequest: applicationId= priority=1 resourceName=*
> numContainers=1 #asks=3
> 2017-07-26 18:20:57,574 [AmExecutor-006] INFO  util.RackResolver -
> Resolved hadoop-worknode01.our.net to /default-rack
> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added
> priority=2
> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
> addResourceRequest: applicationId= priority=2 resourceName=hadoop-
> worknode01.our.net numContainers=1 #asks=4
> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
> addResourceRequest: applicationId= priority=2 resourceName=/default-rack
> numContainers=1 #asks=5
> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl -
> addResourceRequest: applicationId= priority=2 resourceName=*
> numContainers=1 #asks=6
> 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor -
> Completed org.apache.slider.server.appmaster.actions.
> ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0,
> attrs=4, sequenceNumber=6}
> 2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO
> state.AppState - app state clusterNodes {slider-appmaster={container_
> e26_1500976506429_0025_01_000001=container_e26_1500976506429_0025_01_000001:
> 3
> state: 3
> role: slider-appmaster
> host: hadoop-worknode04.our.net
> hostURL: http://hadoop-worknode04.our.net:52967
> }}
>
> To me this looks nice, but I never see a placement request at our yarn
> resource manager node. It looks like slider decides on its own that it
> cannot place the requests. But I cannot see why it cannot place the
> requests or why the requests cannot be satisfied.
>
> And now I am out of ideas ... do you have any suggestions on how I can
> find out why the slider AM does not place the requests?
>
> Any ideas are welcome.
>
> By the way, the cluster has enough resources as the scheduler page shows:
>
>    Instantaneous Fair Share:    <memory:1441792, vCores:224>
>
> Thanks for any help in advance.
