Greetings to all ... and sorry for the long post ... I am trying to deploy Presto on our Hadoop cluster via Slider. Slider looks very promising for this task, but I am running into some glitches.
I am using the following components:
- java version "1.8.0_131" (Oracle JDK)
- Cloudera CDH 5.11.1
- slider-0.90.0
- presto-yarn https://github.com/prestodb/presto-yarn

I can do:
1. slider package install --name PRESTO --package ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
2. slider create presto --template ./appConfig.json --resources ./resources.json

The second command sometimes succeeds in bringing up Presto, and then everything looks fine: the exports are right and I can reach the Presto coordinator at the exported host_port. But when I stop the Slider cluster and start it again, the Slider AM comes up and says that the placement requests are unsatisfiable by the cluster.

So I experimented with the placement policy yarn.component.placement.policy.

Setting: "yarn.component.placement.policy": "1"
Result: This works only when there is no history file, and even then only sometimes, namely when the requested containers happen to land on different hosts. Sometimes Slider places the containers on different hosts and everything is fine; sometimes it does not, and the application fails as unstable because it tries to start two or more Presto components on one host.
My understanding here: Presto needs anti-affinity, even between the different component types, i.e. the COORDINATOR and WORKER roles must never run on the same host. They all use a common path on the host, so only one Presto component can run per host, and with this policy YARN cannot guarantee that.

Setting: "yarn.component.placement.policy": "4"
Result: The Slider AM starts up and says:
Diagnostics: 2 anti-affinity components have with requests unsatisfiable by cluster
My understanding here: Our YARN cannot satisfy anti-affinity, despite Cloudera saying they made backports even from Hadoop 3.0.0. I don't know how to verify that.

Then I had an idea and went back to:
Setting: "yarn.component.placement.policy": "1"
This fails at first. Then I edited the history.json file to place each component on a different host and hoped that would fix it. But even here the Slider AM says that the requests are unsatisfiable by the cluster.
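For reference, the relevant part of my resources.json looks roughly like this when I try the anti-affinity policy (a trimmed sketch: the priorities, instance counts, and memory/vcores are the values that show up in the log below; the surrounding boilerplate follows the presto-yarn sample config):

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {},
  "components": {
    "slider-appmaster": {},
    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "4",
      "yarn.memory": "64512",
      "yarn.vcores": "1"
    },
    "WORKER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "4",
      "yarn.memory": "64512",
      "yarn.vcores": "1"
    }
  }
}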
I switched on debug for Slider and YARN and found the following in the slider.log of the Slider AM:

2017-07-26 18:20:57,560 [AmExecutor-006] INFO appmaster.SliderAppMaster - Registered service under /users/pentaho/services/org-apache-slider/presto; absolute path /registry/users/pentaho/services/org-apache-slider/presto
2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ActionRegisterServiceInstance@35b17c06 name='ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Executing org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG appmaster.SliderAppMaster - in executeNodeReview(flexCluster)
2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in reviewRequestAndReleaseNodes()
2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - COORDINATOR: Asking for 1 more nodes(s) for a total of 1
2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There are 1 node(s) to consider for COORDINATOR
2017-07-26 18:20:57,567 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode04.our.net
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - WORKER: Asking for 1 more nodes(s) for a total of 1
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There are 4 node(s) to consider for WORKER
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode01.our.net
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
2017-07-26 18:20:57,570 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode04.our.net to /default-rack
2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=1
2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=hadoop-worknode04.our.net numContainers=1 #asks=1
2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=/default-rack numContainers=1 #asks=2
2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=* numContainers=1 #asks=3
2017-07-26 18:20:57,574 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode01.our.net to /default-rack
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=2
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=hadoop-worknode01.our.net numContainers=1 #asks=4
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=/default-rack numContainers=1 #asks=5
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=* numContainers=1 #asks=6
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO state.AppState - app state clusterNodes {slider-appmaster={container_e26_1500976506429_0025_01_000001=container_e26_1500976506429_0025_01_000001: 3 state: 3 role: slider-appmaster host: hadoop-worknode04.our.net hostURL: http://hadoop-worknode04.our.net:52967 }}

To me this looks fine, but I never see a placement request arrive at our YARN ResourceManager node. It looks like Slider decides on its own that it cannot place the requests, but I cannot see why it cannot place them or why they should be unsatisfiable. Now I am out of ideas ... do you have any suggestions for how I can find out why the Slider AM does not place the requests? Any ideas are welcome.

By the way, the cluster has enough resources, as the scheduler page shows:

Instantaneous Fair Share: <memory:1441792, vCores:224>

Thanks for any help in advance.
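P.S.: In case it helps with the diagnosis, this is how I have been checking per-node capacity with the stock YARN CLI (each container here asks for 64512 MB on a single node, so the per-node NodeManager limit matters, not just the cluster-wide fair share; the node ID with its port is copied from the -list output, 8041 being the NodeManager port on our CDH install):

# list all NodeManagers and their IDs (the four worknodes should show up here)
yarn node -list -all

# show total and used memory/vcores of one node
yarn node -status hadoop-worknode01.our.net:8041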