Greetings to all ... and sorry for the long post ... I am trying to deploy Presto on our Hadoop cluster via Slider. Slider looks very promising for this task, but I am running into some glitches.
I am using the following components:
- java version "1.8.0_131" (Oracle JDK)
- Cloudera CDH 5.11.1
- slider-0.90.0
- presto-yarn https://github.com/prestodb/presto-yarn

I can do:
1. slider package install --name PRESTO --package ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
2. slider create presto --template ./appConfig.json --resources ./resources.json

The second command sometimes succeeds in bringing up Presto, and then everything looks fine: the exports are right and I can reach the Presto coordinator at the exported host_port. But when I stop the Slider cluster and start it again, the Slider AM comes up and says that the placement requests are unsatisfiable by the cluster.

So I experimented with the placement policy yarn.component.placement.policy.

Setting: "yarn.component.placement.policy": "1"
Result: This works only when there is no history file, and even then only sometimes, namely when the requested containers happen to land on different hosts. Sometimes Slider places the containers on different hosts and everything is fine; sometimes it does not, and the application fails as unstable because it tries to start two or more Presto components on one host.
My understanding here: Presto needs anti-affinity, even between the different component types, i.e. the COORDINATOR and WORKER roles must never run on the same host. They all use a common path on the host, so only one Presto component can run per host, and with this policy YARN cannot guarantee that.

Setting: "yarn.component.placement.policy": "4"
Result: The Slider AM starts up and says:
Diagnostics: 2 anti-affinity components have with requests unsatisfiable by cluster
My understanding here: Our YARN cannot satisfy anti-affinity, despite Cloudera saying they made backports even from Hadoop 3.0.0. I don't know how to verify that.

Then I had an idea and went back to:
Setting: "yarn.component.placement.policy": "1"
This fails at first. Then I edited the history.json file to place each component on a different host and hoped that would fix it. But even here the Slider AM says that the requests are unsatisfiable by the cluster.
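For reference, the relevant part of my resources.json looks roughly like this when I try the anti-affinity policy (a trimmed sketch: the priorities, instance counts, and memory/vcores are the values that show up in the log below; the surrounding boilerplate follows the presto-yarn sample config):

{
  "schema": "http://example.org/specification/v2.0.0",
  "metadata": {},
  "global": {},
  "components": {
    "slider-appmaster": {},
    "COORDINATOR": {
      "yarn.role.priority": "1",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "4",
      "yarn.memory": "64512",
      "yarn.vcores": "1"
    },
    "WORKER": {
      "yarn.role.priority": "2",
      "yarn.component.instances": "1",
      "yarn.component.placement.policy": "4",
      "yarn.memory": "64512",
      "yarn.vcores": "1"
    }
  }
}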
I switched on debug for Slider and YARN and found the following in the slider.log of the Slider AM:

2017-07-26 18:20:57,560 [AmExecutor-006] INFO appmaster.SliderAppMaster - Registered service under /users/pentaho/services/org-apache-slider/presto; absolute path /registry/users/pentaho/services/org-apache-slider/presto
2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ActionRegisterServiceInstance@35b17c06 name='ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Executing org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG appmaster.SliderAppMaster - in executeNodeReview(flexCluster)
2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in reviewRequestAndReleaseNodes()
2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - COORDINATOR: Asking for 1 more nodes(s) for a total of 1
2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There are 1 node(s) to consider for COORDINATOR
2017-07-26 18:20:57,567 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode04.our.net
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - WORKER: Asking for 1 more nodes(s) for a total of 1
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There are 4 node(s) to consider for WORKER
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode01.our.net
2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
2017-07-26 18:20:57,570 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode04.our.net to /default-rack
2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=1
2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=hadoop-worknode04.our.net numContainers=1 #asks=1
2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=/default-rack numContainers=1 #asks=2
2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=* numContainers=1 #asks=3
2017-07-26 18:20:57,574 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode01.our.net to /default-rack
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=2
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=hadoop-worknode01.our.net numContainers=1 #asks=4
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=/default-rack numContainers=1 #asks=5
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=* numContainers=1 #asks=6
2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO state.AppState - app state clusterNodes {slider-appmaster={container_e26_1500976506429_0025_01_000001=container_e26_1500976506429_0025_01_000001: 3 state: 3 role: slider-appmaster host: hadoop-worknode04.our.net hostURL: http://hadoop-worknode04.our.net:52967 }}

To me this looks fine, but I never see a placement request arrive at our YARN ResourceManager node. It looks like Slider decides on its own that it cannot place the requests, but I cannot see why it cannot place them or why they should be unsatisfiable. Now I am out of ideas ... do you have any suggestions for how I can find out why the Slider AM does not place the requests? Any ideas are welcome.

By the way, the cluster has enough resources, as the scheduler page shows:

Instantaneous Fair Share: <memory:1441792, vCores:224>

Thanks for any help in advance.
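P.S.: In case it helps with the diagnosis, this is how I have been checking per-node capacity with the stock YARN CLI (each container here asks for 64512 MB on a single node, so the per-node NodeManager limit matters, not just the cluster-wide fair share; the node ID with its port is copied from the -list output, 8041 being the NodeManager port on our CDH install):

# list all NodeManagers and their IDs (the four worknodes should show up here)
yarn node -list -all

# show total and used memory/vcores of one node
yarn node -status hadoop-worknode01.our.net:8041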