I don't think it would matter which scheduler you are using.
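(If you want to double-check which scheduler the RM is actually running: it is whatever yarn.resourcemanager.scheduler.class points to in yarn-site.xml. A quick way to look, assuming the usual CDH client config path — adjust the path for your layout:)

    # Prints the configured scheduler class, e.g. ...scheduler.fair.FairScheduler
    grep -A 1 'yarn.resourcemanager.scheduler.class' /etc/hadoop/conf/yarn-site.xml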
On Tue, Aug 1, 2017 at 1:16 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:

> Does slider only work with the capacity scheduler? We are using the fair
> scheduler …
>
> --
>
> On 31.07.17 at 18:58, "Billie Rinaldi" <billie.rina...@gmail.com> wrote:
>
> > It looks like it's not very obvious from the INFO log, so maybe changing
> > the RM to DEBUG would help. Below is what my RM log looks like between an
> > app transitioning to RUNNING and having its first container allocated, and
> > I don't see a specific log line about a request being received. Still,
> > being able to start an app the first time and not being able to restart it
> > is strange. I would check whether the app processes are actually getting
> > stopped (maybe if they are still using resources for some reason, that
> > would interfere with new allocations on the same hosts). Another thing I am
> > wondering is whether the host names that Slider is using for requests match
> > the host names the RM is using for NMs. If Slider requested a host that the
> > RM didn't know about, that could cause containers not to be allocated.
> >
> > 2017-07-28 15:43:36,219 INFO rmapp.RMAppImpl: application_1501255339386_0002 State change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
> > 2017-07-28 15:43:36,865 INFO allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1501255339386_0002_000001 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@4b10a7b0 clusterResource=<memory:8192, vCores:8> type=OFF_SWITCH requestedPartition=
> > 2017-07-28 15:43:36,866 INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:1024, vCores:1> cluster=<memory:8192, vCores:8>
> > 2017-07-28 15:43:36,866 INFO rmcontainer.RMContainerImpl: container_e02_1501255339386_0002_01_000002 Container Transitioned from NEW to ALLOCATED
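> > (Two quick ways to do the checks I mentioned, assuming the stock YARN
> > CLI; <rm-host> is a placeholder and 8088 is the default RM web port:)
> >
> >     # Switch the RM's resourcemanager loggers to DEBUG without a restart
> >     yarn daemonlog -setlevel <rm-host>:8088 org.apache.hadoop.yarn.server.resourcemanager DEBUG
> >
> >     # List every NM the RM knows about, with the exact host names it registered
> >     yarn node -list -all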
> > On Mon, Jul 31, 2017 at 9:18 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:
> >
> > > Hi Billie,
> > >
> > > do you have any ideas how I can find out why the yarn resource manager
> > > is denying the requests … do I need to set the yarn log to debug mode to
> > > see such requests? In my opinion a denied request should result in a
> > > warning or at least in an info … in the yarn log I can see some denied
> > > reservations, but never for requests from the slider AM …
> > >
> > > I will ask our IT to set the debug flag for yarn and try to find out why
> > > yarn says no to the placement. My feeling is that yarn does not allow
> > > placements on hosts … but then I am really lost … ;)
> > >
> > > On 31.07.2017 at 17:59, Billie Rinaldi <billie.rina...@gmail.com> wrote:
> > >
> > > > Hi Frank,
> > > >
> > > > > My understanding here: Our yarn cannot satisfy anti-affinity ...
> > > > > despite cloudera saying they made backports from even hadoop 3.0.0 ...
> > > > Anti-affinity is not implemented in YARN at this time; it's implemented
> > > > entirely in Slider using its role history, so this should not be an issue.
> > > >
> > > > > they are all using a common path on the host
> > > > This may be a problem. Slider is not designed to support apps that store
> > > > data locally; app components might be brought up on any server. (See
> > > > http://slider.incubator.apache.org/docs/slider_specs/application_needs.html
> > > > for more information on Slider's expectations about apps.) You could get
> > > > around this somewhat by using strict placement (policy 1), so that
> > > > components will be restarted on the same nodes where they were started
> > > > previously. But if your component also needs anti-affinity, Slider does
> > > > not have a way to combine anti-affinity and strict placement (first
> > > > spread the components out, then after that only start them on the same
> > > > nodes where they were started previously). I have speculated about
> > > > whether it would be possible to configure an app with anti-affinity for
> > > > the first start and then change it to strict, but I have not attempted
> > > > this.
> > > >
> > > > > To me this looks nice, but I never see a placement request at our yarn
> > > > > resource manager node. It looks like slider does decide on its own,
> > > > > that it cannot place the requests.
> > > > This is strange. You should see the container request in the RM. Slider
> > > > does not decide on its own; the "requests unsatisfiable" message only
> > > > means that a request has been sent to the RM and the RM has not
> > > > allocated a container.
> > > >
> > > > On Fri, Jul 28, 2017 at 7:43 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:
> > > >
> > > > > Greetings to all ... and sorry for the long post ...
> > > > >
> > > > > I am trying to deploy presto on our hadoop cluster via slider. Slider
> > > > > really looks very promising for this task, but somehow I am
> > > > > experiencing some glitches.
> > > > >
> > > > > I am using the following components:
> > > > >
> > > > > - java version "1.8.0_131" (oracle jdk)
> > > > > - Cloudera CDH 5.11.1
> > > > > - slider-0.90.0
> > > > > - presto-yarn https://github.com/prestodb/presto-yarn
> > > > >
> > > > > I can do:
> > > > >
> > > > > 1. slider package install --name PRESTO --package ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
> > > > > 2. slider create presto --template ./appConfig.json --resources ./resources.json
> > > > >
> > > > > The last statement sometimes succeeds in bringing presto up, and then
> > > > > everything looks fine: the exports are right and I can reach the
> > > > > presto coordinator at the exported host_port.
> > > > >
> > > > > But when I stop the slider cluster and start it again, the slider AM
> > > > > comes up and says that the placement requests are unsatisfiable by
> > > > > the cluster.
> > > > >
> > > > > I experimented with the placement policy
> > > > > "yarn.component.placement.policy" (see the resources.json excerpt
> > > > > below).
> > > > >
> > > > > Setting: "yarn.component.placement.policy": "1"
> > > > >
> > > > > Result: This works only when there is no history file, and even then
> > > > > only sometimes, when the requested containers land on different
> > > > > hosts. Sometimes slider can place the containers on different hosts
> > > > > and everything is fine, sometimes not; then it fails with an unstable
> > > > > application, since it tries to start two or more presto components on
> > > > > one host ...
> > > > >
> > > > > My understanding here: presto would need anti-affinity, even between
> > > > > the different component types, i.e. the roles COORDINATOR and WORKER
> > > > > are never allowed to run on the same host ... they are all using a
> > > > > common path on the host, so only one presto component can run on one
> > > > > host ... but yarn is not able to guarantee that ...
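> > > > > (For reference, the per-component part of my resources.json looks
> > > > > roughly like this; the priorities and memory match the container asks
> > > > > in the log further down, everything else is trimmed:)
> > > > >
> > > > >     "components": {
> > > > >       "COORDINATOR": {
> > > > >         "yarn.role.priority": "1",
> > > > >         "yarn.component.instances": "1",
> > > > >         "yarn.memory": "64512",
> > > > >         "yarn.component.placement.policy": "1"
> > > > >       },
> > > > >       "WORKER": {
> > > > >         "yarn.role.priority": "2",
> > > > >         "yarn.component.instances": "1",
> > > > >         "yarn.memory": "64512",
> > > > >         "yarn.component.placement.policy": "1"
> > > > >       }
> > > > >     }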
> > > > > Setting: "yarn.component.placement.policy": "4"
> > > > >
> > > > > Result: The slider AM starts up and says:
> > > > >
> > > > >     Diagnostics: 2 anti-affinity components have with requests unsatisfiable by cluster
> > > > >
> > > > > My understanding here: Our yarn cannot satisfy anti-affinity ...
> > > > > despite cloudera saying they made backports from even hadoop 3.0.0 ...
> > > > > I don't know how to check that ...
> > > > >
> > > > > Then I had an idea and went back to
> > > > >
> > > > > Setting: "yarn.component.placement.policy": "1"
> > > > >
> > > > > This fails at first. Then I edited the history.json file to place
> > > > > each component on a different host and hoped that would fix it. But
> > > > > even here the slider AM says that the requests are unsatisfiable by
> > > > > the cluster.
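> > > > > (The restart cycle I am testing each time is just the following; on
> > > > > this slider version the older freeze/thaw commands should be
> > > > > equivalent:)
> > > > >
> > > > >     slider stop presto
> > > > >     slider start presto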
> > > > > I switched on debug for slider and yarn and found the following in
> > > > > the slider.log of the slider AM:
> > > > >
> > > > > 2017-07-26 18:20:57,560 [AmExecutor-006] INFO appmaster.SliderAppMaster - Registered service under /users/pentaho/services/org-apache-slider/presto; absolute path /registry/users/pentaho/services/org-apache-slider/presto
> > > > > 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ActionRegisterServiceInstance@35b17c06 name='ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
> > > > > 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Executing org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
> > > > > 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG appmaster.SliderAppMaster - in executeNodeReview(flexCluster)
> > > > > 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in reviewRequestAndReleaseNodes()
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - COORDINATOR: Asking for 1 more nodes(s) for a total of 1
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There are 1 node(s) to consider for COORDINATOR
> > > > > 2017-07-26 18:20:57,567 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode04.our.net
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - WORKER: Asking for 1 more nodes(s) for a total of 1
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There are 4 node(s) to consider for WORKER
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode01.our.net
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> > > > > 2017-07-26 18:20:57,570 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode04.our.net to /default-rack
> > > > > 2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=1
> > > > > 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=hadoop-worknode04.our.net numContainers=1 #asks=1
> > > > > 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=/default-rack numContainers=1 #asks=2
> > > > > 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=* numContainers=1 #asks=3
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode01.our.net to /default-rack
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=2
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=hadoop-worknode01.our.net numContainers=1 #asks=4
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=/default-rack numContainers=1 #asks=5
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=* numContainers=1 #asks=6
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
> > > > > 2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO state.AppState - app state clusterNodes {slider-appmaster={container_e26_1500976506429_0025_01_000001=container_e26_1500976506429_0025_01_000001: 3
> > > > > state: 3
> > > > > role: slider-appmaster
> > > > > host: hadoop-worknode04.our.net
> > > > > hostURL: http://hadoop-worknode04.our.net:52967
> > > > > }}
> > > > >
> > > > > To me this looks nice, but I never see a placement request at our
> > > > > yarn resource manager node. It looks like slider decides on its own
> > > > > that it cannot place the requests. But I cannot see why it cannot
> > > > > place the requests or why the requests cannot be satisfied.
> > > > >
> > > > > And now I am out of ideas ... do you have any suggestions for how I
> > > > > can find out why the slider AM does not place the requests?
> > > > >
> > > > > Any ideas are welcome.
> > > > >
> > > > > By the way, the cluster has enough resources, as the scheduler page
> > > > > shows:
> > > > >
> > > > >     Instantaneous Fair Share: <memory:1441792, vCores:224>
> > > > >
> > > > > Thanks for any help in advance.
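> > > > > P.S. One more thing I still want to rule out: whether every NM
> > > > > really has headroom for a single <memory:64512, vCores:1> container.
> > > > > Something like this should show it (the node id is just an example
> > > > > from our cluster; the NM port may differ):
> > > > >
> > > > >     # Reports the node's total and used memory/vcores as the RM sees them
> > > > >     yarn node -status hadoop-worknode01.our.net:8041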