I don't think it would matter which scheduler you are using.
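(If you want to double-check which scheduler the RM is actually running: it is whatever yarn.resourcemanager.scheduler.class points to in yarn-site.xml. A quick way to look, assuming the usual CDH client config path — adjust the path for your layout:)

    # Prints the configured scheduler class, e.g. ...scheduler.fair.FairScheduler
    grep -A 1 'yarn.resourcemanager.scheduler.class' /etc/hadoop/conf/yarn-site.xml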
On Tue, Aug 1, 2017 at 1:16 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:

> Does slider only work with the capacity scheduler? We are using the fair
> scheduler …
>
> --
>
> On 31.07.17 at 18:58, "Billie Rinaldi" <billie.rina...@gmail.com> wrote:
>
> > It looks like it's not very obvious from the INFO log, so maybe changing
> > the RM to DEBUG would help. Below is what my RM log looks like between an
> > app transitioning to RUNNING and having its first container allocated, and
> > I don't see a specific log line about a request being received. Still,
> > being able to start an app the first time and not being able to restart it
> > is strange. I would check whether the app processes are actually getting
> > stopped (maybe if they are still using resources for some reason, that
> > would interfere with new allocations on the same hosts). Another thing I am
> > wondering is whether the host names that Slider is using for requests match
> > the host names the RM is using for NMs. If Slider requested a host that the
> > RM didn't know about, that could cause containers not to be allocated.
> >
> > 2017-07-28 15:43:36,219 INFO rmapp.RMAppImpl: application_1501255339386_0002 State change from ACCEPTED to RUNNING on event = ATTEMPT_REGISTERED
> > 2017-07-28 15:43:36,865 INFO allocator.AbstractContainerAllocator: assignedContainer application attempt=appattempt_1501255339386_0002_000001 container=null queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@4b10a7b0 clusterResource=<memory:8192, vCores:8> type=OFF_SWITCH requestedPartition=
> > 2017-07-28 15:43:36,866 INFO capacity.ParentQueue: assignedContainer queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125 used=<memory:1024, vCores:1> cluster=<memory:8192, vCores:8>
> > 2017-07-28 15:43:36,866 INFO rmcontainer.RMContainerImpl: container_e02_1501255339386_0002_01_000002 Container Transitioned from NEW to ALLOCATED
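> > (Two quick ways to do the checks I mentioned, assuming the stock YARN
> > CLI; <rm-host> is a placeholder and 8088 is the default RM web port:)
> >
> >     # Switch the RM's resourcemanager loggers to DEBUG without a restart
> >     yarn daemonlog -setlevel <rm-host>:8088 org.apache.hadoop.yarn.server.resourcemanager DEBUG
> >
> >     # List every NM the RM knows about, with the exact host names it registered
> >     yarn node -list -all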
> > On Mon, Jul 31, 2017 at 9:18 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:
> >
> > > Hi Billie,
> > >
> > > do you have any ideas how I can find out why the yarn resource manager
> > > is denying the requests … do I need to set the yarn log to debug mode to
> > > see such requests? In my opinion a denied request should result in a
> > > warning or at least in an info … in the yarn log I can see some denied
> > > reservations, but never for requests from the slider AM …
> > >
> > > I will ask our IT to set the debug flag for yarn and try to find out why
> > > yarn says no to the placement. My feeling is that yarn does not allow
> > > placements on hosts … but then I am really lost … ;)
> > >
> > > On 31.07.2017 at 17:59, Billie Rinaldi <billie.rina...@gmail.com> wrote:
> > >
> > > > Hi Frank,
> > > >
> > > > > My understanding here: Our yarn cannot satisfy anti-affinity ...
> > > > > despite cloudera saying they made backports from even hadoop 3.0.0 ...
> > > > Anti-affinity is not implemented in YARN at this time; it's implemented
> > > > entirely in Slider using its role history, so this should not be an issue.
> > > >
> > > > > they are all using a common path on the host
> > > > This may be a problem. Slider is not designed to support apps that store
> > > > data locally; app components might be brought up on any server. (See
> > > > http://slider.incubator.apache.org/docs/slider_specs/application_needs.html
> > > > for more information on Slider's expectations about apps.) You could get
> > > > around this somewhat by using strict placement (policy 1), so that
> > > > components will be restarted on the same nodes where they were started
> > > > previously. But if your component also needs anti-affinity, Slider does
> > > > not have a way to combine anti-affinity and strict placement (first
> > > > spread the components out, then after that only start them on the same
> > > > nodes where they were started previously). I have speculated about
> > > > whether it would be possible to configure an app with anti-affinity for
> > > > the first start and then change it to strict, but I have not attempted
> > > > this.
> > > >
> > > > > To me this looks nice, but I never see a placement request at our yarn
> > > > > resource manager node. It looks like slider does decide on its own,
> > > > > that it cannot place the requests.
> > > > This is strange. You should see the container request in the RM. Slider
> > > > does not decide on its own; the "requests unsatisfiable" message only
> > > > means that a request has been sent to the RM and the RM has not
> > > > allocated a container.
> > > >
> > > > On Fri, Jul 28, 2017 at 7:43 AM, Frank Kemmer <frank.kem...@1und1.de> wrote:
> > > >
> > > > > Greetings to all ... and sorry for the long post ...
> > > > >
> > > > > I am trying to deploy presto on our hadoop cluster via slider. Slider
> > > > > really looks very promising for this task, but somehow I am
> > > > > experiencing some glitches.
> > > > >
> > > > > I am using the following components:
> > > > >
> > > > > - java version "1.8.0_131" (oracle jdk)
> > > > > - Cloudera CDH 5.11.1
> > > > > - slider-0.90.0
> > > > > - presto-yarn https://github.com/prestodb/presto-yarn
> > > > >
> > > > > I can do:
> > > > >
> > > > > 1. slider package install --name PRESTO --package ./presto-yarn-package-1.6-SNAPSHOT-0.167.zip
> > > > > 2. slider create presto --template ./appConfig.json --resources ./resources.json
> > > > >
> > > > > The last statement sometimes succeeds in bringing presto up, and then
> > > > > everything looks fine: the exports are right and I can reach the
> > > > > presto coordinator at the exported host_port.
> > > > >
> > > > > But when I stop the slider cluster and start it again, the slider AM
> > > > > comes up and says that the placement requests are unsatisfiable by
> > > > > the cluster.
> > > > >
> > > > > I experimented with the placement policy
> > > > > "yarn.component.placement.policy" (see the resources.json excerpt
> > > > > below).
> > > > >
> > > > > Setting: "yarn.component.placement.policy": "1"
> > > > >
> > > > > Result: This works only when there is no history file, and even then
> > > > > only sometimes, when the requested containers land on different
> > > > > hosts. Sometimes slider can place the containers on different hosts
> > > > > and everything is fine, sometimes not; then it fails with an unstable
> > > > > application, since it tries to start two or more presto components on
> > > > > one host ...
> > > > >
> > > > > My understanding here: presto would need anti-affinity, even between
> > > > > the different component types, i.e. the roles COORDINATOR and WORKER
> > > > > are never allowed to run on the same host ... they are all using a
> > > > > common path on the host, so only one presto component can run on one
> > > > > host ... but yarn is not able to guarantee that ...
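> > > > > (For reference, the per-component part of my resources.json looks
> > > > > roughly like this; the priorities and memory match the container asks
> > > > > in the log further down, everything else is trimmed:)
> > > > >
> > > > >     "components": {
> > > > >       "COORDINATOR": {
> > > > >         "yarn.role.priority": "1",
> > > > >         "yarn.component.instances": "1",
> > > > >         "yarn.memory": "64512",
> > > > >         "yarn.component.placement.policy": "1"
> > > > >       },
> > > > >       "WORKER": {
> > > > >         "yarn.role.priority": "2",
> > > > >         "yarn.component.instances": "1",
> > > > >         "yarn.memory": "64512",
> > > > >         "yarn.component.placement.policy": "1"
> > > > >       }
> > > > >     }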
> > > > > Setting: "yarn.component.placement.policy": "4"
> > > > >
> > > > > Result: The slider AM starts up and says:
> > > > >
> > > > >     Diagnostics: 2 anti-affinity components have with requests unsatisfiable by cluster
> > > > >
> > > > > My understanding here: Our yarn cannot satisfy anti-affinity ...
> > > > > despite cloudera saying they made backports from even hadoop 3.0.0 ...
> > > > > I don't know how to check that ...
> > > > >
> > > > > Then I had an idea and went back to
> > > > >
> > > > > Setting: "yarn.component.placement.policy": "1"
> > > > >
> > > > > This fails at first. Then I edited the history.json file to place
> > > > > each component on a different host and hoped that would fix it. But
> > > > > even here the slider AM says that the requests are unsatisfiable by
> > > > > the cluster.
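> > > > > (The restart cycle I am testing each time is just the following; on
> > > > > this slider version the older freeze/thaw commands should be
> > > > > equivalent:)
> > > > >
> > > > >     slider stop presto
> > > > >     slider start presto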
> > > > > I switched on debug for slider and yarn and found the following in
> > > > > the slider.log of the slider AM:
> > > > >
> > > > > 2017-07-26 18:20:57,560 [AmExecutor-006] INFO appmaster.SliderAppMaster - Registered service under /users/pentaho/services/org-apache-slider/presto; absolute path /registry/users/pentaho/services/org-apache-slider/presto
> > > > > 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ActionRegisterServiceInstance@35b17c06 name='ActionRegisterServiceInstance', delay=0, attrs=0, sequenceNumber=5}
> > > > > 2017-07-26 18:20:57,564 [AmExecutor-006] DEBUG actions.QueueExecutor - Executing org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
> > > > > 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG appmaster.SliderAppMaster - in executeNodeReview(flexCluster)
> > > > > 2017-07-26 18:20:57,565 [AmExecutor-006] DEBUG state.AppState - in reviewRequestAndReleaseNodes()
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] INFO state.AppState - COORDINATOR: Asking for 1 more nodes(s) for a total of 1
> > > > > 2017-07-26 18:20:57,566 [AmExecutor-006] DEBUG state.RoleHistory - There are 1 node(s) to consider for COORDINATOR
> > > > > 2017-07-26 18:20:57,567 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode04.our.net
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[1] and label = null
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='COORDINATOR', group=COORDINATOR, key=1, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='COORDINATOR', group=COORDINATOR, id=1, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Reviewing RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=0, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=0, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]} :
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - Expected 1, Delta: 1
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - WORKER: Asking for 1 more nodes(s) for a total of 1
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.RoleHistory - There are 4 node(s) to consider for WORKER
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.OutstandingRequest - Submitting request for container on hadoop-worknode01.our.net
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] INFO state.AppState - Container ask is Capability[<memory:64512, vCores:1>]Priority[2] and label = null
> > > > > 2017-07-26 18:20:57,569 [AmExecutor-006] DEBUG state.AppState - operations scheduled: 1; updated role: RoleStatus{name='WORKER', group=WORKER, key=2, desired=1, actual=0, requested=1, releasing=0, failed=0, startFailed=0, started=0, completed=0, totalRequested=1, preempted=0, nodeFailed=0, failedRecently=0, limitsExceeded=0, resourceRequirements=<memory:64512, vCores:1>, isAntiAffinePlacement=false, failureMessage='', providerRole=ProviderRole{name='WORKER', group=WORKER, id=2, placementPolicy=1, nodeFailureThreshold=3, placementTimeoutSeconds=300, labelExpression='null'}, failedContainers=[]}
> > > > > 2017-07-26 18:20:57,570 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode04.our.net to /default-rack
> > > > > 2017-07-26 18:20:57,570 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=1
> > > > > 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=hadoop-worknode04.our.net numContainers=1 #asks=1
> > > > > 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=/default-rack numContainers=1 #asks=2
> > > > > 2017-07-26 18:20:57,573 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=1 resourceName=* numContainers=1 #asks=3
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] INFO util.RackResolver - Resolved hadoop-worknode01.our.net to /default-rack
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - Added priority=2
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=hadoop-worknode01.our.net numContainers=1 #asks=4
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=/default-rack numContainers=1 #asks=5
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG impl.AMRMClientImpl - addResourceRequest: applicationId= priority=2 resourceName=* numContainers=1 #asks=6
> > > > > 2017-07-26 18:20:57,574 [AmExecutor-006] DEBUG actions.QueueExecutor - Completed org.apache.slider.server.appmaster.actions.ReviewAndFlexApplicationSize@9f674ac name='flexCluster', delay=0, attrs=4, sequenceNumber=6}
> > > > > 2017-07-26 18:21:22,803 [1708490318@qtp-1547994163-0] INFO state.AppState - app state clusterNodes {slider-appmaster={container_e26_1500976506429_0025_01_000001=container_e26_1500976506429_0025_01_000001: 3
> > > > > state: 3
> > > > > role: slider-appmaster
> > > > > host: hadoop-worknode04.our.net
> > > > > hostURL: http://hadoop-worknode04.our.net:52967
> > > > > }}
> > > > >
> > > > > To me this looks nice, but I never see a placement request at our
> > > > > yarn resource manager node. It looks like slider decides on its own
> > > > > that it cannot place the requests. But I cannot see why it cannot
> > > > > place the requests or why the requests cannot be satisfied.
> > > > >
> > > > > And now I am out of ideas ... do you have any suggestions for how I
> > > > > can find out why the slider AM does not place the requests?
> > > > >
> > > > > Any ideas are welcome.
> > > > >
> > > > > By the way, the cluster has enough resources, as the scheduler page
> > > > > shows:
> > > > >
> > > > >     Instantaneous Fair Share: <memory:1441792, vCores:224>
> > > > >
> > > > > Thanks for any help in advance.
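> > > > > P.S. One more thing I still want to rule out: whether every NM
> > > > > really has headroom for a single <memory:64512, vCores:1> container.
> > > > > Something like this should show it (the node id is just an example
> > > > > from our cluster; the NM port may differ):
> > > > >
> > > > >     # Reports the node's total and used memory/vcores as the RM sees them
> > > > >     yarn node -status hadoop-worknode01.our.net:8041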