Sorry, my bad. I checked the persisted jobmanager logs and can see that the job
was still being restarted at 15:31 and then at 15:36. If I hadn't terminated
the cluster, I believe the Flink job / YARN app would eventually have exited
as failed.

On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <juho.au...@rovio.com> wrote:

> Thanks again, Gary.
>
> It's true that I only let the job remain in the stuck state for something
> between 10 and 15 minutes. Then I shut down the cluster.
>
> But: if the restart strategy is being applied, shouldn't I have seen those
> messages in the job manager log? In my case it stayed completely quiet from
> ~2018-03-28 15:27 until I terminated it at ~2018-03-28 15:36.
>
> Do you happen to know what that BlobServerConnection error means in the
> code? Could it lead to some unrecoverable state (where neither a restart is
> attempted nor the job is failed for good)?
>
> On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <g...@data-artisans.com> wrote:
>
>> Hi Juho,
>>
>> The log message
>>
>>   Could not allocate all requires slots within timeout of 300000 ms.
>> Slots required: 20, slots allocated: 8
>>
>> indicates that you do not have enough resources left in your cluster. Can
>> you verify that the YARN cluster does not reach its maximum capacity after
>> you start the job submission? You can also try submitting the job with a
>> lower parallelism.
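>>
>> As a quick sketch (the exact commands depend on your Hadoop setup), you can
>> check what is left in the cluster before submitting, e.g. with
>>
>>   yarn node -list
>>   yarn application -list
>>
>> and then retry with a smaller parallelism, e.g. -p 8, since your log shows
>> that 8 slots could still be allocated.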
>>
>> I think the reason why the YARN application is not immediately shown as
>> failed
>> is that your restart strategy attempts to start the job 3 times. On every
>> attempt the job is blocked on the slot allocation timeout for at least
>> 300000 ms
>> (5 minutes). I have tried submitting examples/streaming/WordCount.jar
>> with the
>> same restart strategy on EMR, and the CLI only returns after around 20
>> minutes.
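>>
>> As a rough back-of-envelope (assuming the 3 attempts and 30 s delay from
>> your launch command): each attempt blocks on the 5 minute slot allocation
>> timeout, so counting the initial attempt as well that is roughly
>> 4 * 5 min + 3 * 30 s ≈ 21.5 minutes before the client gives up, which is in
>> line with the ~20 minutes I observed.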
>>
>> As a side note, beginning with Flink 1.5 you do not need to specify -yn and
>> -ys because resource allocation is dynamic by default (FLIP-6). The
>> parameter -yst is deprecated and should not be needed either.
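>>
>> For example, the submission could be reduced to something like this (just a
>> sketch, with your-job.jar standing in for the actual jar):
>>
>>   flink run -m yarn-cluster -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
>>     -p ${PARALLELISM} your-job.jar
>>
>> and the required TaskManagers are then allocated on demand, based on the
>> requested parallelism.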
>>
>> Best,
>> Gary
>>
>> On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <juho.au...@rovio.com> wrote:
>>
>>> I built a new Flink distribution from release-1.5 branch yesterday.
>>>
>>> The first time I tried to run a job with it, it ended up in some stalled
>>> state: the job didn't manage to (re)start, and what makes it worse is that
>>> it didn't exit as failed either.
>>>
>>> The next time I tried running the same job (but on a new EMR cluster, all
>>> from scratch), it just worked normally.
>>>
>>> On the problematic run, the YARN job was started and the Flink UI was being
>>> served, but the UI kept showing status CREATED for all sub-tasks and
>>> nothing seemed to be happening.
>>>
>>> I found this in the job manager log first (could be unrelated):
>>>
>>> 2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
>>> at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>>> at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>> at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>> at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>>> at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>> at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>> at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>> at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>> at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>>> at akka.dispatch.OnComplete.internal(Future.scala:258)
>>> at akka.dispatch.OnComplete.internal(Future.scala:256)
>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>> at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>>> at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>> at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>>> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>> at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>> at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>> at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>> at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>> at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>> at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>> at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>> at java.lang.Thread.run(Thread.java:748)
>>>
>>>
>>> After this there was:
>>>
>>> 2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
>>>
>>>
>>> And some time after that:
>>>
>>> 2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection  - GET operation failed
>>> java.io.EOFException: Premature end of GET request
>>> at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>>> at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>>>
>>>
>>> Task manager logs didn't have any errors.
>>>
>>>
>>> Is that error about BlobServerConnection severe enough to make the job get
>>> stuck like this? It seems like a Flink bug, at least in that the job just
>>> gets stuck and neither restarts nor makes the YARN app exit as failed?
>>>
>>>
>>>
>>> My launch command is basically:
>>>
>>> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT}
>>> -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY}
>>> -yst -yD restart-strategy=fixed-delay -yD 
>>> restart-strategy.fixed-delay.attempts=3
>>> -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@
>>>
>>>
>>> I'm also setting this to fix some classloading error (with the previous
>>> build, which still works):
>>> -yD.classloader.resolve-order=parent-first
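>>>
>>> (For completeness: the same restart strategy could equivalently be set in
>>> flink-conf.yaml instead of via -yD; this is just a sketch of what the
>>> command above configures:
>>>
>>>   restart-strategy: fixed-delay
>>>   restart-strategy.fixed-delay.attempts: 3
>>>   restart-strategy.fixed-delay.delay: 30 s
>>> )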
>>>
>>>
>>> Cluster was AWS EMR, release 5.12.0.
>>>
>>> Thanks.
>>>
>>
>>
>
