Sorry, my bad. I checked the persisted jobmanager logs and can see that the job was still being restarted at 15:31 and then at 15:36. If I hadn't terminated the cluster, I believe the flink job / yarn app would've eventually exited as failed.
On Thu, Mar 29, 2018 at 4:49 PM, Juho Autio <juho.au...@rovio.com> wrote:
> Thanks again, Gary.
>
> It's true that I only let the job remain in the stuck state for something
> between 10-15 minutes. Then I shut down the cluster.
>
> But: if the restart strategy is being applied, shouldn't I have seen those
> messages in the job manager log? In my case it kept all quiet since
> ~2018-03-28 15:27 and I terminated it at ~2018-03-28 15:36.
>
> Do you happen to know what that BlobServerConnection error means in the
> code? Could it lead to some unrecoverable state (where neither a restart
> is attempted, nor is the job failed for good)?
>
> On Thu, Mar 29, 2018 at 4:39 PM, Gary Yao <g...@data-artisans.com> wrote:
>
>> Hi Juho,
>>
>> The log message
>>
>>     Could not allocate all requires slots within timeout of 300000 ms.
>>     Slots required: 20, slots allocated: 8
>>
>> indicates that you do not have enough resources left in your cluster.
>> Can you verify that the YARN cluster does not reach its maximum capacity
>> after you start the job submission? You can also try submitting the job
>> with a lower parallelism.
>>
>> I think the reason why the YARN application is not immediately shown as
>> failed is that your restart strategy attempts to start the job 3 times.
>> On every attempt the job is blocked on the slot allocation timeout for
>> at least 300000 ms (5 minutes). I have tried submitting
>> examples/streaming/WordCount.jar with the same restart strategy on EMR,
>> and the CLI only returns after around 20 minutes.
>>
>> As a side note, beginning from Flink 1.5, you do not need to specify
>> -yn -ys because resource allocations are dynamic by default (FLIP-6).
>> The parameter -yst is deprecated and should not be needed either.
>>
>> Best,
>> Gary
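
A rough back-of-the-envelope sketch of the timeline Gary describes. The
variable names and the exact accounting (one initial scheduling attempt plus
three fixed-delay restarts, each blocked on the 300000 ms slot request
timeout) are assumptions based only on this thread, not on Flink documentation:

    # Sketch only: estimate how long the client stays blocked before the
    # job is finally declared failed, given the fixed-delay settings used
    # in the launch command quoted below.
    SLOT_REQUEST_TIMEOUT_MIN=5   # "timeout of 300000 ms" from the log message
    RESTART_ATTEMPTS=3           # restart-strategy.fixed-delay.attempts=3
    RESTART_DELAY_SEC=30         # restart-strategy.fixed-delay.delay=30 s

    # 1 initial attempt + 3 restarts, each blocked ~5 min on slot allocation,
    # with a 30 s pause before every restart (assumed accounting).
    TOTAL_MIN=$(( (1 + RESTART_ATTEMPTS) * SLOT_REQUEST_TIMEOUT_MIN + RESTART_ATTEMPTS * RESTART_DELAY_SEC / 60 ))
    echo "expected time until the job is declared failed: ~${TOTAL_MIN} minutes"
    # prints ~21 minutes, roughly in line with the ~20 minutes Gary observed
    # and with the restart timestamps (~15:26, ~15:31, ~15:36) in this thread.
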
>>
>> On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <juho.au...@rovio.com> wrote:
>>
>>> I built a new Flink distribution from the release-1.5 branch yesterday.
>>>
>>> The first time I tried to run a job with it, it ended up in a stalled
>>> state so that the job didn't manage to (re)start, but what makes it
>>> worse is that it didn't exit as failed either.
>>>
>>> The next time I tried running the same job (but a new EMR cluster & all
>>> from scratch) it just worked normally.
>>>
>>> On the problematic run, the YARN job was started and the Flink UI was
>>> being served, but the Flink UI kept showing status CREATED for all
>>> sub-tasks and nothing seemed to be happening.
>>>
>>> I found this in the job manager log first (could be unrelated):
>>>
>>> 2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
>>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 20, slots allocated: 8
>>>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>>>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>>>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>     at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>>>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>>>     at akka.dispatch.OnComplete.internal(Future.scala:258)
>>>     at akka.dispatch.OnComplete.internal(Future.scala:256)
>>>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>>>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>>>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>>>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>>>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>>>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>>>     at java.lang.Thread.run(Thread.java:748)
>>>
>>> After this there was:
>>>
>>> 2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
>>>
>>> And some time after that:
>>>
>>> 2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection - GET operation failed
>>> java.io.EOFException: Premature end of GET request
>>>     at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>>>     at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>>>
>>> Task manager logs didn't have any errors.
>>>
>>> Is that BlobServerConnection error severe enough to make the job get
>>> stuck like this? It seems like a Flink bug, at least in that it just
>>> gets stuck and neither restarts nor makes the YARN app exit as failed.
>>>
>>> My launch command is basically:
>>>
>>> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} \
>>>   -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
>>>   -yst -yD restart-strategy=fixed-delay \
>>>   -yD restart-strategy.fixed-delay.attempts=3 \
>>>   -yD "restart-strategy.fixed-delay.delay=30 s" \
>>>   -p ${PARALLELISM} $@
>>>
>>> I'm also setting this to fix some classloading error (with the previous
>>> build, which still works):
>>> -yD.classloader.resolve-order=parent-first
>>>
>>> Cluster was AWS EMR, release 5.12.0.
>>>
>>> Thanks.
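
For reference, a sketch of what the same launch could look like on Flink 1.5
with Gary's note applied: -yn, -ys and the deprecated -yst dropped, and the
classloading property passed in the same space-separated -yD style as the
other settings (the original command attaches it as -yD.classloader...).
This is untested, so treat it as an assumption rather than a verified command:

    # Sketch (assumption): Flink 1.5 / FLIP-6 launch without -yn/-ys/-yst,
    # relying on dynamic resource allocation and -p for parallelism.
    flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster \
      -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
      -yD restart-strategy=fixed-delay \
      -yD restart-strategy.fixed-delay.attempts=3 \
      -yD "restart-strategy.fixed-delay.delay=30 s" \
      -yD classloader.resolve-order=parent-first \
      -p ${PARALLELISM} "$@"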