Hi Juho,

The log message

    Could not allocate all requires slots within timeout of 300000 ms.
    Slots required: 20, slots allocated: 8

indicates that your cluster does not have enough resources left. Can you
verify that the YARN cluster has not reached its maximum capacity after you
submit the job? You can also try submitting the job with a lower parallelism.

I think the reason the YARN application is not immediately shown as failed
is that your restart strategy attempts to restart the job 3 times. On every
attempt the job blocks on the slot allocation timeout for at least
300000 ms (5 minutes). I tried submitting examples/streaming/WordCount.jar
with the same restart strategy on EMR, and the CLI only returned after
around 20 minutes.

As a side note, beginning with Flink 1.5 you no longer need to specify -yn
and -ys because resource allocation is dynamic by default (FLIP-6). The
parameter -yst is deprecated and should not be needed either.

Best,
Gary

On Thu, Mar 29, 2018 at 8:59 AM, Juho Autio <juho.au...@rovio.com> wrote:

> I built a new Flink distribution from the release-1.5 branch yesterday.
>
> The first time I tried to run a job with it, it ended up in a stalled
> state: the job didn't manage to (re)start, and what makes it worse is
> that it didn't exit as failed either.
>
> The next time I ran the same job (on a new EMR cluster, all from scratch)
> it just worked normally.
>
> On the problematic run, the YARN job was started and the Flink UI was
> being served, but the Flink UI kept showing status CREATED for all
> sub-tasks and nothing seemed to be happening.
>
> I found this in the job manager log first (it could be unrelated):
>
> 2018-03-28 15:26:17,449 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>     - Job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93) switched from state RUNNING to FAILING.
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms.
> Slots required: 20, slots allocated: 8
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.lambda$scheduleEager$36(ExecutionGraph.java:984)
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
>     at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$ResultConjunctFuture.handleCompletedFuture(FutureUtils.java:551)
>     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:789)
>     at akka.dispatch.OnComplete.internal(Future.scala:258)
>     at akka.dispatch.OnComplete.internal(Future.scala:256)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>     at java.lang.Thread.run(Thread.java:748)
>
> After this there was:
>
> 2018-03-28 15:26:17,521 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
>     - Restarting the job UniqueIdStream (43ed4ace55974d3c486452a45ee5db93).
>
> And some time after that:
>
> 2018-03-28 15:27:39,125 ERROR org.apache.flink.runtime.blob.BlobServerConnection
>     - GET operation failed
> java.io.EOFException: Premature end of GET request
>     at org.apache.flink.runtime.blob.BlobServerConnection.get(BlobServerConnection.java:275)
>     at org.apache.flink.runtime.blob.BlobServerConnection.run(BlobServerConnection.java:117)
>
> The task manager logs didn't have any errors.
>
> Is that error about the BlobServerConnection severe enough to make the
> job get stuck like this? It seems like a Flink bug, at least in that the
> job just gets stuck and neither restarts nor makes the YARN app exit as
> failed.
> My launch command is basically:
>
> flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster -yn ${NODE_COUNT} \
>     -ys ${SLOT_COUNT} -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
>     -yst -yD restart-strategy=fixed-delay \
>     -yD restart-strategy.fixed-delay.attempts=3 \
>     -yD "restart-strategy.fixed-delay.delay=30 s" -p ${PARALLELISM} $@
>
> I'm also setting this to fix a classloading error (with the previous
> build, which still works):
>
> -yD classloader.resolve-order=parent-first
>
> The cluster was AWS EMR, release 5.12.0.
>
> Thanks.
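The roughly 20 minutes Gary observed before the CLI returned is consistent with the restart strategy in the launch command above. A back-of-the-envelope sketch (an estimate only, not a trace of Flink's internal scheduling; it assumes the initial submission and every restart attempt each wait out the full slot allocation timeout):

```shell
# Values taken from the log message and the launch command in the thread.
slot_request_timeout_ms=300000   # 5 minutes, from the log message
restart_attempts=3               # restart-strategy.fixed-delay.attempts=3
restart_delay_ms=30000           # "restart-strategy.fixed-delay.delay=30 s"

# Initial attempt plus 3 restarts, each blocked on the timeout, with the
# fixed delay in between restarts.
total_ms=$(( (1 + restart_attempts) * slot_request_timeout_ms + restart_attempts * restart_delay_ms ))
echo "${total_ms} ms"            # 1290000 ms, i.e. about 21.5 minutes
```

That lands close to the ~20 minutes reported for the WordCount experiment, which supports the explanation that the application is alive but blocked rather than failed.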
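Following Gary's side note, a Flink 1.5 variant of the launch command above could drop -yn, -ys, and -yst. This is a sketch only: the memory and parallelism placeholders are carried over from the original command, and the command of course requires a running YARN cluster to do anything.

```shell
# Hypothetical FLIP-6-style launch command for Flink 1.5:
# -yn/-ys omitted (resource allocation is dynamic by default),
# -yst omitted (deprecated). Other flags unchanged from the thread.
flink-${FLINK_VERSION}/bin/flink run -m yarn-cluster \
    -yjm ${JOB_MANAGER_MEMORY} -ytm ${TASK_MANAGER_MEMORY} \
    -yD restart-strategy=fixed-delay \
    -yD restart-strategy.fixed-delay.attempts=3 \
    -yD "restart-strategy.fixed-delay.delay=30 s" \
    -p ${PARALLELISM} "$@"
```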