Re: Spark 2.4.0 Master going down

Lokesh Kumar Padhnavis Thu, 28 Feb 2019 01:55:52 -0800

Hi Akshay

Thanks for the response. Please find the answers to your questions below.


1. We are running Spark in cluster mode, with Spark's standalone cluster
manager as the cluster manager.
2. All the ports are open. We preconfigure the ports on which communication
should happen and modify the firewall rules to allow traffic on these ports.
(Everything works fine until the Spark master goes down after 60 minutes.)
3. Memory consumption of all the components:

Spark Master:
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00   0.00  12.91  35.11  97.08  95.80      5    0.239     2    0.197    0.436
Spark Worker:
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
 51.64   0.00  46.66  27.44  97.57  95.85     10    0.381     2    0.233    0.613
Spark Submit Process (Driver):
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  63.57  93.82  26.29  98.24  97.53   4663  124.648   109   20.910  145.558
Spark executor (Coarse grained):
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  69.77  17.74  31.13  95.67  90.44   7353  556.888     5    1.572  558.460
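
The column layout above matches that of jstat -gcutil, so the figures can be
compared mechanically. The following is only a minimal sketch in Python and is
not part of our setup: the helper names parse_gcutil and gc_summary are
hypothetical, and the two sample rows are copied from the driver and executor
output above. It derives the average pause per collection for each process.

```python
# Hypothetical helpers for reading one header/value pair in the
# jstat -gcutil column layout (S0 S1 E O M CCS YGC YGCT FGC FGCT GCT).

def parse_gcutil(header: str, values: str) -> dict:
    """Map each column name to its numeric value."""
    names = header.split()
    numbers = [float(v) for v in values.split()]
    if len(names) != len(numbers):
        raise ValueError("header and value column counts differ")
    return dict(zip(names, numbers))


def gc_summary(stats: dict) -> dict:
    """Derive average pause times (seconds) from the cumulative counters."""
    ygc, fgc = stats["YGC"], stats["FGC"]
    return {
        "avg_young_pause_s": stats["YGCT"] / ygc if ygc else 0.0,
        "avg_full_pause_s": stats["FGCT"] / fgc if fgc else 0.0,
        "total_gc_s": stats["GCT"],
    }


HEADER = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT"
# Rows copied from the driver and executor output in this mail.
DRIVER = "0.00 63.57 93.82 26.29 98.24 97.53 4663 124.648 109 20.910 145.558"
EXECUTOR = "0.00 69.77 17.74 31.13 95.67 90.44 7353 556.888 5 1.572 558.460"

if __name__ == "__main__":
    for label, row in (("driver", DRIVER), ("executor", EXECUTOR)):
        print(label, gc_summary(parse_gcutil(HEADER, row)))
```

The counters are cumulative since JVM start, so they only become meaningful
when set against the uptime of each process.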

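To illustrate point 2 (fixed ports plus matching firewall rules), here is a
rough sketch in Python of how the port ranges to open could be worked out.
spark.driver.port, spark.blockManager.port and spark.port.maxRetries are
standard Spark properties, and 16 is the documented default for
spark.port.maxRetries. The port numbers and the helper names candidate_ports
and firewall_ranges are made up for this example and are not our actual
configuration.

```python
# Spark's documented default for spark.port.maxRetries.
DEFAULT_MAX_RETRIES = 16


def candidate_ports(base_port: int, max_retries: int = DEFAULT_MAX_RETRIES) -> range:
    """Ports Spark may bind to: base_port, base_port + 1, ..., base_port + max_retries."""
    if base_port == 0:
        raise ValueError("port 0 means a random port, so no fixed range exists")
    return range(base_port, base_port + max_retries + 1)


def firewall_ranges(ports: dict, max_retries: int = DEFAULT_MAX_RETRIES) -> dict:
    """Map each property name to the inclusive (first, last) port range to allow."""
    result = {}
    for name, base in ports.items():
        r = candidate_ports(base, max_retries)
        result[name] = (r[0], r[-1])
    return result


# Hypothetical port choices, for illustration only.
FIXED_PORTS = {
    "spark.driver.port": 40000,
    "spark.blockManager.port": 40020,
}

if __name__ == "__main__":
    for name, (first, last) in firewall_ranges(FIXED_PORTS).items():
        print(f"{name}: allow TCP {first}-{last}")
```

Because Spark increments the port on each bind retry, a rule that allows only
the base port can be too narrow when several applications share a host.
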


On Thu, Feb 28, 2019 at 3:13 PM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:

> Hi Lokesh,
>
> Please provide further information to help identify the issue.
>
> 1) Are you running in standalone mode or cluster mode? If cluster, is it
> Spark master/slave or YARN/Mesos?
> 2) Have you tried checking if all ports between your master and the
> machine with IP 192.168.43.167 are accessible?
> 3) Have you checked the memory consumption of the executors/driver running
> in the cluster?
>
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Wed, Feb 27, 2019 at 8:27 PM lokeshkumar <lok...@dataken.net> wrote:
>
>> Hi All
>>
>> We are running Spark version 2.4.0 and we run a few Spark streaming jobs
>> listening on Kafka topics. We receive an average of 10-20 msgs per
>> second.
>> The Spark master has been going down after 1-2 hours of running, and the
>> Spark executors also get killed.
>>
>> This was not happening with Spark 2.1.1; it started happening with Spark
>> 2.4.0. Any help/suggestion is appreciated.
>>
>> The exception that we see is
>>
>> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>>         at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>>         at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 192.168.43.167:40007 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
>>         at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
>>         at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
>>         at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>         at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
>>         at scala.util.Try$.apply(Try.scala:192)
>>         at scala.util.Failure.recover(Try.scala:216)
>>         at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
>>         at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
>>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>         at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>>         at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
>>         at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>         at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>         at scala.concurrent.Promise$class.complete(Promise.scala:55)
>>         at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
>>         at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
>>         at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
>>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
>>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
>>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>>         at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>>         at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>>         at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
>>         at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>         at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
>>         at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>         at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>         at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>         at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
>>         at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:157)
>>         at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:206)
>>         at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:243)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>         at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.43.167:40007 in 120 seconds
>>

-- 
Regards
-Lokesh
