Hi Akshay,

Thanks for the response. Please find below the answers to your questions.

1. We are running Spark in cluster mode, with Spark's standalone cluster manager as the cluster manager.
2. All the ports are open. We preconfigure the ports on which communication should happen and modify the firewall rules to allow traffic on these ports. (The functionality is fine until the Spark master goes down after 60 mins.)
3. Memory consumption of all the components:

Spark Master:
     S0     S1      E      O      M    CCS    YGC     YGCT   FGC    FGCT      GCT
   0.00   0.00  12.91  35.11  97.08  95.80      5    0.239     2   0.197    0.436

Spark Worker:
     S0     S1      E      O      M    CCS    YGC     YGCT   FGC    FGCT      GCT
  51.64   0.00  46.66  27.44  97.57  95.85     10    0.381     2   0.233    0.613

Spark Submit Process (Driver):
     S0     S1      E      O      M    CCS    YGC     YGCT   FGC    FGCT      GCT
   0.00  63.57  93.82  26.29  98.24  97.53   4663  124.648   109  20.910  145.558

Spark executor (Coarse grained):
     S0     S1      E      O      M    CCS    YGC     YGCT   FGC    FGCT      GCT
   0.00  69.77  17.74  31.13  95.67  90.44   7353  556.888     5   1.572  558.460

On Thu, Feb 28, 2019 at 3:13 PM Akshay Bhardwaj <akshay.bhardwaj1...@gmail.com> wrote:

> Hi Lokesh,
>
> Please provide further information to help identify the issue.
>
> 1) Are you running in a standalone mode or cluster mode? If cluster, then
> is a spark master/slave or YARN/Mesos?
> 2) Have you tried checking if all ports between your master and the
> machine with IP 192.168.43.167 are accessible?
> 3) Have you checked the memory consumption of the executors/driver running
> in the cluster?
>
> Akshay Bhardwaj
> +91-97111-33849
>
> On Wed, Feb 27, 2019 at 8:27 PM lokeshkumar <lok...@dataken.net> wrote:
>
>> Hi All
>>
>> We are running Spark version 2.4.0 and we run few Spark streaming jobs
>> listening on Kafka topics. We receive an average of 10-20 msgs per
>> second.
>> And the Spark master has been going down after 1-2 hours of it running.
>> Exception is given below:
>> Along with that spark executors also get killed.
>>
>> This was not happening with Spark 2.1.1 it started happening with Spark
>> 2.4.0 any help/suggestion is appreciated.
>>
>> The exception that we see is
>>
>> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>>     at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>>     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>>     at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
>> Caused by: org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from 192.168.43.167:40007 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
>>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
>>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
>>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
>>     at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:216)
>>     at scala.util.Try$.apply(Try.scala:192)
>>     at scala.util.Failure.recover(Try.scala:216)
>>     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
>>     at scala.concurrent.Future$$anonfun$recover$1.apply(Future.scala:326)
>>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>     at org.spark_project.guava.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:293)
>>     at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
>>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>     at scala.concurrent.Promise$class.complete(Promise.scala:55)
>>     at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:157)
>>     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
>>     at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
>>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>>     at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:63)
>>     at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:78)
>>     at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>>     at scala.concurrent.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:55)
>>     at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>>     at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:54)
>>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:106)
>>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>>     at scala.concurrent.Promise$class.tryFailure(Promise.scala:112)
>>     at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:157)
>>     at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:206)
>>     at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:243)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>     at java.lang.Thread.run(Thread.java:745)
>> Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from 192.168.43.167:40007 in 120 seconds
>>
>> --
>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

--
Regards
-Lokesh
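
P.S. To make the four GC rows in answer 3 easier to compare, here is a minimal Python sketch that turns each row into per-collection averages. It assumes the rows are `jstat -gcutil` output from a Java 8 JVM (the column header matches that layout, but the message does not say which command produced it). The names `COLUMNS`, `SAMPLES`, `parse_row` and `summarize` are made up for illustration. The sketch only does arithmetic on the numbers already posted; it does not by itself explain the RPC timeout.

```python
# Column names, in the order they appear in the rows posted above.
# Assumption: this is the `jstat -gcutil` layout on Java 8.
COLUMNS = ["S0", "S1", "E", "O", "M", "CCS", "YGC", "YGCT", "FGC", "FGCT", "GCT"]

# Values copied verbatim from the message; the dictionary keys are just labels.
SAMPLES = {
    "master": "0.00 0.00 12.91 35.11 97.08 95.80 5 0.239 2 0.197 0.436",
    "worker": "51.64 0.00 46.66 27.44 97.57 95.85 10 0.381 2 0.233 0.613",
    "driver": "0.00 63.57 93.82 26.29 98.24 97.53 4663 124.648 109 20.910 145.558",
    "executor": "0.00 69.77 17.74 31.13 95.67 90.44 7353 556.888 5 1.572 558.460",
}


def parse_row(row):
    """Map one whitespace-separated row of numbers onto the column names."""
    values = [float(v) for v in row.split()]
    if len(values) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(values)}")
    return dict(zip(COLUMNS, values))


def summarize(row):
    """Return GC counts and the average pause per young/full collection."""
    stats = parse_row(row)
    young = int(stats["YGC"])
    full = int(stats["FGC"])
    return {
        "young_gcs": young,
        "full_gcs": full,
        # YGCT/FGCT are cumulative seconds; convert the averages to milliseconds.
        "avg_young_ms": 1000.0 * stats["YGCT"] / young if young else 0.0,
        "avg_full_ms": 1000.0 * stats["FGCT"] / full if full else 0.0,
        "total_gc_s": stats["GCT"],
    }


if __name__ == "__main__":
    for name, row in SAMPLES.items():
        s = summarize(row)
        print(
            f"{name:9s} young={s['young_gcs']:5d} (avg {s['avg_young_ms']:7.2f} ms)  "
            f"full={s['full_gcs']:4d} (avg {s['avg_full_ms']:7.2f} ms)  "
            f"total={s['total_gc_s']:8.3f} s"
        )
```

Running the script prints one summary line per JVM, which makes it easier to see that the driver has had far more full collections (109) than the master, worker, or executor. Whether that is related to the master going down would still need GC logs or a heap dump to confirm.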