Hello, we have a question. We are seeing the exceptions below, and at the moment we are enabling a JVM profiler to look into GC activity on the workers; if you have any other suggestions, please let us know. We don't want to simply increase the RPC timeout (from 120 to, say, 600 seconds); we want to get to the reason why the workers time out.
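For reference (and as a stopgap only, not the fix being asked for), the timeout named in the exception can be raised through Spark configuration. A minimal sketch, assuming a spark-submit launch; the class and jar names are placeholders:

```shell
# Hypothetical submission: raise the ask timeout the exception refers to.
# spark.rpc.askTimeout falls back to spark.network.timeout when unset,
# so raising both keeps the two settings consistent.
spark-submit \
  --class com.example.MyApp \
  --conf spark.rpc.askTimeout=600s \
  --conf spark.network.timeout=600s \
  my-app.jar
```

Again, this only widens the window in which a stalled executor (e.g. one in a long GC pause) can respond; it does not address the underlying stall.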
But the other question is that this timeout appears to cause data loss: data in the mappers does not make it to the reducers and back to the driver to collect, so the resiliency seems to be lost. Spark does not appear to retry the data at some later time; it gives up, and the input record is lost. -Andrew

16/02/08 08:33:37 ERROR YarnScheduler: Lost executor 265 on ip-172-20-35-115.ec2.internal: remote Rpc client disassociated
[Stage 4313813:> (0 + 94) / 95]
16/02/08 08:35:35 ERROR ContextCleaner: Error cleaning broadcast 4311376
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
    at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
    at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:136)
    at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:228)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
    at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:67)
    at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:214)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:170)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:161)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:161)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1136)
    at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:154)

[2/8/16, 9:25 AM] Chad Ladensack (c...@revcontent.com): I'll still send over the log file
[2/8/16, 9:25 AM] Chad Ladensack (c...@revcontent.com): Some guy online was asking about garbage collection and how it can throw off executors
[2/8/16, 9:25 AM] Chad Ladensack (c...@revcontent.com): http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
[2/8/16, 9:26 AM] Chad Ladensack (c...@revcontent.com): He referenced the tuning guide and it shows how to measure it
[2/8/16, 9:26 AM] Chad Ladensack (c...@revcontent.com): idk, maybe worth looking into?
[2/8/16, 9:26 AM] Chad Ladensack (c...@revcontent.com): I'll get the log and email it now
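Following the GC-measurement suggestion in the tuning guide linked above, one way to get per-executor GC logs is to pass the JVM's GC logging flags through Spark's extraJavaOptions. A sketch, assuming a spark-submit launch on a Java 7/8 JVM (class and jar names are placeholders):

```shell
# Hypothetical submission: turn on verbose GC logging on every executor.
# The resulting GC timestamps can be lined up against the executor-lost
# timestamps (e.g. 08:33:37 above) to see whether long pauses precede them.
spark-submit \
  --class com.example.MyApp \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```

The GC output lands in each executor's stdout, viewable from the worker/executor logs in the YARN or Spark UI.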