I use Spark caching via the `persist` method. I cache several RDDs, some of which are quite small (about 300 KB). Most of the time it works well and the whole job takes about 1 s, but sometimes it takes about 40 s to store those 300 KB in the cache.
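For reference, a minimal sketch of the caching pattern described here (the RDD contents, variable names, and the `spark` session are illustrative assumptions, not the actual job):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkSession named `spark`; the parallelized range
// is a placeholder for the small (~300 KB) datasets described above.
val smallRdd = spark.sparkContext.parallelize(1 to 10000)

// Cache deserialized in memory only (the default storage level for RDDs).
smallRdd.persist(StorageLevel.MEMORY_ONLY)

// One of the alternatives tried below: serialized, spilling to disk if needed.
// smallRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

// The first action materializes the cache; MemoryStore then logs the
// stored blocks, as in the executor log excerpt below.
smallRdd.count()
```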
If I go to SparkUI -> Cache, I can see the percentage increase until 83% (250 KB) and then it stalls for a while. If I check the event timeline in the Spark UI, I can see that when this happens there is one node where tasks take a very long time. It can be any node in the cluster; it's not always the same one. In the Spark executor logs I can see that it takes about 40 s to store 3.7 KB when this problem occurs:

INFO 2018-08-23 12:46:58 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1705_23 locally
INFO 2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.memory.MemoryStore: Block rdd_1692_7 stored as bytes in memory (estimated size 3.7 KB, free 1048.0 MB)
INFO 2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1692_7 locally

I have tried MEMORY_ONLY, MEMORY_AND_DISK_SER, and so on, with the same results. I have checked disk I/O (although with MEMORY_ONLY I guess it shouldn't matter) and I can't see any problem. This happens randomly, but in roughly 25% of the jobs. Any idea what could be happening?