Hello All,

We have a Spark Streaming job that reads data from a DB (three tables) and caches them in memory only at the start; after that it happily carries out the incremental calculation with the new data. What we've noticed occasionally is that one of the RDDs caches only 90% of its data, so at each execution the remaining 10% has to be recomputed. This is very expensive, since it means re-establishing a connection to the DB and re-transforming the data. We are allocating more than enough memory to each executor (4 executors, 2 GB each).

Have you come across this issue? Do you know what might cause it?
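For reference, here is a minimal sketch of the kind of setup described above, assuming the tables are read over JDBC and persisted with the MEMORY_ONLY storage level; the connection details, table name, and object name are hypothetical, not taken from our actual job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical sketch of the caching setup described above.
object CacheTablesSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CacheTablesSketch").getOrCreate()

    // Read one of the three reference tables over JDBC (assumed source).
    val tableDf = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/mydb") // hypothetical URL
      .option("dbtable", "reference_table_1")                // hypothetical table
      .option("user", "user")
      .option("password", "password")
      .load()

    // Persist in memory only; with this storage level, partitions that do
    // not fit (or are evicted) are dropped and must be recomputed from the
    // DB on later use, which is the expensive recomputation described above.
    tableDf.persist(StorageLevel.MEMORY_ONLY)

    // Force materialization so the cache is populated up front.
    tableDf.count()

    // ... the incremental streaming computation then joins against tableDf ...
  }
}
```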
Another observation: when I started the job yesterday it cached 100% of the RDD, but looking at it today it shows 90%. What happened to that 10% of the data?