Hello All,

We have a Spark Streaming job that reads data from the DB (three tables) and
caches them in memory only at the start; after that it happily carries out the
incremental calculation on the new data. What we've noticed occasionally is
that one of the RDDs caches only 90% of its data, so at each execution the
remaining 10% has to be recomputed. This is very expensive, as it requires
establishing a connection to the DB and transforming the data again. We are
allocating more than enough memory to each executor (4 executors with 2 GB of
memory each). Have you come across this issue? Do you know what might cause it?
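
For context, the caching is set up roughly like the sketch below (Scala). The
table names, JDBC URL and credentials are placeholders, and I'm assuming the
default MEMORY_ONLY storage level that cache() uses for RDDs; the real job
differs in the details:

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("incremental-job").getOrCreate()

    // Each reference table is loaded once at startup and cached in memory.
    def loadTable(table: String): DataFrame =
      spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://dbhost:5432/mydb") // placeholder URL
        .option("dbtable", table)
        .option("user", "user")                              // placeholder
        .option("password", "password")                      // placeholder
        .load()
        .persist(StorageLevel.MEMORY_ONLY)  // memory only, no spill to disk

    val tableA = loadTable("table_a")  // placeholder table names
    val tableB = loadTable("table_b")
    val tableC = loadTable("table_c")

    // Force materialization so the caches are populated before streaming starts.
    Seq(tableA, tableB, tableC).foreach(_.count())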

Another observation:
When I started the job yesterday it cached 100% of the RDD, but looking at it
today it shows 90%. What happened to that 10% of the data?
