Hi all,

I’m running a cluster consisting of a master and four slaves. The cluster runs a
Spark application that reads data from a Kafka topic over a window of time and
writes the results back to Kafka. Checkpointing is enabled and uses HDFS.
However, although Spark periodically commits checkpoints to HDFS, I cannot see
any recovery traces in the log4j logs (even when configured with DEBUG
verbosity). So how does Spark handle slave failures with respect to state
persistence and recovery when window operations are used? Will the state be
loaded from HDFS using the last snapshot and redistributed across the n
executors for recovery?
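
For reference, here is a simplified sketch of the kind of job I mean (written
against the DStream API; the checkpoint path, topic names, and broker address
below are just placeholders, not my real configuration):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import kafka.serializer.StringDecoder

object WindowedKafkaJob {
  // placeholder checkpoint location on HDFS
  val checkpointDir = "hdfs:///tmp/streaming-checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("windowed-kafka-job")
    val ssc  = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    // placeholder broker and topic names
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("input-topic"))

    // 5-minute window, sliding every batch interval
    val windowed = stream.map(_._2).window(Minutes(5), Seconds(10))

    windowed.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // one producer per partition; broker and serializers are placeholders
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(v => producer.send(new ProducerRecord[String, String]("output-topic", v)))
        producer.close()
      }
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // getOrCreate rebuilds the context (DStream graph plus window data)
    // from the checkpoint directory if one exists, otherwise creates it fresh
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}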

Thanks,
Dominik
