Re: Problems with too many checkpoint files with Spark Streaming

2016-01-06 Thread Tathagata Das
Could you show a sample of the file names? Several things use UUIDs, so it would be good to see what the 100s of directories being generated every second are. If you are checkpointing every 400s, then there shouldn't be checkpoint directories written every second. They should be hu

Problems with too many checkpoint files with Spark Streaming

2016-01-06 Thread Jan Algermissen
Hi, we are running a streaming job that processes about 500 events per 20s batch and uses updateStateByKey to accumulate Web sessions (with a 30-minute lifetime). The checkpoint interval is set to 20 x batch interval, that is, 400s. Cluster size is 8 nodes. We are having trouble with the amou
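
For readers following the thread, here is a rough sketch of the kind of per-key update logic the setup above implies. It is plain Python with no Spark dependency, and it is not the poster's code. The names update_session, SESSION_TTL_S and the explicit now argument are assumptions made for illustration. Only the 20s batch interval, the 20x checkpoint multiplier and the 30-minute session lifetime come from the message.

```python
# Illustrative sketch only; names and state layout are hypothetical.
BATCH_INTERVAL_S = 20                            # from the message: 20s batches
CHECKPOINT_INTERVAL_S = 20 * BATCH_INTERVAL_S    # from the message: 20 x batch interval
SESSION_TTL_S = 30 * 60                          # from the message: 30-minute session lifetime


def update_session(new_events, state, now):
    """Fold one batch of event timestamps (in seconds) into a session state.

    new_events: timestamps seen for this key in the current batch
    state:      None, or {"count": int, "last_seen": number}
    now:        current batch time in seconds (assumed to be supplied by the caller)

    Returns the new state, or None to signal that the session should be dropped.
    """
    if new_events:
        previous_count = state["count"] if state else 0
        latest = max(new_events)
        if state:
            latest = max(latest, state["last_seen"])
        return {"count": previous_count + len(new_events), "last_seen": latest}
    if state is None:
        # No events and no prior state: nothing to keep.
        return None
    if now - state["last_seen"] > SESSION_TTL_S:
        # Session has been idle longer than its lifetime: expire it.
        return None
    return state


session = update_session([100, 110], None, now=120)
print(CHECKPOINT_INTERVAL_S, session)
```

In a real job, a function of this shape (taking the new values and the previous state) would be what gets passed to updateStateByKey, where returning None removes the key from the state. How the batch time reaches the function is left open here.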