Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Our additional question on checkpointing is basically the logistics of it -- At which point does the data get written into checkpointing? Is it written as soon as the driver program retrieves an RDD from Kafka (or another source)? Or, is it written after that RDD has been processed and we're

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Cody Koeninger
You'll resume and re-process the rdd that didnt finish On Fri, Aug 14, 2015 at 1:31 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: Our additional question on checkpointing is basically the logistics of it -- At which point does the data get written into checkpointing? Is it written

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-08-14 Thread Dmitry Goldenberg
Thanks, Cody. It sounds like Spark Streaming has enough state info to know how many batches have been processed and if not all of them then the RDD is 'unfinished'. I wonder if it would know whether the last micro-batch has been fully processed successfully. Hypothetically, the driver program

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Sean Owen
If you've set the checkpoint dir, it seems like indeed the intent is to use a default checkpoint interval in DStream: private[streaming] def initialize(time: Time) { ... // Set the checkpoint interval to be slideDuration or 10 seconds, which ever is larger if (mustCheckpoint

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
It looks like there's an issue with the 'Parameters' pojo I'm using within my driver program. For some reason that needs to be serializable, which is odd. java.io.NotSerializableException: com.kona.consumer.kafka.spark.Parameters Giving it another whirl though having to make it serializable

Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in those directories nor am I seeing the effects I'd expect from checkpointing. I'd expect any data that comes into

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Dmitry Goldenberg
I'll check the log info message.. Meanwhile, the code is basically public class KafkaSparkStreamingDriver implements Serializable { .. SparkConf sparkConf = createSparkConf(appName, kahunaEnv); JavaStreamingContext jssc = params.isCheckpointed() ?

Re: Checkpointing doesn't appear to be working for direct streaming from Kafka

2015-07-31 Thread Cody Koeninger
Show us the relevant code On Fri, Jul 31, 2015 at 12:16 PM, Dmitry Goldenberg dgoldenberg...@gmail.com wrote: I've instrumented checkpointing per the programming guide and I can tell that Spark Streaming is creating the checkpoint directories but I'm not seeing any content being created in