To add more info: this project is on an older version of Spark, 1.5.0, and an older version of Kafka, 0.8.2.1 (the 2.10-0.8.2.1 build).
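For context, here is roughly how the checkpointed setup is wired up. This is a minimal sketch against the Spark 1.5.0 / Kafka 0.8 receiver-based API; the object name, ZooKeeper quorum, group id, topic, batch interval, and HDFS path are all placeholders, not our real values:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object CheckpointedConsumer {
  // Placeholder path; the real checkpoint dir lives elsewhere in HDFS
  val checkpointDir = "hdfs:///tmp/checkpoints/my-consumer"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-consumer")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // enable metadata checkpointing to HDFS

    // Receiver-based Kafka 0.8 stream; quorum/group/topic are placeholders
    val stream = KafkaUtils.createStream(
      ssc, "zk1:2181", "my-group", Map("my-topic" -> 1))
    stream.map(_._2).foreachRDD { rdd =>
      // per-batch transformations and actions go here
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a clean start this calls createContext; after a driver restart it
    // rebuilds the context (DStream lineage, offsets, pending batches) from
    // the checkpoint files instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}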
On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg <dgoldenb...@kmwllc.com> wrote:

> Hi,
>
> I've got 3 questions/issues regarding checkpointing; I was hoping someone
> could help shed some light on this.
>
> We've got a Spark Streaming consumer consuming data from a Kafka topic; it
> works fine generally until I switch it to the checkpointing mode by calling
> the 'checkpoint' method on the context and pointing the checkpointing at a
> directory in HDFS.
>
> I can see that files get written to that directory; however, I don't see new
> Kafka content being processed.
>
> *Question 1.* Is it possible that the checkpointed consumer is off base
> in its understanding of where the offsets are on the topic, and how could I
> troubleshoot that? Is it possible that some "confusion" happens if a
> consumer is switched back and forth between checkpointed and not? How could
> we tell?
>
> *Question 2.* About spark.streaming.receiver.writeAheadLog.enable. By
> default this is false. "All the input data received through receivers
> will be saved to write ahead logs that will allow it to be recovered after
> driver failures." So if we don't set this to true, what *will* get saved
> into checkpointing and what data *will* be recovered upon the driver
> restarting?
>
> *Question 3.* We want the RDDs to be treated as successfully processed
> only once we have done all the necessary transformations and actions on the
> data. By default, will Spark Streaming checkpointing simply mark the
> topic offsets as having been processed once the data has been received by
> Spark? Or, once the data has been processed by the driver + the workers
> successfully? If the former, how can we configure checkpointing to do the
> latter?
>
> Thanks,
> - Dmitry
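P.S. For reference on Question 2, the flag named there is enabled on the SparkConf, and the 1.5 docs recommend a serialized, non-replicated storage level on the receiver once the WAL is on, since the log itself provides durability. A sketch of the two relevant changes, slotting into the createContext sketch above (ssc in scope; all names still placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf()
  .setAppName("checkpointed-consumer")
  // Receiver WAL: received blocks are also written durably under the
  // checkpoint directory before being acknowledged
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

// Non-replicated, serialized storage level, per the docs' recommendation
// when the write-ahead log is enabled:
val stream = KafkaUtils.createStream(
  ssc, "zk1:2181", "my-group", Map("my-topic" -> 1),
  StorageLevel.MEMORY_AND_DISK_SER)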