To add more info: this project is on an older version of Spark, 1.5.0, and
an older version of Kafka, 0.8.2.1 (the 2.10-0.8.2.1 build).

On Tue, Apr 2, 2019 at 11:39 AM Dmitry Goldenberg <dgoldenb...@kmwllc.com>
wrote:

> Hi,
>
> I've got 3 questions/issues regarding checkpointing and was hoping someone
> could help shed some light on them.
>
> We've got a Spark Streaming consumer consuming data from a Kafka topic. It
> generally works fine until I switch it to checkpointing mode by calling
> the 'checkpoint' method on the context and pointing the checkpointing at a
> directory in HDFS.
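>
> For context, the setup looks roughly like this (simplified; the app name,
> paths, and batch interval are placeholders, and I'm showing the standard
> getOrCreate pattern):
>
>     import org.apache.spark.SparkConf
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>
>     def createContext(): StreamingContext = {
>       val conf = new SparkConf().setAppName("kafka-consumer")
>       val ssc = new StreamingContext(conf, Seconds(10))
>       // point checkpointing at a directory in HDFS
>       ssc.checkpoint("hdfs:///checkpoints/kafka-consumer")
>       // ... create the Kafka stream and wire up the processing here ...
>       ssc
>     }
>
>     val ssc = StreamingContext.getOrCreate(
>       "hdfs:///checkpoints/kafka-consumer", createContext _)
>     ssc.start()
>     ssc.awaitTermination()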
>
> I can see that files get written to that directory; however, I don't see
> new Kafka content being processed.
>
> *Question 1.* Is it possible that the checkpointed consumer is off base
> in its understanding of where the offsets are on the topic, and how could
> I troubleshoot that?  Is it possible that some "confusion" happens if a
> consumer is switched back and forth between checkpointed and
> non-checkpointed modes? How could we tell?
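>
> (In case it helps with troubleshooting: if this is the direct API, I
> imagine we could log the offset ranges per batch, roughly like the sketch
> below, and compare them against the broker's view of the topic. 'stream'
> here stands for the DStream returned by KafkaUtils.createDirectStream.)
>
>     import org.apache.spark.streaming.kafka.HasOffsetRanges
>
>     stream.foreachRDD { rdd =>
>       // the direct stream's RDDs carry their Kafka offset ranges
>       val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>       ranges.foreach { r =>
>         println(s"topic=${r.topic} partition=${r.partition} " +
>           s"from=${r.fromOffset} until=${r.untilOffset}")
>       }
>     }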
>
> *Question 2.* About spark.streaming.receiver.writeAheadLog.enable. By
> default this is false. The docs say: "All the input data received through
> receivers will be saved to write ahead logs that will allow it to be
> recovered after driver failures."  So if we don't set this to true, what
> *will* get saved into the checkpoint, and what data *will* be recovered
> when the driver restarts?
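>
> (For reference, my understanding is that enabling it is just a conf
> setting, e.g.:
>
>     val conf = new SparkConf()
>       .setAppName("kafka-consumer")  // placeholder name
>       .set("spark.streaming.receiver.writeAheadLog.enable", "true")
>
> but I'd like to understand the default behavior first.)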
>
> *Question 3.* We want the RDDs to be treated as successfully processed
> only once we have done all the necessary transformations and actions on
> the data.  By default, will Spark Streaming checkpointing simply mark the
> topic offsets as processed once the data has been received by Spark, or
> only once the data has been successfully processed by the driver and the
> workers?  If the former, how can we configure checkpointing to do the
> latter?
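>
> In other words, what we'd like is roughly the sketch below, where
> process() and saveOffsets() are placeholders for our own processing and
> offset-persistence logic (and 'stream' is again a direct stream):
>
>     import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
>
>     var offsetRanges = Array.empty[OffsetRange]
>
>     stream.transform { rdd =>
>       // grab the offset ranges before any other transformation
>       offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
>       rdd
>     }.foreachRDD { rdd =>
>       process(rdd)              // all our transformations and actions
>       saveOffsets(offsetRanges) // record offsets only after success
>     }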
>
> Thanks,
> - Dmitry
>
