Hi,

I have a simple Spark Streaming application which reads data from Kafka and
then, after a transformation, sends the data to an HTTP endpoint (or another
Kafka topic; for this question let's consider HTTP). I am submitting jobs
using job-server <https://github.com/spark-jobserver/spark-jobserver>.

I am currently starting consumption from the source Kafka with
"auto.offset.reset"="smallest" and interval=3s. In the happy case everything
looks good. Here's an excerpt:

kafkaInputDStream.foreachRDD(rdd => {
  rdd.foreach(item => {
    // This will throw an exception if the HTTP endpoint isn't reachable
    httpProcessor.process(item._1, item._2)
  })
})
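
For context, the stream is set up roughly like this (a sketch only: the
broker and topic names are placeholders, and I'm showing the direct
receiver-less API here, with the real config elided):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf(), Seconds(3)) // interval=3s

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092", // placeholder broker
  "auto.offset.reset"    -> "smallest"      // start from the earliest offset
)

// Direct (receiver-less) stream; keys and values are plain strings
val kafkaInputDStream = KafkaUtils.createDirectStream[
    String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic")) // placeholder topic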

Since "auto.offset.reset"="smallest", this processes about 200K messages in
one job. If I stop http server mid-job (simulating some issue in POSTing)
and httpProcessor.process throws exception, that Job fails and whatever is
unprocessed is lost. I see it keeps on polling every 3 seconds after that.
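
On a related note, my understanding is that if the work were done per
partition, a failing POST would fail the whole task, and Spark would retry
that task up to spark.task.maxFailures times (4 by default) before failing
the job. Just a sketch of that restructuring, not what I currently run:

kafkaInputDStream.foreachRDD(rdd => {
  rdd.foreachPartition(partition => {
    // One task per Kafka partition. If process() throws, the whole task
    // fails and Spark retries it (up to spark.task.maxFailures) before
    // failing the batch's job; this is at-least-once, so a retried task
    // may re-POST items the endpoint has already seen.
    partition.foreach { case (key, value) =>
      httpProcessor.process(key, value)
    }
  })
})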

So my questions are:

   1. Is my assumption right that if the next 3-second job gets X messages
   and only Y can be processed before hitting an error, the remaining X-Y
   will not be processed?
   2. Is there a way to pause the stream/consumption from Kafka? For
   instance, during an intermittent network issue, most likely every message
   consumed in that window would be lost. I'm looking for something that
   keeps retrying (maybe with exponential backoff) and resumes consuming
   whenever the HTTP endpoint is UP again; see the sketch after this list.
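
The closest thing I've come up with for (2) is to block inside the task and
retry until the endpoint recovers, something like the sketch below
(postWithRetry and the delay values are made up, not code I'm running):

// Hypothetical helper: retry the POST with exponential backoff, blocking
// the Spark task until the endpoint is reachable again. While we block,
// the batch never completes, so later batches queue up instead of being
// consumed and lost, effectively pausing the stream.
def postWithRetry(key: String, value: String): Unit = {
  var delayMs = 1000L
  while (true) {
    try {
      httpProcessor.process(key, value)
      return // success, stop retrying
    } catch {
      case e: Exception =>
        Thread.sleep(delayMs)
        delayMs = math.min(delayMs * 2, 60000L) // back off, capped at 60s
    }
  }
}

I'd then call postWithRetry(item._1, item._2) instead of
httpProcessor.process in the foreach above. Is something like this the
recommended approach, or is there proper support for pausing the consumer?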

I see the failed job re-run after about an hour; can this be configured?
Here is what the Spark UI shows:
Failed Jobs (2)

  Job Id:    4253
  Description: Streaming job from [output operation 1, batch time 12:54:44]
               foreach at HTTPStream.scala:73
  Submitted: 2016/04/20 12:54:44
  Duration:  0.3 s
  Stages (Succeeded/Total): 0/1 (1 failed)
  Tasks, all stages (Succeeded/Total): 2/3 (20 failed)

  Job Id:    0
  Description: Streaming job from [output operation 1, batch time 11:43:51]
               foreach at HTTPStream.scala:73
  Submitted: 2016/04/20 11:43:51
  Duration:  3 s
  Stages (Succeeded/Total): 0/1 (1 failed)
  Tasks, all stages (Succeeded/Total): 0/3 (46 failed)


Thanks,

K
