Hi,

I am looking at a way to isolate the processing of messages from each Kafka
partition within the same driver.

Scenario: A DStream is created with the createDirectStream call by passing
in a few partitions. Let us say that the streaming context is defined to
have a time duration of 2 seconds. If the processing of messages from a
single partition takes more than 2 seconds (while all the others finish
much quicker), it seems that the next set of jobs get scheduled only after
the processing of that last partition. This means that the delay is
effective for all partitions and not just the partition that was truly the
cause of the delay. What I would like to do is to have the delay only
impact the 'slow' partition.

Tried to create one DStream per partition and then do a union of all
partitions, (similar to the sample in
http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
but that didn't seem to help.

Please suggest the correct approach to solve this issue.

Thanks,
Praveen.

Reply via email to