Re: Spark Streaming with Kafka: Dealing with 'slow' partitions

p pathiyil Thu, 11 Feb 2016 20:25:53 -0800

With this setting, I can see that the next job is being executed before the
previous one is finished. However, the processing of the 'hot' partition
eventually hogs all the concurrent jobs. If there were a way to restrict
jobs to one per partition, then this setting would provide the
per-partition isolation.
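
To make the hogging concrete, here is a toy model in plain Python, not Spark
itself. It assumes that batches arrive at a fixed interval, that every batch's
job is held up by the hot partition for a fixed time, and that at most
`concurrent_jobs` jobs run at once, started in batch order. The function name
and all numbers are illustrative assumptions, not Spark's scheduler.

```python
def simulate(batch_interval, hot_partition_time, concurrent_jobs, num_batches):
    """Return the scheduling delay (start time minus batch time) per batch.

    Toy model only: every job lasts exactly hot_partition_time because the
    hot partition dominates it, and jobs start in batch order.
    """
    # Time at which each of the concurrent job slots next becomes free.
    slot_free_at = [0] * concurrent_jobs
    delays = []
    for i in range(num_batches):
        batch_time = i * batch_interval
        # Pick the slot that frees up first (ties go to the lowest index).
        slot = min(range(concurrent_jobs), key=lambda s: slot_free_at[s])
        # A job cannot start before its batch exists or before a slot is free.
        start = max(batch_time, slot_free_at[slot])
        slot_free_at[slot] = start + hot_partition_time
        delays.append(start - batch_time)
    return delays


if __name__ == "__main__":
    # 2 s batches, hot partition needs 5 s, two concurrent job slots.
    print(simulate(2, 5, 2, 6))
```

In this model, with a 2-second interval, a 5-second hot partition and two
slots, every slot ends up busy with the hot partition and the delay grows by
one second every two batches. Three slots would keep up here, but only
because the model gives the hot partition a fixed cost per batch.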


Is there anything in the framework that would give control over that
aspect?

Thanks.


On Thu, Feb 11, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote:

> spark.streaming.concurrentJobs
>
>
> see e.g. 
> http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming
>
>
> On Thu, Feb 11, 2016 at 9:33 AM, p pathiyil <pathi...@gmail.com> wrote:
>
>> Thanks for the response Cody.
>>
>> The producers are out of my control, so I can't really balance the incoming
>> content across the various topics and partitions. The number of topics and
>> partitions is quite large and the volume across them is not very well known
>> ahead of time. So it is quite hard to segregate low and high volume topics
>> into separate driver programs.
>>
>> Will look at shuffle / repartition.
>>
>> Could you share the setting for starting another batch in parallel? It
>> might be ok to call the 'save' of the processed messages out of order if
>> that is the only consequence of this setting.
>>
>> When separate DStreams are created per partition (and if union() is not
>> called on them), what aspect of the framework still ties the scheduling of
>> jobs across the partitions together? Asking this to see if creating
>> multiple threads in the driver and calling createDirectStream per partition
>> in those threads can provide isolation.
>>
>>
>>
>> On Thu, Feb 11, 2016 at 8:14 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> The real way to fix this is by changing partitioning, so you don't have
>>> a hot partition.  It would be better to do this at the time you're
>>> producing messages, but you can also do it with a shuffle / repartition
>>> during consuming.
>>>
>>> There is a setting to allow another batch to start in parallel, but
>>> that's likely to have unintended consequences.
>>>
>>> On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil <pathi...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking at a way to isolate the processing of messages from each
>>>> Kafka partition within the same driver.
>>>>
>>>> Scenario: A DStream is created with the createDirectStream call by
>>>> passing in a few partitions. Let us say that the streaming context is
>>>> defined to have a time duration of 2 seconds. If the processing of messages
>>>> from a single partition takes more than 2 seconds (while all the others
>>>> finish much quicker), it seems that the next set of jobs gets scheduled only
>>>> after the processing of that last partition. This means that the delay is
>>>> effective for all partitions and not just the partition that was truly the
>>>> cause of the delay. What I would like to do is to have the delay only
>>>> impact the 'slow' partition.
>>>>
>>>> I tried to create one DStream per partition and then do a union of all
>>>> partitions (similar to the sample in
>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
>>>> but that didn't seem to help.
>>>>
>>>> Please suggest the correct approach to solve this issue.
>>>>
>>>> Thanks,
>>>> Praveen.
>>>>
>>>
>>>
>>
>
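
For reference, the effect of the shuffle / repartition suggestion quoted above
can be illustrated with a toy model in plain Python. It assumes that all tasks
of a batch run in parallel, that the cost per record is constant, and that the
shuffle itself is free (it is not in practice). The function names, record
counts and cost are hypothetical.

```python
def batch_time_ms(partition_sizes, per_record_cost_ms):
    """A batch finishes when its slowest task finishes."""
    return max(partition_sizes) * per_record_cost_ms


def repartitioned(partition_sizes, num_partitions):
    """Spread all records evenly over num_partitions, as a round-robin
    shuffle would (shuffle cost is ignored in this sketch)."""
    total = sum(partition_sizes)
    base, extra = divmod(total, num_partitions)
    # The first `extra` partitions take one leftover record each.
    return [base + (1 if i < extra else 0) for i in range(num_partitions)]


if __name__ == "__main__":
    sizes = [100, 100, 100, 1700]   # one hot Kafka partition (made-up counts)
    cost_ms = 2                     # assumed processing cost per record
    print(batch_time_ms(sizes, cost_ms))                    # hot partition dominates
    print(batch_time_ms(repartitioned(sizes, 4), cost_ms))  # after an even spread
```

With these made-up numbers the unbalanced batch takes 3400 ms, longer than the
2-second batch interval, while the evenly spread batch takes 1000 ms.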
