Have you tried using the fair scheduler and queues?

On 12 Feb 2016 4:24 a.m., "p pathiyil" <pathi...@gmail.com> wrote:
> With this setting, I can see that the next job is being executed before
> the previous one is finished. However, the processing of the 'hot'
> partition eventually hogs all the concurrent jobs. If there were a way to
> restrict jobs to one per partition, then this setting would provide the
> per-partition isolation.
>
> Is there anything in the framework which would give control over that
> aspect?
>
> Thanks.
>
> On Thu, Feb 11, 2016 at 9:55 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
>> spark.streaming.concurrentJobs
>>
>> see e.g.
>> http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming
>>
>> On Thu, Feb 11, 2016 at 9:33 AM, p pathiyil <pathi...@gmail.com> wrote:
>>
>>> Thanks for the response, Cody.
>>>
>>> The producers are out of my control, so I can't really balance the
>>> incoming content across the various topics and partitions. The number of
>>> topics and partitions is quite large, and the volume across them is not
>>> very well known ahead of time, so it is quite hard to segregate low- and
>>> high-volume topics into separate driver programs.
>>>
>>> Will look at shuffle / repartition.
>>>
>>> Could you share the setting for starting another batch in parallel? It
>>> might be OK to call the 'save' of the processed messages out of order if
>>> that is the only consequence of this setting.
>>>
>>> When separate DStreams are created per partition (and if union() is not
>>> called on them), what aspect of the framework still ties the scheduling of
>>> jobs across the partitions together? Asking this to see if creating
>>> multiple threads in the driver and calling createDirectStream per partition
>>> in those threads can provide isolation.
>>>
>>> On Thu, Feb 11, 2016 at 8:14 PM, Cody Koeninger <c...@koeninger.org> wrote:
>>>
>>>> The real way to fix this is by changing partitioning, so you don't have
>>>> a hot partition.
>>>> It would be better to do this at the time you're producing messages,
>>>> but you can also do it with a shuffle / repartition while consuming.
>>>>
>>>> There is a setting to allow another batch to start in parallel, but
>>>> that's likely to have unintended consequences.
>>>>
>>>> On Thu, Feb 11, 2016 at 7:59 AM, p pathiyil <pathi...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am looking for a way to isolate the processing of messages from each
>>>>> Kafka partition within the same driver.
>>>>>
>>>>> Scenario: A DStream is created with the createDirectStream call by
>>>>> passing in a few partitions. Say the streaming context is defined with
>>>>> a batch duration of 2 seconds. If the processing of messages from a
>>>>> single partition takes more than 2 seconds (while all the others
>>>>> finish much quicker), the next set of jobs gets scheduled only after
>>>>> the processing of that slow partition completes. This means that the
>>>>> delay affects all partitions, not just the partition that was truly
>>>>> the cause of the delay. What I would like is for the delay to impact
>>>>> only the 'slow' partition.
>>>>>
>>>>> I tried to create one DStream per partition and then do a union of all
>>>>> partitions (similar to the sample in
>>>>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#reducing-the-batch-processing-times),
>>>>> but that didn't seem to help.
>>>>>
>>>>> Please suggest the correct approach to solve this issue.
>>>>>
>>>>> Thanks,
>>>>> Praveen.
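Pulling the suggestions in this thread together, a minimal, untested sketch (Spark 1.x Scala API, as used in this thread; the app name, broker address, topic name, and partition count are placeholders, not from the thread):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// "spark.streaming.concurrentJobs" (default 1) lets jobs from the next
// batch start before the previous batch has finished; as noted above it
// is undocumented and output actions may then run out of order.
// "spark.scheduler.mode" = FAIR (with pools defined in fairscheduler.xml)
// is what the fair-scheduler suggestion at the top of the thread refers to.
val conf = new SparkConf()
  .setAppName("per-partition-isolation")
  .set("spark.streaming.concurrentJobs", "2")
  .set("spark.scheduler.mode", "FAIR")

val ssc = new StreamingContext(conf, Seconds(2))

val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("mytopic"))

// Cody's shuffle/repartition suggestion: spread the hot Kafka partition's
// records across more tasks before the expensive per-record work, at the
// cost of a shuffle.
stream.repartition(8).foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { case (_, value) =>
      // expensive per-message 'save' goes here
    }
  }
}

ssc.start()
ssc.awaitTermination()
```

With concurrentJobs greater than 1, a slow batch no longer blocks the next batch outright, but, as the Feb 12 message observes, nothing in this sketch limits how many of the concurrent job slots a single hot partition can occupy.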