Re: spark streaming doubt

Dibyendu Bhattacharya Tue, 19 May 2015 09:01:05 -0700

Just to add, there is a Receiver based Kafka consumer which uses Kafka Low
Level Consumer API.


http://spark-packages.org/package/dibbhatt/kafka-spark-consumer


Regards,
Dibyendu

On Tue, May 19, 2015 at 9:00 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

>
> On Tue, May 19, 2015 at 8:10 PM, Shushant Arora <shushantaror...@gmail.com
> > wrote:
>
>> So for Kafka+spark streaming, Receiver based streaming used highlevel api
>> and non receiver based streaming used low level api.
>>
>> 1.In high level receiver based streaming does it registers consumers at
>> each job start(whenever a new job is launched by streaming application say
>> at each second)?
>>
>
> -> Receiver based streaming will always have the receiver running
> parallel while your job is running, So by default for every 200ms
> (spark.streaming.blockInterval) the receiver will generate a block of data
> which is read from Kafka.
> 
>
>
>> 2.No of executors in highlevel receiver based jobs will always equal to
>> no of partitions in topic ?
>>
>
> -> Not sure from where did you came up with this. For the non stream
> based one, i think the number of partitions in spark will be equal to the
> number of kafka partitions for the given topic.
> 
>
>
>> 3.Will data from a single topic be consumed by executors in parllel or
>> only one receiver consumes in multiple threads and assign to executors in
>> high level receiver based approach ?
>>
>> -> They will consume the data parallel. For the receiver based
> approach, you can actually specify the number of receiver that you want to
> spawn for consuming the messages.
>
>>
>>
>>
>> On Tue, May 19, 2015 at 2:38 PM, Akhil Das <ak...@sigmoidanalytics.com>
>> wrote:
>>
>>> spark.streaming.concurrentJobs takes an integer value, not boolean. If
>>> you set it as 2 then 2 jobs will run parallel. Default value is 1 and the
>>> next job will start once it completes the current one.
>>>
>>>
>>>> Actually, in the current implementation of Spark Streaming and under
>>>> default configuration, only job is active (i.e. under execution) at any
>>>> point of time. So if one batch's processing takes longer than 10 seconds,
>>>> then then next batch's jobs will stay queued.
>>>> This can be changed with an experimental Spark property
>>>> "spark.streaming.concurrentJobs" which is by default set to 1. Its not
>>>> currently documented (maybe I should add it).
>>>> The reason it is set to 1 is that concurrent jobs can potentially lead
>>>> to weird sharing of resources and which can make it hard to debug the
>>>> whether there is sufficient resources in the system to process the ingested
>>>> data fast enough. With only 1 job running at a time, it is easy to see that
>>>> if batch processing time < batch interval, then the system will be stable.
>>>> Granted that this may not be the most efficient use of resources under
>>>> certain conditions. We definitely hope to improve this in the future.
>>>
>>>
>>> Copied from TD's answer written in SO
>>> <http://stackoverflow.com/questions/23528006/how-jobs-are-assigned-to-executors-in-spark-streaming>
>>> .
>>>
>>> Non-receiver based streaming for example you can say are the fileStream,
>>> directStream ones. You can read a bit of information from here
>>> https://spark.apache.org/docs/1.3.1/streaming-kafka-integration.html
>>>
>>> Thanks
>>> Best Regards
>>>
>>> On Tue, May 19, 2015 at 2:13 PM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> Thanks Akhil.
>>>> When I don't  set spark.streaming.concurrentJobs to true. Will the all
>>>> pending jobs starts one by one after 1 jobs completes,or it does not
>>>> creates jobs which could not be started at its desired interval.
>>>>
>>>> And Whats the difference and usage of Receiver vs non-receiver based
>>>> streaming. Is there any documentation for that?
>>>>
>>>> On Tue, May 19, 2015 at 1:35 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> It will be a single job running at a time by default (you can also
>>>>> configure the spark.streaming.concurrentJobs to run jobs parallel which is
>>>>> not recommended to put in production).
>>>>>
>>>>> Now, your batch duration being 1 sec and processing time being 2
>>>>> minutes, if you are using a receiver based streaming then ideally those
>>>>> receivers will keep on receiving data while the job is running (which will
>>>>> accumulate in memory if you set StorageLevel as MEMORY_ONLY and end up in
>>>>> block not found exceptions as spark drops some blocks which are yet to
>>>>> process to accumulate new blocks). If you are using a non-receiver based
>>>>> approach, you will not have this problem of dropping blocks.
>>>>>
>>>>> Ideally, if your data is small and you have enough memory to hold your
>>>>> data then it will run smoothly without any issues.
>>>>>
>>>>> Thanks
>>>>> Best Regards
>>>>>
>>>>> On Tue, May 19, 2015 at 1:23 PM, Shushant Arora <
>>>>> shushantaror...@gmail.com> wrote:
>>>>>
>>>>>> What happnes if in a streaming application one job is not yet
>>>>>> finished and stream interval reaches. Does it starts next job or wait for
>>>>>> first to finish and rest jobs will keep on accumulating in queue.
>>>>>>
>>>>>>
>>>>>> Say I have a streaming application with stream interval of 1 sec, but
>>>>>> my job takes 2 min to process 1 sec stream , what will happen ?  At any
>>>>>> time there will be only one job running or multiple ?
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: spark streaming doubt

Reply via email to