In the receiver-based approach, if the receiver crashes for any reason
(the receiver itself or its executor), the receiver should get restarted on
another executor and start reading data from the offset stored in
ZooKeeper. There is some chance of data loss, which can be alleviated using
Write Ahead Logs (see the streaming programming guide for more details, or see
my talk [Slides PDF
<http://www.slideshare.net/SparkSummit/recipes-for-running-spark-streaming-apploications-in-production-tathagata-daspptx>
, Video
<https://www.youtube.com/watch?v=d5UJonrruHk&list=PL-x35fyliRwgfhffEpywn4q23ykotgQJ6&index=4>
] from Spark Summit 2015). But that approach can still produce duplicate
records. The direct approach gives exactly-once guarantees, so you should
try it out.
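
For reference, write-ahead logging for receivers comes down to a single
configuration flag (a sketch; checkpointing must also be enabled on the
StreamingContext for the log to be recoverable):

```
# spark-defaults.conf (Spark 1.2+)
spark.streaming.receiver.writeAheadLog.enable  true
```

With this set, received data is written to the checkpoint directory before
being acknowledged, trading some throughput for durability.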

TD

On Fri, Jun 26, 2015 at 5:46 PM, Cody Koeninger <c...@koeninger.org> wrote:

> Read the Spark Streaming guide and the Kafka integration guide for a better
> understanding of how the receiver-based stream works.
>
> Capacity planning is specific to your environment and what the job is
> actually doing; you'll need to determine it empirically.
>
>
> On Friday, June 26, 2015, Shushant Arora <shushantaror...@gmail.com>
> wrote:
>
>> In 1.2, how should offsets be managed after the streaming application
>> starts, in each job? Should I commit offsets manually after job completion?
>>
>> And what is the recommended number of consumer threads? Say I have 300
>> partitions in the Kafka cluster. Load is ~1 million events per second; each
>> event is ~500 bytes. Are 5 receivers with 60 partitions each sufficient for
>> Spark Streaming to consume?
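
As a rough back-of-envelope check on the numbers in that question (an
editorial sketch, not from the thread):

```python
# Figures from the question above: ~1M events/s, ~500 bytes/event,
# 300 Kafka partitions spread over 5 receivers.
events_per_sec = 1_000_000
bytes_per_event = 500
receivers = 5
partitions = 300

total_mb_per_sec = events_per_sec * bytes_per_event / 1e6   # aggregate ingest rate, MB/s
mb_per_receiver = total_mb_per_sec / receivers              # load on each receiver, MB/s
partitions_per_receiver = partitions // receivers           # Kafka partitions per receiver

print(total_mb_per_sec, mb_per_receiver, partitions_per_receiver)
```

That works out to roughly 100 MB/s through each receiver, which is a lot for
a single receiver thread; consistent with the advice above to determine
capacity empirically.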
>>
>> On Fri, Jun 26, 2015 at 8:40 PM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> The receiver-based kafka createStream in spark 1.2 uses zookeeper to
>>> store offsets.  If you want finer-grained control over offsets, you can
>>> update the values in zookeeper yourself before starting the job.
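
For context on doing that by hand: the Kafka 0.8 high-level consumer keeps
offsets under `/consumers/<group>/offsets/<topic>/<partition>` in ZooKeeper.
A minimal helper for building those paths (illustrative; the function name is
mine, and the actual write would use a ZooKeeper client such as kazoo's
`client.set(path, value)`):

```python
def zk_offset_path(group: str, topic: str, partition: int) -> str:
    """Path where the Kafka 0.8 high-level consumer stores a partition's offset."""
    return f"/consumers/{group}/offsets/{topic}/{partition}"

# e.g. for consumer group "mygroup", topic "events", partition 0:
path = zk_offset_path("mygroup", "events", 0)
print(path)  # -> /consumers/mygroup/offsets/events/0
```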
>>>
>>> createDirectStream in Spark 1.3 is still marked as experimental, and
>>> subject to change.  That said, it works better for me in production
>>> than the receiver-based API.
>>>
>>> On Fri, Jun 26, 2015 at 6:43 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> I am using spark streaming 1.2.
>>>>
>>>> If the processing executors crash, will the receiver reset the offset
>>>> back to the last processed offset?
>>>>
>>>> If the receiver itself crashed, is there a way to reset the offset
>>>> without restarting the streaming application, other than to smallest or
>>>> largest?
>>>>
>>>>
>>>> Is Spark Streaming 1.3, which uses the low-level consumer API, stable?
>>>> And which is recommended for handling data loss, 1.2 or 1.3?
