As I said in the original ticket, I think the implementation classes should
be exposed so that people can subclass and override compute() to suit their
needs.

Just adding a function from Time => Set[TopicAndPartition] wouldn't be
sufficient for some of my current production use cases.

compute() isn't really a function from Time => Option[KafkaRDD]; it's a
function from (Time, current offsets, Kafka metadata, etc.) =>
Option[KafkaRDD].

I think it's more straightforward to give access to that additional state
via subclassing than it is to add in more callbacks for every possible use
case.
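
To make that concrete, here's a rough Scala sketch of why the extra state
matters. Time, TopicAndPartition, KafkaRDD, and the offset bookkeeping below
are toy stand-ins, not the real Spark/Kafka classes, and the capped subclass
is just one hypothetical example of a production need:

    // Toy stand-ins so the sketch is self-contained.
    case class Time(milliseconds: Long)
    case class TopicAndPartition(topic: String, partition: Int)
    case class KafkaRDD(offsetRanges: Map[TopicAndPartition, (Long, Long)])

    class DirectStreamSketch(var currentOffsets: Map[TopicAndPartition, Long]) {
      // Stand-in for the broker metadata fetch the real stream does each batch.
      def latestLeaderOffsets(): Map[TopicAndPartition, Long] =
        currentOffsets.map { case (tp, o) => tp -> (o + 100L) }

      // compute() sees the batch time *and* the stream's offset state.
      def compute(validTime: Time): Option[KafkaRDD] = {
        val until = latestLeaderOffsets()
        val rdd = KafkaRDD(currentOffsets.map { case (tp, from) =>
          tp -> (from, until(tp))
        })
        currentOffsets = until
        Some(rdd)
      }
    }

    // A subclass can use that state directly, e.g. to cap per-batch size.
    class CappedStreamSketch(
        offsets: Map[TopicAndPartition, Long],
        maxPerBatch: Long) extends DirectStreamSketch(offsets) {
      override def compute(validTime: Time): Option[KafkaRDD] = {
        val until = latestLeaderOffsets().map { case (tp, hi) =>
          tp -> math.min(hi, currentOffsets(tp) + maxPerBatch)
        }
        val rdd = KafkaRDD(currentOffsets.map { case (tp, from) =>
          tp -> (from, until(tp))
        })
        currentOffsets = until
        Some(rdd)
      }
    }

A Time => Set[TopicAndPartition] callback alone can't express the capping
example, because it never sees currentOffsets or the broker metadata.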




On Wed, Apr 1, 2015 at 2:01 PM, Tathagata Das <t...@databricks.com> wrote:

> We should be able to support that use case in the direct API. It may be as
> simple as allowing users to pass in a function that returns the set of
> topic+partitions to read from.
> That is, a function (Time) => Set[TopicAndPartition]. This gets called every
> batch interval before the offsets are decided. This would allow users to
> add topics, delete topics, and modify partitions on the fly.
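>
> In code, the hook would look roughly like this (a sketch only, not an
> actual API; the topic name is a placeholder):
>
>   val topicsForBatch: Time => Set[TopicAndPartition] = { batchTime =>
>     // e.g. consult ZooKeeper or a config service here
>     Set(TopicAndPartition("events", 0), TopicAndPartition("events", 1))
>   }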
>
> What do you think Cody?
>
>
>
>
> On Wed, Apr 1, 2015 at 11:57 AM, Neelesh <neele...@gmail.com> wrote:
>
>> Thanks Cody!
>>
>> On Wed, Apr 1, 2015 at 11:21 AM, Cody Koeninger <c...@koeninger.org>
>> wrote:
>>
>>> If you want to change topics from batch to batch, you can always just
>>> create a KafkaRDD repeatedly.
>>>
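>>> For example, something along these lines per batch (Spark 1.3 batch API;
>>> the topic name, offsets, and broker list are placeholders):
>>>
>>>   import kafka.serializer.StringDecoder
>>>   import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}
>>>
>>>   val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
>>>   val fromOffset = 0L     // placeholder offsets
>>>   val untilOffset = 100L
>>>   val ranges = Array(OffsetRange("events", 0, fromOffset, untilOffset))
>>>   // sc is an existing SparkContext
>>>   val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
>>>     sc, kafkaParams, ranges)
>>>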
>>> The streaming code as it stands assumes a consistent set of topics,
>>> though.  The implementation is private, so you can't subclass it without
>>> building your own Spark.
>>>
>>> On Wed, Apr 1, 2015 at 1:09 PM, Neelesh <neele...@gmail.com> wrote:
>>>
>>>> Thanks Cody, that was really helpful.  I have a much better
>>>> understanding now. One last question: Kafka topics are initialized once
>>>> in the driver; is there an easy way of adding/removing topics on the
>>>> fly? KafkaRDD#getPartitions() seems to be computed only once, with no
>>>> way of refreshing it.
>>>>
>>>> Thanks again!
>>>>
>>>> On Wed, Apr 1, 2015 at 10:01 AM, Cody Koeninger <c...@koeninger.org>
>>>> wrote:
>>>>
>>>>> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
>>>>>
>>>>> The Kafka consumers run in the executors.
>>>>>
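>>>>> Roughly, with the direct stream (Spark 1.3 API; kafkaParams and the
>>>>> topic name are placeholders):
>>>>>
>>>>>   import kafka.serializer.StringDecoder
>>>>>   import org.apache.spark.streaming.kafka.KafkaUtils
>>>>>
>>>>>   // Runs on the driver: decides each batch's offset ranges.
>>>>>   // ssc is an existing StreamingContext.
>>>>>   val stream = KafkaUtils.createDirectStream[
>>>>>     String, String, StringDecoder, StringDecoder](
>>>>>     ssc, kafkaParams, Set("events"))
>>>>>
>>>>>   // Runs on the executors: each partition fetches its own offset range.
>>>>>   stream.map { case (key, value) => value.length }
>>>>>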
>>>>> On Wed, Apr 1, 2015 at 11:18 AM, Neelesh <neele...@gmail.com> wrote:
>>>>>
>>>>>> With receivers, it was pretty obvious which code ran where - each
>>>>>> receiver occupied a core and ran on the workers. However, with the new
>>>>>> Kafka direct input streams, it's hard for me to understand where the
>>>>>> code that's reading from the Kafka brokers runs. Does it run on the
>>>>>> driver (I hope not), or does it run on the workers?
>>>>>>
>>>>>> Any help appreciated
>>>>>> thanks!
>>>>>> -neelesh
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
