Please don't change the behavior of DirectKafkaInputDStream.
Returning an empty RDD is (imho) the semantically correct thing to do, and
some existing jobs depend on that behavior.

If it's really an issue for you, you can either override
DirectKafkaInputDStream, or just check isEmpty as the first thing you do
with the RDD (before any transformations).

In any recent version of Spark, isEmpty on a KafkaRDD is a driver-side-only
operation that is basically free.
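
A minimal sketch of that driver-side check, assuming the usual direct-stream setup; `stream` and `process` here are hypothetical placeholders, not names from the thread:

```scala
// Sketch: `stream` is assumed to be a DStream obtained from
// KafkaUtils.createDirectStream; `process` stands in for your job logic.
stream.foreachRDD { rdd =>
  // For a KafkaRDD, isEmpty is answered from the batch's offset ranges
  // on the driver -- no Spark job is launched, so the check is cheap.
  if (!rdd.isEmpty()) {
    process(rdd) // run transformations/actions only for non-empty batches
  }
}
```

The key point is to call isEmpty on the KafkaRDD itself, before any transformations; once the RDD has been transformed, the check is no longer a free driver-side operation.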


On Thu, Feb 11, 2016 at 3:19 PM, Sebastian Piu <sebastian....@gmail.com>
wrote:

> Yes, and as far as I recall it also has partitions (empty) which screws up
> the isEmpty call if the rdd has been transformed down the line. I will have
> a look tomorrow at the office and see if I can collaborate
> On 11 Feb 2016 9:14 p.m., "Shixiong(Ryan) Zhu" <shixi...@databricks.com>
> wrote:
>
>> Yeah, DirectKafkaInputDStream always returns an RDD even if it's empty.
>> Feel free to send a PR to improve it.
>>
>> On Thu, Feb 11, 2016 at 1:09 PM, Sebastian Piu <sebastian....@gmail.com>
>> wrote:
>>
>>> I'm using the Kafka direct stream api but I can have a look at extending
>>> it to have this behaviour
>>>
>>> Thanks!
>>> On 11 Feb 2016 9:07 p.m., "Shixiong(Ryan) Zhu" <shixi...@databricks.com>
>>> wrote:
>>>
>>>> Are you using a custom input dstream? If so, you can make the `compute`
>>>> method return None to skip a batch.
>>>>
>>>> On Thu, Feb 11, 2016 at 1:03 PM, Sebastian Piu <sebastian....@gmail.com
>>>> > wrote:
>>>>
>>>>> I was wondering if there is any way to skip batches with zero
>>>>> events when streaming?
>>>>> By skip I mean avoid the empty RDD from being created at all?
>>>>>
>>>>
>>>>
>>
