Re: Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Cody Koeninger
Do you have a minimal example of how to reproduce the problem, that
doesn't depend on Cassandra?

On Mon, Sep 26, 2016 at 4:10 PM, Erwan ALLAIN wrote:
> Hi
>
> I'm working with
> - Kafka 0.8.2
> - Spark Streaming (2.0) direct input stream.
> - Cassandra 3.0
>
> My batch interval is 1s.
>
> When I use only map, filter, or even saveToCassandra, the processing
> time is around 50 ms on empty batches.
>  => This is fine.
>
> As soon as I add reduceByKey, the processing time rises quickly to
> between 3 and 4 s for 3 calls of reduceByKey on empty batches.
> => Not good.
>
> I've found a workaround: use foreachRDD on the DStream and check whether
> the RDD is empty before executing the reduceByKey, but I find this quite ugly.
>
> Do I need to check whether the RDD is empty before every shuffle operation?
>
> Thanks for your insights.




Slow Shuffle Operation on Empty Batch

2016-09-26 Thread Erwan ALLAIN
Hi

I'm working with
- Kafka 0.8.2
- Spark Streaming (2.0) direct input stream.
- Cassandra 3.0

My batch interval is 1s.

When I use only map, filter, or even saveToCassandra, the processing
time is around 50 ms on empty batches.
 => This is fine.

As soon as I add reduceByKey, the processing time rises quickly to
between 3 and 4 s for 3 calls of reduceByKey on empty batches.
=> Not good.

I've found a workaround: use foreachRDD on the DStream and check whether
the RDD is empty before executing the reduceByKey, but I find this quite ugly.
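
For illustration, that workaround might look roughly like the sketch
below. It is only a minimal sketch, not the actual job: the object
EmptyBatchGuard, the helper reduceNonEmptyBatches and the write callback
are hypothetical names, and the callback stands in for whatever output
step (such as the Cassandra save) follows the reduce.

```scala
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object EmptyBatchGuard {

  // Hypothetical helper: runs reduceByKey only on batches that contain data.
  // `write` is an assumed placeholder for the real output step.
  def reduceNonEmptyBatches[K: ClassTag, V: ClassTag](
      stream: DStream[(K, V)])(reduce: (V, V) => V)(
      write: RDD[(K, V)] => Unit): Unit =
    stream.foreachRDD { rdd =>
      // Empty batches skip the shuffle and the write entirely.
      if (!rdd.isEmpty()) {
        write(rdd.reduceByKey(reduce))
      }
    }
}
```

Assuming pairs is a DStream[(String, Int)], it could be called as
EmptyBatchGuard.reduceNonEmptyBatches(pairs)(_ + _)(rdd => rdd.foreach(println)).
Note that isEmpty may itself launch a small job on RDDs derived through
map or filter, so the guard is not entirely free.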

Do I need to check whether the RDD is empty before every shuffle operation?

Thanks for your insights.