I’m seeing *huge* execution times on a moderate-sized dataset during an
RDD.isEmpty call. Everything else in the computation is fast; the only slow
phase is RDD.isEmpty. I’m using Spark 1.5.1, and from my research I would
expect this check to be, at worst, linearly proportional to the number of
partitions, which should take a trivial amount of time, yet it takes minutes
to hours to complete this single phase.

I know there has been some discussion about using this, so I would love to
hear the current thinking on the subject. Is there a better way to find out
whether an RDD has data? Can someone explain why this is happening?
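For context, my understanding (from the PR below) is that isEmpty boils down
to take(1), which launches a job and may recompute the RDD's entire lineage.
Here is a minimal Scala sketch of the workaround I'm considering, assuming
the cost comes from that recomputation rather than from the check itself;
the input path and pipeline are placeholders, not my actual job:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.storage.StorageLevel

  object IsEmptyCheck {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("isEmpty-demo"))

      // Placeholder pipeline standing in for the real computation.
      val rdd = sc.textFile("hdfs:///some/input")
        .filter(_.nonEmpty)

      // isEmpty (via take(1)) launches a job that can recompute the whole
      // lineage up to this point. Persisting first means the expensive work
      // is paid once, and the emptiness check itself becomes cheap.
      rdd.persist(StorageLevel.MEMORY_AND_DISK)

      // Roughly what isEmpty does on 1.5.x: scan partitions until a record
      // is found. On a persisted RDD this should return almost immediately.
      val empty = rdd.take(1).isEmpty
      println(s"RDD is empty: $empty")

      sc.stop()
    }
  }

If lineage recomputation is the culprit, materializing the RDD once and
reusing it should make the subsequent check near-instant; if it is still
slow after caching, then something else is going on that I don't understand.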

Reference PR: https://github.com/apache/spark/pull/4534
