I’m getting *huge* execution times on a moderate-sized dataset during RDD.isEmpty. Everything else in the calculation is fast; only the RDD.isEmpty step is slow. I’m using Spark 1.5.1, and from my research I would expect this operation to be, at worst, linearly proportional to the number of partitions, which should take a trivial amount of time, yet this single phase is taking many minutes to hours to complete.
I know there has been a small amount of discussion about using this, so I would love to hear what the current thinking on the subject is. Is there a better way to check whether an RDD has data? Can someone explain why this is happening? Reference PR: https://github.com/apache/spark/pull/4534
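For reference, here is a minimal sketch of what I suspect is going on and the workaround I am considering. This is only my reading of the 1.5.x source, where RDD.isEmpty boils down to a take(1), and the pipeline below is a hypothetical stand-in for my real calculation:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IsEmptyCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("isEmpty-sketch").setMaster("local[*]"))

    // Hypothetical expensive pipeline standing in for the real calculation.
    val expensive = sc.parallelize(1 to 1000000).filter(_ % 2 == 0)

    // In Spark 1.5.x, RDD.isEmpty is essentially:
    //   partitions.length == 0 || take(1).length == 0
    // take(1) launches a job, and for an uncached RDD that job re-runs the
    // RDD's whole lineage — which may be where my time is going, rather than
    // in the per-partition scan itself.
    val cached = expensive.cache()
    cached.count()                        // materialize the lineage once
    val hasData = cached.take(1).nonEmpty // now a cheap check on cached data
    println(s"hasData = $hasData")

    sc.stop()
  }
}
```

If the cost really is lineage recomputation rather than isEmpty itself, caching (or checkpointing) before the emptiness check should make the take(1) cheap, but I would appreciate confirmation that this is the right mental model.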