Hello,

I have a piece of code that forces the materialization of RDDs in my Spark
Streaming program, and I'm trying to understand which of the two actions
below is faster and uses less memory:

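  import org.apache.spark.api.java.JavaRDD;
  import org.apache.spark.api.java.function.Function;

  javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
      // Option 1: force evaluation with collect()
      // stringJavaRDD.collect();

      // Option 2: or force evaluation with count()?
      // stringJavaRDD.count();

      return null;
    }
  });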


I've checked the source code of Spark at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala,
and see that collect() is defined as:

  def collect(): Array[T] = {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }

and count() is defined as:

  def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

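Since collect() turns every partition into an array and concatenates all of
them on the driver, while count() only returns one number per partition, I
think calling count() is faster and uses less memory, but I wanted to be sure.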
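
To make that reasoning concrete, here is a minimal sketch in plain Java with no
Spark dependency. The class and method names (CollectVsCountSketch,
collectLike, countLike) are hypothetical, and a list of lists stands in for the
partitions of an RDD. It only mirrors the shape of the two definitions quoted
above and is not how Spark itself is implemented:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the two RDD actions; no Spark classes are used.
final class CollectVsCountSketch {

    // Mirrors collect(): each partition is buffered in full, then all
    // buffers are concatenated into one result held by the caller.
    static List<String> collectLike(List<List<String>> partitions) {
        List<String> all = new ArrayList<String>();
        for (List<String> partition : partitions) {
            List<String> buffered = new ArrayList<String>();
            Iterator<String> it = partition.iterator();
            while (it.hasNext()) {
                buffered.add(it.next()); // every element is retained
            }
            all.addAll(buffered);
        }
        return all;
    }

    // Mirrors count(): each partition is iterated once, only a counter is
    // kept, and the per-partition sizes are summed.
    static long countLike(List<List<String>> partitions) {
        long total = 0L;
        for (List<String> partition : partitions) {
            long size = 0L;
            Iterator<String> it = partition.iterator();
            while (it.hasNext()) {
                it.next(); // element is visited, then discarded
                size++;
            }
            total += size;
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b"),
                Arrays.asList("c"));
        System.out.println(collectLike(partitions)); // prints [a, b, c]
        System.out.println(countLike(partitions));   // prints 3
    }
}
```

In this sketch the memory held by collectLike grows with the number of
elements, while countLike keeps only two long values regardless of input size.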

Would anyone care to comment?


-- 
Emre Sevinç
http://www.bigindustries.be/
