Hello, I have a piece of code that forces the materialization of RDDs in my Spark Streaming program, and I'm trying to understand which method is faster and uses less memory:
    javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
      @Override
      public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
        // stringJavaRDD.collect(); // or count?
        // stringJavaRDD.count();
        return null;
      }
    });

I've checked the Spark source code at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala and see that collect() is defined as:

    def collect(): Array[T] = {
      val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
      Array.concat(results: _*)
    }

and count() is defined as:

    def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Since collect() copies every element of every partition back to the driver and concatenates them into one array, while count() only returns a single Long per partition and sums them, I think calling count() is both faster and consumes far less memory, but I wanted to be sure. Anyone care to comment?

--
Emre Sevinç
http://www.bigindustries.be/
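P.S. To make the difference concrete to myself, I sketched the two strategies with plain Java collections standing in for RDD partitions (this is only an illustration, not Spark code; the class and method names are made up):

    import java.util.*;
    import java.util.stream.*;

    public class CollectVsCount {
        // Simulated "partitions" of an RDD: plain in-memory lists.
        static List<List<String>> partitions = Arrays.asList(
            Arrays.asList("a", "b"), Arrays.asList("c", "d", "e"));

        // collect()-style: every element of every partition is copied into
        // one collection on the "driver", so memory grows with the data size.
        static List<String> collectAll() {
            return partitions.stream()
                             .flatMap(List::stream)
                             .collect(Collectors.toList());
        }

        // count()-style: each partition contributes only its size (one long),
        // and the "driver" just sums those longs, regardless of data size.
        static long countAll() {
            return partitions.stream().mapToLong(List::size).sum();
        }

        public static void main(String[] args) {
            System.out.println("collect() materializes " + collectAll().size()
                + " elements; count() transfers only " + partitions.size()
                + " longs and returns " + countAll());
        }
    }

So the driver-side cost of collect() scales with the number of elements, while count() scales with the number of partitions, which is why I lean toward count() for forcing materialization.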