Note that I’m assuming you were after the size of your RDD, meaning that in the
‘collect’ alternative you would call size() right after the collect().

If you were simply trying to materialize your RDD, Sean’s answer is more 


>> The short answer:
>> count(), as the sum can be partially aggregated on the mappers.
>> The long answer:
>>> I have a piece of code to force the materialization of RDDs in my Spark
>>> Streaming program, and I'm trying to understand which method is faster and
>>> has less memory consumption:
>>>   javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>>>       @Override
>>>       public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>>>         //stringJavaRDD.collect();
>>>         // or count?
>>>         //stringJavaRDD.count();
>>>         return null;
>>>       }
>>>     });
>>> I've checked the source code of Spark at
>>> and see that collect() is defined as:
>>>   def collect(): Array[T] = {
>>>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>>>     Array.concat(results: _*)
>>>   }
>>> and count() defined as:
>>>   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
>>> Therefore I think calling the count() method is faster and/or consumes
>>> less memory, but I wanted to be sure.
>>> Would anyone care to comment?
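
To make the difference concrete, here is a rough sketch in plain Java, with no Spark involved. It only mimics the shape of the two definitions quoted above: the nested lists stand in for partitions, and the class and method names (CountVsCollect, countStyle, collectStyle) are made up for illustration and are not part of any Spark API. The count-style path hands back one long per partition, while the collect-style path copies every element into a single list before asking for its size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CountVsCollect {

    // Hypothetical stand-in for count(): each inner list plays the role of one partition.
    // Only a single long per partition is carried back to the caller.
    static long countStyle(List<List<String>> partitions) {
        long total = 0;
        for (List<String> partition : partitions) {
            long partial = 0;
            for (String ignored : partition) {
                partial++; // walk the partition without keeping its elements
            }
            total += partial; // sum the per-partition results
        }
        return total;
    }

    // Hypothetical stand-in for collect().size(): every element of every partition
    // is copied into one list before the size is read.
    static int collectStyle(List<List<String>> partitions) {
        List<String> all = new ArrayList<>();
        for (List<String> partition : partitions) {
            all.addAll(partition);
        }
        return all.size();
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d"),
                Arrays.asList("e", "f"));

        System.out.println(countStyle(partitions));   // prints 6
        System.out.println(collectStyle(partitions)); // prints 6
    }
}
```

Both paths give the same number; what differs is how much data has to be held in one place to get it, which is the reason the short answer above favours count().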
