Those do quite different things: count() merely counts the elements on the executors, while collect() copies all of the data back to the driver.
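To see why that matters for memory, here is a plain-Java sketch (no Spark needed; the class and method names are illustrative, not Spark's). Counting an iterator, as Spark's `Utils.getIteratorSize` does, touches each element without retaining it, while collect-style materialization buffers every element before returning:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

public class CountVsCollect {
    // Analogous to Spark's Utils.getIteratorSize: walk the iterator
    // and count, retaining nothing -> O(1) extra memory.
    static long countOnly(Iterator<Integer> it) {
        long n = 0;
        while (it.hasNext()) {
            it.next();
            n++;
        }
        return n;
    }

    // Analogous to collect(): buffer every element before returning
    // -> O(n) extra memory.
    static List<Integer> buffer(Iterator<Integer> it) {
        List<Integer> out = new ArrayList<>();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(countOnly(IntStream.range(0, 1_000_000).iterator()));
        System.out.println(buffer(IntStream.range(0, 1_000_000).iterator()).size());
    }
}
```

Both print 1000000, but only the second holds a million boxed integers in memory at once; with Spark, collect() additionally ships those elements over the network to the driver.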
The fastest way to materialize an RDD that I know of is
foreachPartition(i => None) (or the equivalent no-op VoidFunction in Java).

On Thu, Feb 26, 2015 at 1:28 PM, Emre Sevinc <emre.sev...@gmail.com> wrote:
> Hello,
>
> I have a piece of code to force the materialization of RDDs in my Spark
> Streaming program, and I'm trying to understand which method is faster and
> has less memory consumption:
>
> javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
>   @Override
>   public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
>
>     //stringJavaRDD.collect();
>
>     // or count?
>
>     //stringJavaRDD.count();
>
>     return null;
>   }
> });
>
> I've checked the source code of Spark at
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala,
> and see that collect() is defined as:
>
>   def collect(): Array[T] = {
>     val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
>     Array.concat(results: _*)
>   }
>
> and count() is defined as:
>
>   def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
>
> Therefore I think calling count() is faster and/or consumes less
> memory, but I wanted to be sure.
>
> Anyone care to comment?
>
> --
> Emre Sevinç
> http://www.bigindustries.be/
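Translated to the Java API used in the question, the suggested no-op might look like the sketch below (assuming `org.apache.spark.api.java.function.VoidFunction` from Spark's Java API of that era; this is an illustration, not tested against a cluster). It runs a job over every partition without counting, buffering, or shipping any element to the driver:

```java
javaDStream.foreachRDD(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> stringJavaRDD) throws Exception {
        // Forces each partition to be computed on the executors;
        // nothing is aggregated or returned to the driver.
        stringJavaRDD.foreachPartition(
            new VoidFunction<java.util.Iterator<String>>() {
                @Override
                public void call(java.util.Iterator<String> it) {
                    // no-op: triggering the job is the point
                }
            });
        return null;
    }
});
```

Compared with count(), this skips even the per-partition size computation and the driver-side sum; compared with collect(), it avoids moving any data at all.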