Ah, interesting. count() without distinct is streaming and does not require that a single partition fit in memory, for instance. That said, the behavior may change if you increase the number of partitions in your input RDD with RDD.repartition().
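The difference can be sketched in plain Scala, without Spark (a hypothetical illustration, not Spark's actual implementation): a streaming count folds over the elements one at a time in constant memory, whereas distinct has to materialize the set of unique values seen so far, which is what can blow up on a huge partition.

```scala
// Sketch: why counting can stream but distinct-counting needs memory
// proportional to the number of unique values.
object CountVsDistinct {
  // Streaming count: O(1) memory, only one element in flight at a time.
  def streamingCount(it: Iterator[Int]): Long = {
    var n = 0L
    it.foreach(_ => n += 1)
    n
  }

  // Distinct count: must hold every unique value seen so far.
  def distinctCount(it: Iterator[Int]): Int = {
    val seen = scala.collection.mutable.HashSet.empty[Int]
    it.foreach(seen += _)
    seen.size
  }
}
```

With many duplicates the set stays small, but on a dataset with a huge number of unique keys per partition it is the set, not the count, that exhausts the heap.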
On Sun, Mar 23, 2014 at 11:47 AM, Kane <kane.ist...@gmail.com> wrote:
> Yes, there was an error in data, after fixing it - count fails with Out of
> Memory Error.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.