Ah, interesting. count() without distinct is streaming and does not require that a single partition fit in memory, for instance. That said, the behavior may change if you increase the number of partitions in your input RDD with RDD.repartition().
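The difference can be sketched in plain Scala, without Spark (a hypothetical illustration, not Spark's actual implementation): a streaming count folds over the elements one at a time in constant memory, whereas distinct has to materialize the set of unique values seen so far, which is what can blow up on a huge partition.

```scala
// Sketch: why counting can stream but distinct-counting needs memory
// proportional to the number of unique values.
object CountVsDistinct {
  // Streaming count: O(1) memory, only one element in flight at a time.
  def streamingCount(it: Iterator[Int]): Long = {
    var n = 0L
    it.foreach(_ => n += 1)
    n
  }

  // Distinct count: must hold every unique value seen so far.
  def distinctCount(it: Iterator[Int]): Int = {
    val seen = scala.collection.mutable.HashSet.empty[Int]
    it.foreach(seen += _)
    seen.size
  }
}
```

With many duplicates the set stays small, but on a dataset with a huge number of unique keys per partition it is the set, not the count, that exhausts the heap.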
On Sun, Mar 23, 2014 at 11:47 AM, Kane <kane.ist...@gmail.com> wrote:
> Yes, there was an error in data, after fixing it - count fails with Out of
> Memory Error.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3051.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.