This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045
You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM if your reduce partitions are too large).

On Sat, Mar 22, 2014 at 10:00 AM, Kane <kane.ist...@gmail.com> wrote:
> I mean everything works with the small file. With huge file only count and
> map work, distinct - doesn't
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
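For reference, one way to set that property is on the SparkConf before creating the context (a minimal sketch; the app name is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Disable shuffle spilling to work around the suspected
// ExternalAppendOnlyMap bug. Caution: with spilling off, large
// reduce partitions must fit in memory or the job will OOM.
val conf = new SparkConf()
  .setAppName("distinct-test") // placeholder app name
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)
```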