This could be related to the hash collision bug in ExternalAppendOnlyMap in 0.9.0: https://spark-project.atlassian.net/browse/SPARK-1045
You might try setting spark.shuffle.spill to false and see if that runs any longer (turning off shuffle spill is dangerous, though, as it may cause Spark to OOM if your reduce partitions are too large).

On Sat, Mar 22, 2014 at 10:00 AM, Kane <kane.ist...@gmail.com> wrote:
> I mean everything works with the small file. With huge file only count and
> map work, distinct - doesn't
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/distinct-on-huge-dataset-tp3025p3034.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
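For reference, one way to set that property is on the SparkConf before creating the context (a minimal sketch; the app name is just a placeholder):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Disable shuffle spilling to work around the suspected
// ExternalAppendOnlyMap bug. Caution: with spilling off, large
// reduce partitions must fit in memory or the job will OOM.
val conf = new SparkConf()
  .setAppName("distinct-test") // placeholder app name
  .set("spark.shuffle.spill", "false")
val sc = new SparkContext(conf)
```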