Don't call collect() on the full RDD: it pulls every element back into the driver's memory, which is why the heap overflows even though the distributed job itself completes. The pipeline in your message succeeds because take(20) only ships 20 results to the driver. If all you need is the number of elements, use count(), which is computed on the executors and returns a single Long.
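
A minimal sketch of the difference, reusing the rdd from your session:

  scala> rdd.count          // computed on the executors; only a single
                            // Long is returned to the driver

  scala> rdd.collect.size   // builds an Array of every element on the
                            // driver first, so the whole dataset must
                            // fit in the driver heap

If you genuinely need the full contents on the driver, you can raise the driver heap (e.g. spark-shell --driver-memory 8g), but for a simple size check count() is the right tool.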

On Tue, Apr 19, 2022 at 5:34 AM wilson <i...@bigcount.xyz> wrote:

> Hello,
>
> Do you know why, for a big dataset, a general RDD job can complete but
> collect() fails with a memory overflow?
>
> For instance, on a dataset with xxx million items, this runs fine:
>
>   scala> rdd.map { x => x.split(",") }
>            .map { x => (x(5).toString, x(6).toDouble) }
>            .groupByKey
>            .mapValues(x => x.sum / x.size)
>            .sortBy(-_._2)
>            .take(20)
>
>
> But at the final stage I issued this command and got:
>
> scala> rdd.collect.size
> 22/04/19 18:26:52 ERROR Executor: Exception in task 13.0 in stage 44.0
> (TID 349)
> java.lang.OutOfMemoryError: Java heap space
>
>
> Thank you.
> wilson
>
