@Brad, I'm guessing the additional memory usage comes from the shuffle that
repartition performs (repartition(n) is just coalesce(n, shuffle = true) under
the hood), which would at least explain the memory blowup.
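
To make that concrete, here's a rough sketch (untested; the jsonFile / textFile
calls and the paths are my guesses at how you're loading the data):

    // repartition(n) is coalesce(n, shuffle = true), so going from 60 to 256
    // partitions shuffles the whole ~200GB dataset.
    val data = sqlContext.jsonFile("hdfs:///path/to/json")   // hypothetical path
    val wide = data.repartition(256)                          // full shuffle

    // One way to sidestep the shuffle: ask for more partitions when the files
    // are first read, then infer the schema from the already-split RDD.
    val raw       = sc.textFile("hdfs:///path/to/json", 256)  // minPartitions hint
    val noShuffle = sqlContext.jsonRDD(raw)

That only helps if the input is splittable, of course (plain or bzip2-compressed
JSON lines are; gzip is not).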

On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> You can try:
>
> - Using KryoSerializer
> - Enabling RDD Compression
> - Setting storage type to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
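>
> Something along these lines (a sketch against the 1.2 config/API names;
> adapt to however you build your SparkConf and load the data):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>     import org.apache.spark.storage.StorageLevel
>
>     val conf = new SparkConf()
>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>       .set("spark.rdd.compress", "true")   // compress serialized partitions
>     val sc = new SparkContext(conf)
>
>     // placeholder for however the dataset is actually loaded
>     val rdd = sc.textFile("hdfs:///path/to/json")
>
>     // keep partitions serialized instead of as deserialized Java objects
>     rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)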
>
>
> Thanks
> Best Regards
>
> On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard <bradwill...@gmail.com>
> wrote:
>
>> I have a 10-node cluster with 600GB of RAM. I'm loading a fairly large
>> dataset from JSON files; when loaded it is about 200GB, but it only creates
>> 60 partitions. I'm trying to repartition to 256 to increase CPU
>> utilization, but when I do, memory usage balloons to well over 2x the
>> initial size, killing nodes with out-of-memory failures.
>>
>>
>> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
>>
>> Is this a bug? How can I work around it?
>>
>> Thanks!
>>
>>
>>
>>
>>
>