@Brad, I'm guessing the additional memory usage is coming from the shuffle performed by coalesce (repartition is just coalesce with shuffle = true), so that at least explains the memory blowup.
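If the extra memory really is the shuffle, one thing worth trying is to skip the post-load repartition entirely and ask for more splits at read time. Rough sketch below (Scala, Spark 1.x APIs; the path and the 256 target are placeholders for your setup, not taken from your mail):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("json-load"))
    val sqlContext = new SQLContext(sc)

    // repartition(256) is coalesce(256, shuffle = true), so going from 60 to
    // 256 partitions shuffles the whole ~200gb dataset -- that's where the
    // extra memory goes.
    val shuffled = sqlContext.jsonFile("hdfs:///path/to/json").repartition(256)

    // Alternative: request more partitions when reading the raw text, then
    // parse the JSON from that RDD, so no shuffle is needed to reach 256.
    val raw = sc.textFile("hdfs:///path/to/json", minPartitions = 256)
    val df  = sqlContext.jsonRDD(raw)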
On Sun, Jan 4, 2015 at 10:16 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

> You can try:
>
> - Using KryoSerializer
> - Enabling RDD Compression
> - Setting storage type to MEMORY_ONLY_SER or MEMORY_AND_DISK_SER
>
> Thanks
> Best Regards
>
> On Sun, Jan 4, 2015 at 11:53 PM, Brad Willard <bradwill...@gmail.com> wrote:
>
>> I have a 10 node cluster with 600gb of ram. I'm loading a fairly large
>> dataset from json files. When I load the dataset it is about 200gb,
>> however it only creates 60 partitions. I'm trying to repartition to 256
>> to increase cpu utilization, however when I do that it balloons in memory
>> to way over 2x the initial size, killing nodes from memory failures.
>>
>> https://s3.amazonaws.com/f.cl.ly/items/3k2n2n3j35273i2v1Y3t/Screen%20Shot%202015-01-04%20at%201.20.29%20PM.png
>>
>> Is this a bug? How can I work around this?
>>
>> Thanks!
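For reference, Akhil's suggestions above would look roughly like this (sketch only, Spark 1.x property names, reusing the df from the sketch earlier in this mail):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel

    // This SparkConf would replace the bare one in the earlier sketch.
    val conf = new SparkConf()
      .setAppName("json-load")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.rdd.compress", "true")   // compress serialized RDD blocks

    // Cache in serialized form; MEMORY_AND_DISK_SER spills partitions that
    // don't fit to disk instead of failing when executors run out of memory.
    df.persist(StorageLevel.MEMORY_AND_DISK_SER)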