Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Kane Kim
I'm trying to process a large dataset. Mapping/filtering works fine, but
as soon as I try to reduceByKey, I get out-of-memory errors:

http://pastebin.com/70M5d0Bn
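
Roughly, the job looks like this (a simplified sketch; the paths and field
names are placeholders, not the real code):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("large-reduce"))

  val pairs = sc.textFile("hdfs:///data/large-input")   // placeholder path
    .map(line => (line.split("\t")(0), 1L))             // key on the first field
    .filter { case (k, _) => k.nonEmpty }               // this part runs fine

  val counts = pairs.reduceByKey(_ + _)                 // heap space error in this stage
  counts.saveAsTextFile("hdfs:///data/output")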

Any ideas how I can fix that?

Thanks.


Re: Large dataset, reduceByKey - java heap space error

2015-01-22 Thread Sean McNamara
Hi Kane-

http://spark.apache.org/docs/latest/tuning.html has excellent information that
may be helpful. In particular, increasing the number of tasks may help, as well
as confirming that you don't have more data than you're expecting landing on a
single key.
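
For example, something along these lines (a sketch, assuming a (key, count)
pair RDD like the one in your mail; "pairs" is a placeholder name):

  // Spread the shuffle across more, smaller reduce tasks by passing an
  // explicit partition count to reduceByKey:
  val counts = pairs.reduceByKey(_ + _, 2000)

  // Sanity-check key skew: if a few keys dominate, one reduce task can
  // blow the heap no matter how many partitions you use.
  val perKey = pairs.mapValues(_ => 1L).reduceByKey(_ + _, 2000)
  perKey.map(_.swap).top(20).foreach(println)   // 20 heaviest keys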

Also, if you are using a Spark version earlier than 1.2.0, setting
spark.shuffle.manager=sort was a huge help for many of our shuffle-heavy
workloads (this is the default in 1.2.0 now).
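
If that applies, the setting can go on the SparkConf or on the spark-submit
command line, e.g. (sketch; the app name is a placeholder):

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("large-reduce")
    .set("spark.shuffle.manager", "sort")   // already the default from 1.2.0 on
  val sc = new SparkContext(conf)

or:

  spark-submit --conf spark.shuffle.manager=sort ...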

Cheers,

Sean


On Jan 22, 2015, at 3:15 PM, Kane Kim <kane.ist...@gmail.com> wrote:

I'm trying to process a large dataset. Mapping/filtering works fine, but
as soon as I try to reduceByKey, I get out-of-memory errors:

http://pastebin.com/70M5d0Bn

Any ideas how I can fix that?

Thanks.
