from:"Nieyuan"

Re: Advantage of using cache()

2014-08-22 Thread Nieyuan

Because map-reduce tasks like join save their shuffle data to disk, the only difference between the caching and non-caching versions is: .map { case (x, (n, i)) => (x, n) } - Thanks, Nieyuan -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Advantage-of-using
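
For readers unfamiliar with the pattern-matching form of that map step, here is a minimal sketch on a plain Scala collection rather than an RDD. The object name, the helper keepFirst, and the sample data are hypothetical and only stand in for the output of a join, which has the shape (key, (n, i)).

```scala
object DropJoinedValue {
  // Hypothetical sample standing in for the output of a join: (key, (n, i)).
  val joined: Seq[(String, (Int, Int))] = Seq(("a", (1, 10)), ("b", (2, 20)))

  // Keep the key and the first joined value, dropping the second one.
  // Note the `=>` arrow: a plain `=` inside a `case` clause does not compile.
  def keepFirst[K, N, I](pairs: Seq[(K, (N, I))]): Seq[(K, N)] =
    pairs.map { case (x, (n, _)) => (x, n) }

  def main(args: Array[String]): Unit =
    println(keepFirst(joined))
}
```

The same closure can be passed to the map of a pair RDD; a local Seq is used here only so the sketch runs without a Spark context.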

Re: AppMaster OOME on YARN

2014-08-21 Thread Nieyuan

1. At the beginning of a reduce task, the master delivers the map output info to every executor. You can check stderr to find the size of the map output info. It should look like: spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is xxx bytes
2. Every executor will process 10+ TB / 2000 = 5 GB of data.
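
The per-executor figure in point 2 can be checked with a quick calculation. This is only a sketch: it assumes decimal units (1 TB = 10^12 bytes), takes the lower bound of exactly 10 TB, and reads the divisor 2000 as the executor count, as the message does. The object and method names are hypothetical.

```scala
object PerExecutorShare {
  // Assumption: decimal units, 1 TB = 10^12 bytes and 1 GB = 10^9 bytes.
  val BytesPerTB: Long = 1000L * 1000L * 1000L * 1000L
  val BytesPerGB: Long = 1000L * 1000L * 1000L

  // Even split of the total input across all executors, in bytes.
  def perExecutorBytes(totalBytes: Long, executors: Long): Long =
    totalBytes / executors

  def main(args: Array[String]): Unit = {
    val total = 10L * BytesPerTB            // lower bound of "10+ TB"
    val share = perExecutorBytes(total, 2000L)
    println(s"${share / BytesPerGB} GB per executor")
  }
}
```

Since the input is described as "10+ TB", the real share per executor is at least this value.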