Map-reduce operations such as join save their shuffle data to disk anyway, so the
only difference between the caching and non-caching versions is:
.map { case (x, (n, i)) => (x, n) }
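
To make the shape of that step concrete, here is a minimal sketch on plain Scala collections rather than RDDs. It assumes the joined records look like (key, (left, right)), which is what a pair join produces; the object and method names (JoinProjection, keepLeft) and the sample data are made up for illustration and are not from the original job.

```scala
object JoinProjection {
  // Assumed record shape: (key, (leftValue, rightValue)), as a pair join would produce.
  // The projection keeps the key and the left value and drops the right value.
  def keepLeft[K, N, I](joined: Seq[(K, (N, I))]): Seq[(K, N)] =
    joined.map { case (x, (n, _)) => (x, n) }

  def main(args: Array[String]): Unit = {
    // Hypothetical sample data standing in for the join output.
    val joined = Seq(("a", (1, "p")), ("b", (2, "q")))
    println(keepLeft(joined))
  }
}
```

The same pattern match works unchanged inside an RDD .map call.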
-
Thanks,
Nieyuan
--
1. At the beginning of the reduce task, the master delivers the map output info to every
executor. You can check stderr to find the size of the map output info. It should
look like:
spark.MapOutputTrackerMaster: Size of output statuses for shuffle 0 is xxx bytes
2. Every executor will process 10+ TB / 2000 = about 5 GB of data.
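
The arithmetic in point 2 can be checked with a rough sketch like the one below. It assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and treats 2000 as the number of equal shares the data is split into; ShuffleSizing, bytesPerShare and the constants are illustrative names, not Spark APIs.

```scala
object ShuffleSizing {
  // Assumption: decimal units, 1 TB = 10^12 bytes and 1 GB = 10^9 bytes.
  val BytesPerTB: Long = 1000L * 1000L * 1000L * 1000L
  val BytesPerGB: Long = 1000L * 1000L * 1000L

  // Evenly divides the total input across the given number of shares.
  // Real shuffles are rarely perfectly even, so this is only a ballpark figure.
  def bytesPerShare(totalBytes: Long, shares: Int): Long = totalBytes / shares

  def main(args: Array[String]): Unit = {
    val perShare = bytesPerShare(10L * BytesPerTB, 2000)
    println(s"${perShare / BytesPerGB} GB per share")
  }
}
```

Since the input is "10+" TB and data is usually skewed, the real per-executor figure may well be higher than this even split suggests.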