Can anyone point me to a good primer on how Spark decides where to send which task, how it distributes tasks, and how it determines data locality?
I'm running a pretty simple job: a foreach over cached data, accumulating some (relatively complex) values. I'm seeing several inconsistencies I don't understand:

(1) If I run it a couple of times as separate applications (i.e., reloading, re-caching, etc.), I get a different cached percentage each time. I have about 5x as much memory as I need overall, so it isn't running out. But on one run 100% of the data will be cached; the next, 83%; the next, 92%; and so on.

(2) The data is also very unevenly distributed. I have 400 partitions and 4 workers (with, I believe, 3x replication), and on my last run the distribution across workers was 165/139/25/71. Is there any way to get Spark to distribute the tasks more evenly?

(3) If I run the job several times in the same application (to take advantage of caching, etc.), I get very inconsistent timings. On my latest try I got:

- 1st run: 3.1 minutes
- 2nd run: 2 seconds
- 3rd run: 8 minutes
- 4th run: 2 seconds
- 5th run: 2 seconds
- 6th run: 6.9 minutes
- 7th run: 2 seconds
- 8th run: 2 seconds
- 9th run: 3.9 minutes
- 10th run: 8 seconds

I understand the difference on the first run; that's when the caching happened. On later runs, when the job finishes in 2 seconds, all the tasks ran at locality level PROCESS_LOCAL; when it takes longer, the last 10-20% of the tasks end up at locality level ANY. Why would that change between two consecutive runs of the exact same job on cached data?

Any help or pointers would be much appreciated.

Thanks,
-Nathan

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600, Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
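For context, the job has roughly this shape (a simplified sketch, not my actual code — the path, partition count, and accumulated value are placeholders, and it needs a running Spark cluster to execute):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setAppName("locality-test")
val sc = new SparkContext(conf)

// Load with ~400 partitions and cache with replication.
// Note that the built-in "_2" storage levels replicate each cached
// partition on at most 2 executors, not 3.
val data = sc.textFile("hdfs:///path/to/data", 400)
  .persist(StorageLevel.MEMORY_ONLY_2)

// Accumulate a (relatively complex) value with a foreach over the cached data.
val total = sc.accumulator(0L)
data.foreach { line => total += line.length }
println(total.value)
```

One thing I've wondered about for question (2) is whether an explicit `data.repartition(400)` before the persist would spread the cached blocks more evenly, but I haven't confirmed that.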
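For question (3), one thing I plan to experiment with (based on the configuration docs — I haven't verified it helps in my case) is the locality-wait settings, which control how long the scheduler waits for a data-local slot before launching a task at a worse locality level:

```
# spark-defaults.conf -- values here are illustrative, not recommendations.
# How long to wait for a process-local slot before falling back to
# node-local, then rack-local, then ANY. Newer Spark accepts time
# suffixes like "10s"; older versions take milliseconds.
spark.locality.wait           10s

# The wait can also be tuned per locality level:
# spark.locality.wait.process
# spark.locality.wait.node
# spark.locality.wait.rack
```

My (unconfirmed) reading is that a longer wait trades scheduling delay for fewer ANY-level tasks, which might explain why identical runs flip between 2 seconds and several minutes.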