Spark's scheduling is pretty simple: it allocates tasks to free cores on executors, preferring executors where the data is local. It even performs "delay scheduling", which means waiting a bit to see whether an executor that holds the data locally frees up before falling back to a less-local one.
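If you want to experiment with that behaviour, the wait is controlled by the spark.locality.wait settings. Here's a rough, untested sketch of raising it (the app name and values are just illustrative, and in Spark 1.x the values are interpreted as milliseconds):

    import org.apache.spark.{SparkConf, SparkContext}

    // Raise the delay-scheduling waits so the scheduler holds out longer for a
    // PROCESS_LOCAL or NODE_LOCAL slot before falling back to RACK_LOCAL/ANY.
    val conf = new SparkConf()
      .setAppName("locality-wait-sketch")        // placeholder name
      .set("spark.locality.wait", "10000")       // default is 3000 (3 seconds)
      .set("spark.locality.wait.node", "10000")  // wait before dropping from NODE_LOCAL
      .set("spark.locality.wait.rack", "10000")  // wait before dropping from RACK_LOCAL
    val sc = new SparkContext(conf)

Setting spark.locality.wait to 0 disables delay scheduling entirely, which trades locality for lower scheduling latency.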
Are your tasks seeing very skewed execution times? If some tasks are taking a very long time and using all the resources on a node, perhaps the other nodes are quickly finishing many tasks and overpopulating their caches. If no machine were overpopulating its cache, and there were no failures, then you should see 100% cached after the first run.

It's also strange that running totally uncached takes 3.1 minutes, but running 80-90% cached can take 8 minutes. Does your workload produce nondeterministic variance in task times? Was it a single straggler or many tasks that kept the job from finishing? It's not too uncommon to see occasional performance regressions while caching due to GC, though 2 seconds to 8 minutes is a bit extreme.
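If it helps narrow down (1) and (3) below, one option is to cache with a storage level that spills to disk and to check the cached fraction programmatically rather than eyeballing the UI. A rough, untested sketch (assumes the usual sc from spark-shell and Spark 1.x; the RDD here is just a toy stand-in for your real cached data, with 400 partitions as in (2)):

    import org.apache.spark.storage.StorageLevel

    // Toy stand-in for the real cached data; 400 partitions as in the thread.
    val data = sc.parallelize(1 to 1000000, 400).setName("data")

    // MEMORY_AND_DISK writes partitions that don't fit in memory (or get evicted)
    // to local disk instead of dropping them, so later runs don't silently fall
    // back to recomputation with locality ANY.
    val cached = data.persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()  // materialize the cache before timing the real job

    // Same numbers as the web UI's Storage tab, but scriptable:
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions} " +
        s"partitions cached, mem=${info.memSize} bytes, disk=${info.diskSize} bytes")
    }

For the uneven 165/139/25/71 spread in (2), repartition()-ing before the persist forces a shuffle and usually spreads the blocks more evenly across the executors, at the cost of that one extra shuffle.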
On Wed, Nov 12, 2014 at 9:01 PM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:

> Sorry, I think I was not clear in what I meant.
> I didn't mean it went down within a run, with the same instance.
>
> I meant I'd run the whole app, and one time it would cache 100%, and the
> next run it might cache only 83%.
>
> Within a run, it doesn't change.
>
> On Wed, Nov 12, 2014 at 11:31 PM, Aaron Davidson <ilike...@gmail.com> wrote:
>
>> The fact that the caching percentage went down is highly suspicious. It
>> should generally not decrease unless other cached data took its place, or
>> unless executors were dying. Do you know if either of these was the case?
>>
>> On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
>>
>>> Can anyone point me to a good primer on how Spark decides where to send
>>> what task, how it distributes them, and how it determines data locality?
>>>
>>> I'm trying a pretty simple task - it's doing a foreach over cached data,
>>> accumulating some (relatively complex) values.
>>>
>>> So I see several inconsistencies I don't understand:
>>>
>>> (1) If I run it a couple of times, as separate applications (i.e.,
>>> reloading, recaching, etc.), I get a different percentage cached each time.
>>> I've got about 5x as much memory as I need overall, so it isn't running
>>> out. But one time 100% of the data will be cached; the next, 83%; the
>>> next, 92%; etc.
>>>
>>> (2) Also, the data is very unevenly distributed. I've got 400
>>> partitions and 4 workers (with, I believe, 3x replication), and on my last
>>> run my distribution was 165/139/25/71. Is there any way to get Spark to
>>> distribute the tasks more evenly?
>>>
>>> (3) If I run the problem several times in the same execution (to take
>>> advantage of caching etc.), I get very inconsistent results. On my latest
>>> try, I got:
>>>
>>>    - 1st run: 3.1 min
>>>    - 2nd run: 2 seconds
>>>    - 3rd run: 8 minutes
>>>    - 4th run: 2 seconds
>>>    - 5th run: 2 seconds
>>>    - 6th run: 6.9 minutes
>>>    - 7th run: 2 seconds
>>>    - 8th run: 2 seconds
>>>    - 9th run: 3.9 minutes
>>>    - 10th run: 8 seconds
>>>
>>> I understand the difference for the first run; it was caching that
>>> time. On the later runs, when it manages to finish in 2 seconds, it's because
>>> all the tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the
>>> tasks end up with locality level ANY. Why would that change when running
>>> the exact same task twice in a row on cached data?
>>>
>>> Any help or pointers that I could get would be much appreciated.
>>>
>>> Thanks,
>>> -Nathan
>>>
>>> --
>>> Nathan Kronenfeld
>>> Senior Visualization Developer
>>> Oculus Info Inc
>>> 2 Berkeley Street, Suite 600,
>>> Toronto, Ontario M5A 4J5
>>> Phone: +1-416-203-3003 x 238
>>> Email: nkronenf...@oculusinfo.com
>>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone: +1-416-203-3003 x 238
> Email: nkronenf...@oculusinfo.com