Sorry, I think I wasn't clear about what I meant. I didn't mean the caching percentage went down within a run, on the same instance.
I meant that I'd run the whole app, and one time it would cache 100%, and the next run it might cache only 83%. Within a run, it doesn't change.

On Wed, Nov 12, 2014 at 11:31 PM, Aaron Davidson <ilike...@gmail.com> wrote:

> The fact that the caching percentage went down is highly suspicious. It
> should generally not decrease unless other cached data took its place, or
> unless executors were dying. Do you know if either of these was the case?
>
> On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
> nkronenf...@oculusinfo.com> wrote:
>
>> Can anyone point me to a good primer on how Spark decides where to send
>> what task, how it distributes them, and how it determines data locality?
>>
>> I'm trying a pretty simple task: it does a foreach over cached data,
>> accumulating some (relatively complex) values.
>>
>> I see several inconsistencies I don't understand:
>>
>> (1) If I run it a couple of times as separate applications (i.e.,
>> reloading, recaching, etc.), I get a different percentage cached each
>> time. I've got about 5x as much memory as I need overall, so it isn't
>> running out. But one time 100% of the data will be cached; the next,
>> 83%; the next, 92%; and so on.
>>
>> (2) Also, the data is very unevenly distributed. I've got 400 partitions
>> and 4 workers (with, I believe, 3x replication), and on my last run the
>> distribution was 165/139/25/71. Is there any way to get Spark to
>> distribute the tasks more evenly?
>>
>> (3) If I run the job several times in the same execution (to take
>> advantage of caching, etc.), I get very inconsistent results. On my
>> latest try, I got:
>>
>> - 1st run: 3.1 minutes
>> - 2nd run: 2 seconds
>> - 3rd run: 8 minutes
>> - 4th run: 2 seconds
>> - 5th run: 2 seconds
>> - 6th run: 6.9 minutes
>> - 7th run: 2 seconds
>> - 8th run: 2 seconds
>> - 9th run: 3.9 minutes
>> - 10th run: 8 seconds
>>
>> I understand the difference for the first run; it was caching that time.
>> On later runs, when it finishes in 2 seconds, it's because all the tasks
>> were PROCESS_LOCAL; when it takes longer, the last 10-20% of the tasks
>> end up with locality level ANY. Why would that change when running the
>> exact same task twice in a row on cached data?
>>
>> Any help or pointers would be much appreciated.
>>
>> Thanks,
>> -Nathan

--
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone: +1-416-203-3003 x 238
Email: nkronenf...@oculusinfo.com
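
For concreteness, a minimal sketch of the kind of job described in the quoted message: a foreach over a cached RDD that feeds an accumulator. The input path, the 400-partition count, and the trivial length-summing logic are illustrative assumptions, not the actual application:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CachedForeachSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("CachedForeachSketch"))

        // Hypothetical input; the real job caches ~400 partitions of data.
        val data = sc.textFile("hdfs:///some/input")
          .repartition(400)
          .persist(StorageLevel.MEMORY_ONLY)
        data.count() // materialize the cache before timing anything

        // Accumulate a (here deliberately trivial) value with foreach.
        val total = sc.accumulator(0L)
        data.foreach(line => total += line.length.toLong)
        println("accumulated: " + total.value)

        sc.stop()
      }
    }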
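
On points (1) and (2), the per-executor cache numbers behind a distribution like 165/139/25/71 can be read off from the driver. A sketch, assuming the Spark 1.x developer API sc.getExecutorStorageStatus (the same figures appear on the web UI's Storage tab):

    // Print how much cached data each executor's block manager holds.
    sc.getExecutorStorageStatus.foreach { status =>
      println(status.blockManagerId.host + ": " + status.memUsed + " bytes in memory")
    }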
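
On point (3), the PROCESS_LOCAL-versus-ANY flip is shaped by the scheduler's locality wait: Spark waits a bounded time for a free slot on the executor holding a cached partition before relaxing to a less local level, so whether the last tasks run locally can vary from run to run. A minimal sketch of raising that wait, assuming the Spark 1.x setting spark.locality.wait (the value is illustrative, not a recommendation):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("MyApp") // hypothetical app name
      // How long (ms) to wait for a more local slot before falling back
      // to the next locality level; the Spark 1.x default is 3000.
      .set("spark.locality.wait", "10000")
    val sc = new SparkContext(conf)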