The fact that the cached percentage went down is highly suspicious. It
should generally not decrease unless other cached data took its place or
executors were dying. Do you know if either of these was the case?
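
Separately, on the ANY-locality tasks in (3): one setting that may be worth
experimenting with is spark.locality.wait, which controls how long the
scheduler waits for a slot at a better locality level before falling back
to a less local one. The fragment below is only a sketch for
conf/spark-defaults.conf. The value shown is an arbitrary assumption, not a
recommendation, and it is written in milliseconds as Spark 1.x releases
expect (newer releases also accept a suffix such as 10s).

```
# Hypothetical value: wait up to 10 seconds for a more local slot.
# The default is 3 seconds (3000 ms).
spark.locality.wait    10000
```

A longer wait can keep more tasks PROCESS_LOCAL at the cost of idle slots,
so it is a trade-off to measure rather than a guaranteed fix.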

On Tue, Nov 11, 2014 at 8:58 AM, Nathan Kronenfeld <
nkronenf...@oculusinfo.com> wrote:

> Can anyone point me to a good primer on how spark decides where to send
> what task, how it distributes them, and how it determines data locality?
>
> I'm trying a pretty simple task - it's doing a foreach over cached data,
> accumulating some (relatively complex) values.
>
> So I see several inconsistencies I don't understand:
>
> (1) If I run it a couple times, as separate applications (i.e., reloading,
> recaching, etc), I will get different %'s cached each time.  I've got about
> 5x as much memory as I need overall, so it isn't running out.  But one
> time, 100% of the data will be cached; the next, 83%, the next, 92%, etc.
>
> (2) Also, the data is very unevenly distributed. I've got 400 partitions,
> and 4 workers (with, I believe, 3x replication), and on my last run, my
> distribution was 165/139/25/71.  Is there any way to get spark to
> distribute the tasks more evenly?
>
> (3) If I run the problem several times in the same execution (to take
> advantage of caching etc.), I get very inconsistent results.  My latest
> try, I get:
>
>    - 1st run: 3.1 min
>    - 2nd run: 2 seconds
>    - 3rd run: 8 minutes
>    - 4th run: 2 seconds
>    - 5th run: 2 seconds
>    - 6th run: 6.9 minutes
>    - 7th run: 2 seconds
>    - 8th run: 2 seconds
>    - 9th run: 3.9 minutes
>    - 10th run: 8 seconds
>
> I understand the difference for the first run; it was caching that time.
> Later times, when it manages to work in 2 seconds, it's because all the
> tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the
> tasks end up with locality level ANY.  Why would that change when running
> the exact same task twice in a row on cached data?
>
> Any help or pointers that I could get would be much appreciated.
>
>
> Thanks,
>
>                  -Nathan
>
>
>
> --
> Nathan Kronenfeld
> Senior Visualization Developer
> Oculus Info Inc
> 2 Berkeley Street, Suite 600,
> Toronto, Ontario M5A 4J5
> Phone:  +1-416-203-3003 x 238
> Email:  nkronenf...@oculusinfo.com
>
