Can anyone point me to a good primer on how Spark decides where to send
each task, how it distributes tasks across workers, and how it determines
data locality?

I'm running a pretty simple job: a foreach over cached data, accumulating
some (relatively complex) values.
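
To give a feel for its shape, here's a stripped-down sketch (the real
accumulated value is much more complex than this running total, and the
names and path here are made up):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("locality-test"))

    // Load into 400 partitions and cache; count() forces materialization.
    val data = sc.textFile("hdfs:///some/path", 400).cache()
    data.count()

    // The actual job: a foreach over the cached data, accumulating a value.
    val acc = sc.accumulator(0L)
    data.foreach(record => acc += record.length.toLong)
    println(acc.value)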

So I see several inconsistencies I don't understand:

(1) If I run it a couple of times as separate applications (i.e.,
reloading, re-caching, etc.), I get a different percentage cached each
time.  I've got about 5x as much memory as I need overall, so it isn't
running out.  But one run, 100% of the data will be cached; the next, 83%;
the next, 92%; etc.
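
In case I'm misreading the numbers, this is roughly how I'm checking the
cached fraction (beyond the Storage tab in the web UI), using the same sc
as in the sketch above:

    // Print the cached fraction of each persisted RDD; this should match
    // what the Storage tab in the web UI reports.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"${info.name}: ${info.numCachedPartitions}/${info.numPartitions}" +
              s" partitions cached, ${info.memSize} bytes in memory")
    }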

(2) Also, the cached data is very unevenly distributed.  I've got 400
partitions and 4 workers (with, I believe, 3x replication), and on my last
run the distribution across the workers was 165/139/25/71.  Is there any
way to get Spark to distribute the tasks more evenly?
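
For reference, those numbers come from the executors/storage pages in the
UI; I believe something like this shows the same per-worker block counts
programmatically (a sketch, same sc as above):

    // How many cached blocks (and how much memory) each executor holds.
    sc.getExecutorStorageStatus.foreach { status =>
      println(s"${status.blockManagerId.host}: ${status.blocks.size} blocks," +
              s" ${status.memUsed} bytes used")
    }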

(3) If I run the problem several times in the same execution (to take
advantage of caching etc.), I get very inconsistent results.  My latest
try, I get:

   - 1st run: 3.1 min
   - 2nd run: 2 seconds
   - 3rd run: 8 minutes
   - 4th run: 2 seconds
   - 5th run: 2 seconds
   - 6th run: 6.9 minutes
   - 7th run: 2 seconds
   - 8th run: 2 seconds
   - 9th run: 3.9 minutes
   - 10th run: 8 seconds

I understand the difference for the first run; that's when the caching
happened.  On later runs, when it finishes in 2 seconds, it's because all
the tasks were PROCESS_LOCAL; when it takes longer, the last 10-20% of the
tasks end up at locality level ANY.  Why would that change between two
consecutive runs of the exact same task on the same cached data?
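
For what it's worth, I haven't touched any of the locality settings, so
everything is at its default.  My (unverified) assumption is that raising
spark.locality.wait would make the scheduler hold out longer for
PROCESS_LOCAL slots before falling back to ANY, something like:

    import org.apache.spark.SparkConf

    // Untested guess: wait longer before the scheduler downgrades a task's
    // locality level (the default is 3000 ms).
    val conf = new SparkConf()
      .setAppName("locality-test")
      .set("spark.locality.wait", "10000")  // milliseconds

but I'd rather understand why the behavior changes from run to run than
just paper over it.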

Any help or pointers that I could get would be much appreciated.


Thanks,

                 -Nathan



-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenf...@oculusinfo.com
