(resending as text only, since my first post on the 2nd never seemed to make it)
Using parallelize() on a dataset, I'm only seeing two tasks rather than
the number of cores in the Mesos cluster. This is with Spark 1.5.1,
using the Mesos coarse-grained scheduler.
Running pyspark in a console shows that it takes a while for the Mesos
executors to come online (at which point the default parallelism
changes). If I add a "sleep 30" after initialising the SparkContext,
I get the "right" number (42, by coincidence!).
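Rather than a fixed sleep, one option is to poll until the expected number of executors has registered. Here's a minimal sketch of such a wait loop; `wait_for_executors` and `get_count` are my own names for illustration, and the commented SparkContext accessor is an internal, version-dependent API, not something guaranteed to be stable:

```python
import time

def wait_for_executors(get_count, target, timeout=60.0, poll=1.0):
    """Poll get_count() until it reaches target or the timeout expires.

    Returns the last observed count, so the caller can decide whether
    to proceed with fewer executors than requested.
    """
    deadline = time.monotonic() + timeout
    count = get_count()
    while count < target and time.monotonic() < deadline:
        time.sleep(poll)
        count = get_count()
    return count

# With a live SparkContext, the counter might be something like
# (internal API, may differ between Spark versions):
#   lambda: sc._jsc.sc().getExecutorMemoryStatus().size() - 1
```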
I've just tried increasing minRegisteredResourcesRatio to 0.5, but this
affects neither the test case below nor my code.
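For reference, minRegisteredResourcesRatio is normally paired with spark.scheduler.maxRegisteredResourcesWaitingTime (default 30s), which caps how long the scheduler waits before giving up on the ratio. A hypothetical invocation (job name and values are placeholders):

```shell
# Ask the scheduler to wait for at least half the requested resources,
# for up to 30 seconds, before declaring itself ready.
spark-submit \
  --conf spark.scheduler.minRegisteredResourcesRatio=0.5 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=30s \
  my_job.py
```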
Is there something else I can do instead? Perhaps Spark should be
looking at how many cores _should_ be available rather than how many
currently are (I'm also using dynamicAllocation).
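As a workaround I could pass numSlices to parallelize() explicitly, e.g. sc.parallelize(data, 42), which sidesteps defaultParallelism entirely. For intuition, a local collection gets cut into contiguous, roughly equal slices; this is my own pure-Python illustration of that idea, not Spark's actual code:

```python
def slice_collection(seq, num_slices):
    """Split seq into num_slices contiguous, roughly equal chunks,
    similar in spirit to how parallelize() partitions a local collection."""
    n = len(seq)
    return [seq[n * i // num_slices : n * (i + 1) // num_slices]
            for i in range(num_slices)]
```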
15/12/02 14:34:09 INFO mesos.CoarseMesosSchedulerBackend:
SchedulerBackend is ready for scheduling beginning after reached
minRegisteredResourcesRatio: 0.0
>>>
>>>
>>> print (sc.defaultParallelism)
2
>>> 15/12/02 14:34:12 INFO mesos.CoarseMesosSchedulerBackend: Mesos
task 5 is now TASK_RUNNING
15/12/02 14:34:13 INFO mesos.MesosExternalShuffleClient: Successfully
registered app 20151117-115458-164233482-5050-24333-0126 with external
shuffle service.
....
15/12/02 14:34:15 INFO mesos.CoarseMesosSchedulerBackend: Registered
executor:
AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@ip-10-1-200-147.ec2.internal:41194/user/Executor#-1021429650])
with ID 20151117-115458-164233482-5050-24333-S22/5
15/12/02 14:34:15 INFO spark.ExecutorAllocationManager: New executor
20151117-115458-164233482-5050-24333-S22/5 has registered (new total is 1)
....
>>> print (sc.defaultParallelism)
42
Thanks
Adrian Bridgett
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org