(Resending, text only, as my first post on the 2nd never seemed to make it.)

Using parallelize() on a dataset, I'm only seeing two tasks rather than the number of cores in the Mesos cluster. This is with Spark 1.5.1, using the Mesos coarse-grained scheduler.

Running pyspark in a console shows that it takes a while before the Mesos executors come online (at which point the default parallelism changes). If I add a "sleep 30" after initialising the SparkContext, I get the "right" number (42, by coincidence!).

I've just tried increasing spark.scheduler.minRegisteredResourcesRatio to 0.5, but this doesn't affect either the test case below or my code.
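For reference, this is how I'm setting it (the two property names are the standard Spark scheduler settings; the master URL and values here are just placeholders, not my real config):

```shell
# Ask the scheduler to wait for at least half the expected resources,
# up to 30s, before it starts scheduling tasks.
pyspark --master mesos://<master-url> \
  --conf spark.scheduler.minRegisteredResourcesRatio=0.5 \
  --conf spark.scheduler.maxRegisteredResourcesWaitingTime=30s
```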

Is there something else I can do instead? Perhaps it should be looking at how many tasks _should_ be available rather than how many currently are (I'm also using dynamicAllocation).
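In the meantime, a slightly less crude workaround than a fixed sleep is to poll until the expected parallelism shows up. A minimal sketch in plain Python (the `wait_until` helper and the fake registration counter are mine, not Spark API; in a real session the predicate would be something like `lambda: sc.defaultParallelism >= expected_cores`):

```python
import time

def wait_until(predicate, timeout=60.0, interval=1.0):
    """Poll predicate() until it returns True or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()  # one last check after the deadline

# Stand-in for executor registration: pretend 14 cores register per poll.
state = {"executors": 0}

def fake_registration():
    state["executors"] += 14
    return state["executors"] >= 42

assert wait_until(fake_registration, timeout=5.0, interval=0.01)
print(state["executors"])  # 42
```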

15/12/02 14:34:09 INFO mesos.CoarseMesosSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
>>>
>>>
>>> print (sc.defaultParallelism)
2
>>> 15/12/02 14:34:12 INFO mesos.CoarseMesosSchedulerBackend: Mesos task 5 is now TASK_RUNNING
15/12/02 14:34:13 INFO mesos.MesosExternalShuffleClient: Successfully registered app 20151117-115458-164233482-5050-24333-0126 with external shuffle service.
....
15/12/02 14:34:15 INFO mesos.CoarseMesosSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@ip-10-1-200-147.ec2.internal:41194/user/Executor#-1021429650]) with ID 20151117-115458-164233482-5050-24333-S22/5
15/12/02 14:34:15 INFO spark.ExecutorAllocationManager: New executor 20151117-115458-164233482-5050-24333-S22/5 has registered (new total is 1)
....
>>> print (sc.defaultParallelism)
42

Thanks

Adrian Bridgett

