Thank you, Daniel and Yong!
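For reference, here is a minimal sketch, assuming Spark 2.x and the Scala SparkSession API, of where the properties discussed below would typically be set. The values are placeholders only, not recommendations, and executor settings are often passed on the spark-submit command line instead:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: illustrative values only.
val spark = SparkSession.builder()
  .appName("parallelism-example")
  // Total number of partitions (tasks) produced by Spark SQL shuffles.
  .config("spark.sql.shuffle.partitions", "200")
  // Default number of partitions for RDD shuffle operations (join, reduceByKey, ...).
  .config("spark.default.parallelism", "200")
  // On YARN: number of executors requested for the application.
  .config("spark.executor.instances", "4")
  // Cores per executor; instances * cores bounds how many tasks run in parallel.
  .config("spark.executor.cores", "2")
  .getOrCreate()
```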
On Wed, Jan 18, 2017 at 4:56 PM, Daniel Siegmann <dsiegm...@securityscorecard.io> wrote:

> I am not too familiar with Spark Standalone, so unfortunately I cannot
> give you any definite answer. I do want to clarify something though.
>
> The properties spark.sql.shuffle.partitions and spark.default.parallelism
> affect how your data is split up, which determines the *total* number of
> tasks, *NOT* the number of tasks running in parallel. Of course you will
> never run more tasks in parallel than there are in total, so if your data
> is small you might be able to control parallelism via these parameters -
> but that wouldn't typically be how you'd use them.
>
> On YARN, as you noted, there is spark.executor.instances as well as
> spark.executor.cores, and you'd multiply them to determine the maximum
> number of tasks that would run in parallel on your cluster. But there is
> no guarantee the executors will be distributed evenly across the nodes.
>
> Unfortunately I'm not familiar with how this works on Spark Standalone.
> Your expectations seem reasonable to me. Sorry I can't be more helpful;
> hopefully someone else will be able to explain exactly how this works.

--
Saliya Ekanayake, Ph.D
Applied Computer Scientist
Network Dynamics and Simulation Science Laboratory (NDSSL)
Virginia Tech, Blacksburg