Spark scheduling mode
I am building a Spark application in which I submit several jobs (PySpark). I am using threads to run them in parallel, and I am also setting:

    conf.set("spark.scheduler.mode", "FAIR")

Still, I see the jobs run serially, in FIFO order. Am I missing something?

Cheers,

Enrico
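A minimal sketch of the kind of setup being described, assuming the jobs are submitted from Python threads on a single SparkContext (the app name and job bodies are placeholders, not from the thread):

    from threading import Thread
    from pyspark import SparkConf, SparkContext

    # FAIR mode at the application level, as in the message above.
    conf = SparkConf().setAppName("multi-job-app")
    conf.set("spark.scheduler.mode", "FAIR")
    sc = SparkContext(conf=conf)

    def run_job(n):
        # Each thread submits an independent job on the shared SparkContext.
        sc.parallelize(range(1000000)).map(lambda x: x * n).count()

    threads = [Thread(target=run_job, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()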
Re: Spark scheduling mode
Is there a way to force scheduling to be fair inside the default pool? I mean round-robin for the jobs that belong to the default pool.

Cheers,

From: Mark Hamstra
Sent: Thursday, September 1, 2016 7:24:54 PM
To: enrico d'urso
Cc: user@spark.apache.org
Subject: Re: Spark scheduling mode

Just because you've flipped spark.scheduler.mode to FAIR, that doesn't mean that Spark can magically configure and start multiple scheduling pools for you, nor can it know to which pools you want jobs assigned. Without doing any setup of additional scheduling pools or assigning of jobs to pools, you're just dumping all of your jobs into the one available default pool (which is now being fair scheduled with an empty set of other pools), and the scheduling of jobs within that pool is still the default intra-pool scheduling, FIFO -- i.e., you've effectively accomplished nothing by only flipping spark.scheduler.mode to FAIR.
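For what Mark describes -- defining extra pools and assigning jobs to them -- a rough sketch might look like the following; the allocation file path and the "production" pool name are assumptions for illustration, not something taken from the thread:

    from threading import Thread
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("multi-pool-app")
    conf.set("spark.scheduler.mode", "FAIR")
    # Pools are defined in an allocation file; this path is a placeholder.
    conf.set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
    sc = SparkContext(conf=conf)

    def run_in_pool(pool_name, n):
        # spark.scheduler.pool is a thread-local property, so set it inside
        # the thread that submits the job.
        sc.setLocalProperty("spark.scheduler.pool", pool_name)
        sc.parallelize(range(1000000)).map(lambda x: x * n).count()

    # "production" is assumed to be a pool defined in the allocation file.
    threads = [Thread(target=run_in_pool, args=("production", i)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()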
Re: Spark scheduling mode
I tried it before, but I am still not able to see a proper round robin across the jobs I submit. Given this:

    <pool name="production">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>2</minShare>
    </pool>

each job inside the production pool should be scheduled in a round-robin way, am I right?

From: Mark Hamstra
Sent: Thursday, September 1, 2016 8:19:44 PM
To: enrico d'urso
Cc: user@spark.apache.org
Subject: Re: Spark scheduling mode

The default pool ("default") can be configured like any other pool: https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties
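Following the docs link above, the default pool itself can also be switched to FAIR intra-pool scheduling. A hedged sketch, assuming the pool properties are written to a temporary allocation file (the values and file handling are illustrative only):

    import tempfile
    from pyspark import SparkConf, SparkContext

    # Allocation file that reconfigures the "default" pool to use FAIR
    # intra-pool scheduling (weight and minShare values are illustrative).
    fair_xml = """<?xml version="1.0"?>
    <allocations>
      <pool name="default">
        <schedulingMode>FAIR</schedulingMode>
        <weight>1</weight>
        <minShare>0</minShare>
      </pool>
    </allocations>
    """

    f = tempfile.NamedTemporaryFile(suffix=".xml", delete=False)
    f.write(fair_xml.encode())
    f.close()

    conf = SparkConf().setAppName("default-pool-fair")
    conf.set("spark.scheduler.mode", "FAIR")
    conf.set("spark.scheduler.allocation.file", f.name)
    sc = SparkContext(conf=conf)
    # Jobs submitted without an explicit spark.scheduler.pool now land in
    # the fair-scheduled "default" pool.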
Re: Spark scheduling mode
Thank you. May I know when that comparator is called? It looks like the Spark scheduler does not have any form of preemption, am I right?

Thank you

From: Mark Hamstra
Sent: Thursday, September 1, 2016 8:44:10 PM
To: enrico d'urso
Cc: user@spark.apache.org
Subject: Re: Spark scheduling mode

Spark's FairSchedulingAlgorithm is not round robin: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulingAlgorithm.scala#L43

At the scope of fair scheduling Jobs within a single Pool, the Schedulable entities being handled (s1 and s2) are TaskSetManagers, which are at the granularity of Stages, not Jobs. Since weight is 1 and minShare is 0 for TaskSetManagers, the FairSchedulingAlgorithm for TaskSetManagers just boils down to prioritizing TaskSets (i.e. Stages) with the fewest number of runningTasks.
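A rough Python paraphrase of the comparator Mark links to, to make the "fewest running tasks first" point concrete; the dict fields are simplified stand-ins for the Scala Schedulable members, and the real code additionally breaks ties by name:

    # Simplified paraphrase of FairSchedulingAlgorithm.comparator, not the
    # actual Spark code.
    def fair_less_than(s1, s2):
        s1_needy = s1["runningTasks"] < s1["minShare"]
        s2_needy = s2["runningTasks"] < s2["minShare"]
        min_ratio1 = s1["runningTasks"] / max(s1["minShare"], 1.0)
        min_ratio2 = s2["runningTasks"] / max(s2["minShare"], 1.0)
        weight_ratio1 = s1["runningTasks"] / s1["weight"]
        weight_ratio2 = s2["runningTasks"] / s2["weight"]

        if s1_needy and not s2_needy:
            return True   # a schedulable below its minShare goes first
        if not s1_needy and s2_needy:
            return False
        if s1_needy and s2_needy:
            return min_ratio1 < min_ratio2
        # For TaskSetManagers (weight = 1, minShare = 0) both sides reach this
        # branch, so it reduces to "fewest running tasks first".
        return weight_ratio1 < weight_ratio2

    # Example: two TaskSetManagers (i.e. Stages) in one pool.
    tsm_a = {"runningTasks": 8, "minShare": 0, "weight": 1}
    tsm_b = {"runningTasks": 3, "minShare": 0, "weight": 1}
    print(fair_less_than(tsm_b, tsm_a))  # True: the stage with fewer running tasks is offered resources first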