Hello, 

I'm new to Spark and I'm trying to understand how exactly the Spark scheduler works.

In the article /"Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing"/ in section  5.1 Job
Scheduling" its said that:
/
"Whenever a user runs an action (e.g., count or save) on an RDD, the
scheduler examines that RDD’s lineage graph to build a DAG of stages to
execute, as illustrated in Figure 5. Each stage contains as many pipelined
transformations with narrow dependencies as possible. The boundaries of the
stages are the shuffle operations required for wide dependencies, or any
already computed partitions that can short-circuit the computation of a
parent RDD. The scheduler then launches tasks to compute missing partitions
from each stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on data locality using delay
scheduling [32]. If a task needs to process a partition that is available in
memory on a node, we send it to that node. Otherwise, if a task processes a
partition for which the containing RDD provides preferred locations (e.g.,
an HDFS file), we send it to those. "/
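
To make sure I follow the narrow/wide distinction, this is the kind of toy
pipeline I have in mind (my own example, not from the paper; "input.txt",
the object name and local[2] are just placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object StageBoundaries {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("stage-boundaries").setMaster("local[2]"))

    val lines  = sc.textFile("input.txt")        // narrow: read input splits
    val words  = lines.flatMap(_.split("\\s+"))  // narrow: pipelined with textFile
    val pairs  = words.map(word => (word, 1))    // narrow: still the same stage
    val counts = pairs.reduceByKey(_ + _)        // wide: shuffle => stage boundary
    counts.count()                               // action => one job, two stages

    // The indentation in toDebugString seems to show where the stages split:
    println(counts.toDebugString)

    sc.stop()
  }
}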

After reading the gitbook /"Mastering Apache Spark"/ by Jacek Laskowski and
some of Spark's code, what I have understood about scheduling in Spark is
this:

When an action is performed on an RDD, Spark sends it as a job to the
DAGScheduler;
the DAGScheduler computes the execution DAG based on the RDD's lineage and
splits the job into stages (at the wide dependencies);
the resulting stages are turned into sets of tasks, which are sent to the
TaskScheduler;
the TaskScheduler sends the sets of tasks to the executors, where they run.

Is this flow correct?

And are the jobs discovered during the application's execution and sent
sequentially to the DAGScheduler?
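
For example, in a small driver program like this one (again my own toy
example), I would expect each action to become its own job, submitted to the
DAGScheduler at the point where the action is called:

import org.apache.spark.{SparkConf, SparkContext}

object TwoJobs {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("two-jobs").setMaster("local[2]"))

    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // Two actions, one driver thread: I expect to see job 0 and job 1 in the
    // web UI, with job 1 only being submitted after job 0 has returned.
    val total = nums.sum()                       // action => job 0
    val evens = nums.filter(_ % 2 == 0).count()  // action => job 1

    println(s"sum=$total evens=$evens")
    sc.stop()
  }
}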

In the file /DAGScheduler.scala/ there's this comment:
/"The high-level scheduling layer that implements stage-oriented scheduling.
It computes a DAG of stages for each job, keeps track of which RDDs and
stage outputs are materialized, *and finds a minimal schedule to run the
job*. It then submits stages as TaskSets to an underlying TaskScheduler
implementation that runs them on the cluster. A TaskSet contains fully
independent tasks that can run right away based on the data that's already
on the cluster (e.g. map output files from previous stages), though it may
fail if this data becomes unavailable."/

Regarding the part /"finds a minimal schedule to run the job"/, I have not
found the algorithm that computes this minimal schedule. Can you help me?
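
My current guess, which I would like to confirm, is that "minimal" means
stages whose outputs are already materialized are not resubmitted. For
example (my own test, with placeholder names):

import org.apache.spark.{SparkConf, SparkContext}

object MinimalScheduleGuess {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("minimal-schedule-guess").setMaster("local[2]"))

    val counts = sc.textFile("input.txt")   // "input.txt" is just a placeholder
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)                   // shuffle: ShuffleMapStage + ResultStage
      .cache()

    counts.count()  // job 0: both stages run
    counts.count()  // job 1: the web UI shows the first stage as "skipped",
                    // which I assume is the "minimal schedule": only the
                    // missing partitions/stages are actually submitted

    sc.stop()
  }
}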


And, based on these comments:

File /TaskScheduler.scala/:
/"Low-level task scheduler interface, currently implemented exclusively by
[[org.apache.spark.scheduler.TaskSchedulerImpl]]. This interface allows
plugging in different task schedulers. Each TaskScheduler schedules tasks
for a single SparkContext. These schedulers get sets of tasks submitted to
them from the DAGScheduler for each stage, and are responsible for sending
the tasks to the cluster, running them, retrying if there are failures, and
mitigating stragglers. They return events to the DAGScheduler."/

File /TaskSchedulerImpl.scala/:
/"Schedules tasks for multiple types of clusters by acting through a
SchedulerBackend."/

File /SchedulerBackend.scala/:
/"A backend interface for scheduling systems that allows plugging in
different ones under TaskSchedulerImpl. We assume a Mesos-like model where
the application gets resource offers as machines become available and can
launch tasks on them."/
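
The way I currently picture this layering, as a very rough sketch with
made-up names (these are NOT the real Spark traits or signatures, just my
mental model):

// Toy mental model only; the names and signatures below are invented and do
// not match the real org.apache.spark.scheduler classes.
trait ToySchedulerBackend {
  def freeSlots(): Int              // "resource offers" in the Mesos-like model
  def launch(task: Runnable): Unit  // actually run a task on some machine
}

trait ToyTaskScheduler {
  // The DAGScheduler hands over the tasks of one stage as a set.
  def submitTasks(stageId: Int, tasks: Seq[Runnable]): Unit
}

class ToyFifoTaskScheduler(backend: ToySchedulerBackend) extends ToyTaskScheduler {
  private val queue = scala.collection.mutable.Queue.empty[Runnable]

  override def submitTasks(stageId: Int, tasks: Seq[Runnable]): Unit = {
    queue ++= tasks   // FIFO: earlier submissions keep priority
    while (backend.freeSlots() > 0 && queue.nonEmpty) {
      backend.launch(queue.dequeue())
    }
  }
}

Is this roughly the right picture of who talks to whom?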

And this from Spark docs:

/"Scheduling Within an Application
Inside a given Spark application (SparkContext instance), multiple parallel
jobs can run simultaneously if they were submitted from separate threads. By
“job”, in this section, we mean a Spark action (e.g. save, collect) and any
tasks that need to run to evaluate that action. Spark’s scheduler is fully
thread-safe and supports this use case to enable applications that serve
multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided
into “stages” (e.g. map and reduce phases), and the first job gets priority
on all available resources while its stages have tasks to launch, then the
second job gets priority, etc. If the jobs at the head of the queue don’t
need to use the whole cluster, later jobs can start to run right away, but
if the jobs at the head of the queue are large, then later jobs may be
delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between
jobs. Under fair sharing, Spark assigns tasks between jobs in a “round
robin” fashion, so that all jobs get a roughly equal share of cluster
resources. This means that short jobs submitted while a long job is running
can start receiving resources right away and still get good response times,
without waiting for the long job to finish. This mode is best for multi-user
settings."/
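
For context, this is how I have been experimenting with the two modes, using
the documented spark.scheduler.mode property and actions called from
separate threads (my own local test, names are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object FifoVsFair {
  def inThread(body: => Unit): Thread = {
    val t = new Thread(new Runnable { def run(): Unit = body })
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("fifo-vs-fair")
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR")   // default is FIFO
    val sc = new SparkContext(conf)

    // A long job and a short job submitted concurrently from two threads.
    // Under FAIR I would expect the short job to get slots while the long
    // one is still running; under FIFO it should wait behind it.
    val longJob = inThread {
      sc.parallelize(1 to 100000, 8).map { x => Thread.sleep(1); x }.count()
    }
    val shortJob = inThread {
      sc.parallelize(1 to 100, 2).count()
    }

    longJob.join(); shortJob.join()
    sc.stop()
  }
}

The docs also mention sc.setLocalProperty("spark.scheduler.pool", "poolName")
for assigning the jobs submitted from a thread to a named fair scheduler pool.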

I'm unsure whether this scheduling happens at the task level, the job level,
or both. Do these scheduling modes, FIFO and FAIR, apply to tasks or to jobs?

Also, since TaskScheduler is an interface, it is possible to "plug in"
different scheduling algorithms, correct?

But what about the DAGScheduler: is there any interface that allows plugging
different scheduling algorithms into it?

In the video "Introduction to AmpLab Spark Internals" it is said that
pluggable inter-job scheduling is a possible future extension. Does anyone
know whether this has already been addressed?

I'm starting a master's degree and I'd really like to contribute to Spark.
Are there any suggestions of issues in Spark scheduling that could be
addressed?

Best,
Miguel


