Re: [DISCUSS] Support decimals with negative scale in decimal operation

2019-01-06 Thread Wenchen Fan
I think we need to do this for backward compatibility, and according to the
discussion in the doc, the SQL standard allows negative scale.

To do this, I think the PR should also include a doc for the decimal type,
covering the definition of precision and scale (this one looks pretty good),
the result type of decimal operations, and the behavior when writing out
decimals (e.g. we can cast decimal(1, -20) to decimal(20, 0) before writing).

Another question is, shall we set a min scale? e.g. shall we allow
decimal(1, -1000)?
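
To make precision and negative scale concrete, here is a minimal sketch. It
uses Python's standard `decimal` module as a stand-in, not Spark's own Decimal
class, and the helper name `precision_and_scale` is made up for illustration.
It reads a value as unscaled * 10**(-scale), so 1E+20 has precision 1 and
scale -20.

```python
from decimal import Decimal


def precision_and_scale(d: Decimal) -> tuple:
    """Return (precision, scale), reading the value as unscaled * 10**(-scale)."""
    _sign, digits, exponent = d.as_tuple()
    return len(digits), -exponent


# Unscaled value 1 with scale -20, i.e. 1 * 10**20.
d = Decimal("1E+20")
print(precision_and_scale(d))  # (1, -20)

# Rescale to scale 0 (a plain integer), as one might do before writing out.
widened = d.quantize(Decimal("1"))
print(widened, precision_and_scale(widened))
```

In this toy model the rescaled value carries 21 digits, so whatever target
type is chosen for writing has to leave room for the digits that the negative
scale shifts in.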

On Thu, Oct 25, 2018 at 9:49 PM Marco Gaido  wrote:

> Hi all,
>
> a bit more than one month ago, I sent a proposal for properly handling
> decimals with negative scales in our operations. This is a long-standing
> problem in our codebase, as we derived our rules from Hive and SQL Server,
> where negative scales are forbidden, while in Spark they are not.
>
> The discussion has been stale for a while now, with no more comments on the
> design doc:
> https://docs.google.com/document/d/17ScbMXJ83bO9lx8hB_jeJCSryhT9O_HDEcixDq0qmPk/edit#heading=h.x7062zmkubwm
> .
>
> So I am writing this e-mail in order to check whether there are more
> comments on it or we can go ahead with the PR.
>
> Thanks,
> Marco
>


Re: Spark Packaging Jenkins

2019-01-06 Thread Dongjoon Hyun
Thank you, Felix and Shane.

+1 for packaging except signing in Jenkins. That sounds perfect for the
next releases (2.2.3, 2.3.3, 3.0.0).

BTW, thanks to Shane, the following have been recovered as of today.

1. For all live branches (master to branch-2.2), the Jenkins Packaging jobs
build and package again. Only the signing errors remain.

  - Master:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-package/2181/console
  - branch-2.4:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.4-package/95/console
  - branch-2.3:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.3-package/230/console
  - branch-2.2:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-branch-2.2-package/240/console

2. Daily snapshot maven publishing is recovered. This is helpful for
partial module testing.

 -
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/

Thank you all. Everything has become healthier during the 2.2.3 release.
This will reduce the burden of the 2.3.3/3.0.0 releases. :D

Bests,
Dongjoon.


On Sun, Jan 6, 2019 at 11:42 AM Felix Cheung 
wrote:

> Awesome Shane!

Re: Spark Packaging Jenkins

2019-01-06 Thread Felix Cheung
Awesome Shane!



From: shane knapp 
Sent: Sunday, January 6, 2019 11:38 AM
To: Felix Cheung
Cc: Dongjoon Hyun; Wenchen Fan; dev
Subject: Re: Spark Packaging Jenkins

noted.  i like the idea of building (but not signing) the release and will 
update the job(s) this week.



Re: Spark Packaging Jenkins

2019-01-06 Thread shane knapp
noted.  i like the idea of building (but not signing) the release and will
update the job(s) this week.

On Sun, Jan 6, 2019 at 11:22 AM Felix Cheung 
wrote:

> https://spark.apache.org/release-process.html
>
> Look for do-release-docker.sh script

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Spark Packaging Jenkins

2019-01-06 Thread Felix Cheung
https://spark.apache.org/release-process.html

Look for do-release-docker.sh script



From: Felix Cheung 
Sent: Sunday, January 6, 2019 11:17 AM
To: Dongjoon Hyun; Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

The release process doc should have been updated on this - as mentioned we do 
not use Jenkins for release signing (take this offline if further discussion is 
needed)

The release build on Jenkins can still be useful for pre-validating the release 
build process (without actually signing it)





Re: Spark Packaging Jenkins

2019-01-06 Thread Felix Cheung
The release process doc should have been updated on this - as mentioned we do 
not use Jenkins for release signing (take this offline if further discussion is 
needed)

The release build on Jenkins can still be useful for pre-validating the release 
build process (without actually signing it)



From: Dongjoon Hyun 
Sent: Saturday, January 5, 2019 9:46 PM
To: Wenchen Fan
Cc: dev; shane knapp
Subject: Re: Spark Packaging Jenkins

Thank you, Wenchen.

I see. I'll update the doc and proceed to the next step manually as you advise. 
And it seems that we can stop the outdated Jenkins jobs, too.

Bests,
Dongjoon.

On Sat, Jan 5, 2019 at 20:15 Wenchen Fan <cloud0...@gmail.com> wrote:
IIRC there was a change to the release process: we stopped using the shared gpg
key on Jenkins and now use the personal key of the release manager. I'm not sure
Jenkins can help with testing the packaging anymore.

BTW the release manager needs to run the packaging script by himself. If there
is a problem, the release manager will find out sooner or later.



On Sun, Jan 6, 2019 at 6:34 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, All.

It turns out that `gpg signing` is the next hurdle in Spark Packaging Jenkins.
Has something changed on our Jenkins machine since the 2.4.0 release?

  gpg: skipped "/home/jenkins/workspace/spark-master-package/spark-utils/new-release-scripts/jenkins/jenkins-credentials-JEtz0nyn/gpg.tmp": No secret key
  gpg: signing failed: No secret key

Bests,
Dongjoon.


On Fri, Jan 4, 2019 at 11:52 AM shane knapp <skn...@berkeley.edu> wrote:
https://issues.apache.org/jira/browse/SPARK-26537

On Fri, Jan 4, 2019 at 11:31 AM shane knapp <skn...@berkeley.edu> wrote:
this may push into early next week...  these builds were set up before my
time, and i'm currently unraveling how they all work before pushing a commit to
fix stuff.

nothing like some code archaeology to make my friday more exciting!  :)

shane

On Fri, Jan 4, 2019 at 11:08 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Thank you, Shane!

Bests,
Dongjoon.

On Fri, Jan 4, 2019 at 10:50 AM shane knapp <skn...@berkeley.edu> wrote:
yeah, i'll get on that today.  thanks for the heads up.

On Fri, Jan 4, 2019 at 10:46 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, All

As a part of the release process, we need to check the Packaging/Compile/Test
Jenkins status.

http://spark.apache.org/release-process.html

1. Spark Packaging: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/
2. Spark QA Compile: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/
3. Spark QA Test: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/

Currently, (2) and (3) are working because they use GitHub
(https://github.com/apache/spark.git).
But (1) seems to be broken because it's looking for the old repo
(https://git-wip-us.apache.org/repos/asf/spark.git/info/refs) instead of the
new GitBox.

Can we fix this this week?

Bests,
Dongjoon.





Spark Scheduler - Task and job levels - How it Works?

2019-01-06 Thread Miguel F. S. Vasconcelos
Hello, 

I'm new to Spark and trying to understand how exactly the Spark scheduler
works.

In the article /"Resilient Distributed Datasets: A Fault-Tolerant
Abstraction for In-Memory Cluster Computing"/, section "5.1 Job
Scheduling" says:
/
"Whenever a user runs an action (e.g., count or save) on an RDD, the
scheduler examines that RDD’s lineage graph to build a DAG of stages to
execute, as illustrated in Figure 5. Each stage contains as many pipelined
transformations with narrow dependencies as possible. The boundaries of the
stages are the shuffle operations required for wide dependencies, or any
already computed partitions that can shortcircuit the computation of a
parent RDD. The scheduler then launches tasks to compute missing partitions
from each stage until it has computed the target RDD.
Our scheduler assigns tasks to machines based on data locality using delay
scheduling [32]. If a task needs to process a partition that is available in
memory on a node, we send it to that node. Otherwise, if a task processes a
partition for which the containing RDD provides preferred locations (e.g.,
an HDFS file), we send it to those. "/
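
As I read it, the placement rule in that passage amounts to something like the
following toy sketch. This is only my reading, not Spark's code: the function
`pick_node` and its arguments are hypothetical, and the waiting part of delay
scheduling (holding a task back briefly in the hope of a local slot) is left
out entirely.

```python
def pick_node(partition, cached_on, preferred_locations, free_nodes):
    """Pick a node for the task that computes `partition` (hypothetical helper).

    1. Prefer a free node that already holds the partition in memory.
    2. Otherwise prefer a free node among the RDD's preferred locations.
    3. Otherwise fall back to any free node, or None if there is none.
    """
    for node in cached_on.get(partition, []):
        if node in free_nodes:
            return node
    for node in preferred_locations.get(partition, []):
        if node in free_nodes:
            return node
    # No locality information applies: take any free node deterministically.
    return next(iter(sorted(free_nodes)), None)


cached_on = {"p0": ["node-a"]}              # partitions held in memory
preferred = {"p1": ["node-b", "node-c"]}    # e.g. HDFS block locations
free = {"node-a", "node-c", "node-d"}

print(pick_node("p0", cached_on, preferred, free))  # node-a
print(pick_node("p1", cached_on, preferred, free))  # node-c
```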

After reading the gitbook /"Mastering Apache Spark"/ by Jacek Laskowski and
some of Spark's code, this is what I have understood about scheduling in
Spark:

1. When an action is performed on an RDD, Spark sends it as a job to the
DAGScheduler;
2. The DAGScheduler computes the execution DAG based on the RDD's lineage and
splits the job into stages (at wide dependencies);
3. The resulting stages are transformed into sets of tasks, which are sent to
the TaskScheduler;
4. The TaskScheduler sends each set of tasks to the executors, where they
run.
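
To make my understanding of the stage split concrete, here is a toy model of
it. It is a sketch under my own assumptions, not the DAGScheduler's code: the
`Rdd` class and `build_stages` function are invented for illustration, each
dependency is simply tagged "narrow" or "wide", and shared parents (diamond
lineages) are not handled.

```python
from dataclasses import dataclass, field


@dataclass
class Rdd:
    name: str
    # Each dependency is a (parent, kind) pair, kind being "narrow" or "wide".
    deps: list = field(default_factory=list)


def build_stages(final_rdd):
    """Return stages as lists of RDD names, parent stages before child stages."""
    stages = []

    def visit(rdd):
        stage = []     # RDDs pipelined together in the current stage
        parents = []   # RDDs on the far side of a shuffle (wide dependency)
        stack = [rdd]
        while stack:
            current = stack.pop()
            stage.append(current.name)
            for parent, kind in current.deps:
                if kind == "narrow":
                    stack.append(parent)    # stays in the same stage
                else:
                    parents.append(parent)  # starts a new parent stage
        for parent in parents:
            visit(parent)
        stages.append(stage[::-1])

    visit(final_rdd)
    return stages


lines = Rdd("lines")
pairs = Rdd("pairs", [(lines, "narrow")])    # e.g. a map
counts = Rdd("counts", [(pairs, "wide")])    # e.g. a reduceByKey
frequent = Rdd("frequent", [(counts, "narrow")])  # e.g. a filter

print(build_stages(frequent))
# [['lines', 'pairs'], ['counts', 'frequent']]
```

Each resulting stage would then become one set of tasks, one task per
partition, which is what gets handed to the TaskScheduler in step 3.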

Is this flow correct?

And are the jobs discovered during the application's execution and sent
sequentially to the DAGScheduler?

In the file /DAGScheduler.scala/ there's this comment:
/
* The high-level scheduling layer that implements stage-oriented scheduling.
* It computes a DAG of stages for each job, keeps track of which RDDs and
* stage outputs are materialized,
* *and finds a minimal schedule to run the job*.
* It then submits stages as TaskSets to an underlying TaskScheduler
* implementation that runs them on the cluster. A TaskSet contains fully
* independent tasks that can run right away based on the data that's already
* on the cluster (e.g. map output files from previous stages), though it may
* fail if this data becomes unavailable.
/

Regarding the part /"finds a minimal schedule to run the job"/, I have not
found the algorithm that computes this minimal schedule. Can you point me to
it?


And, based on these comments:

File /TaskScheduler.scala/
/
* Low-level task scheduler interface, currently implemented exclusively by
* [[org.apache.spark.scheduler.TaskSchedulerImpl]].
* This interface allows plugging in different task schedulers. Each
* TaskScheduler schedules tasks for a single SparkContext. These schedulers
* get sets of tasks submitted to them from the DAGScheduler for each stage,
* and are responsible for sending the tasks to the cluster, running them,
* retrying if there are failures, and mitigating stragglers. They return
* events to the DAGScheduler.
/
File /TaskSchedulerImpl.scala/
/
* Schedules tasks for multiple types of clusters by acting through a
* SchedulerBackend.
/
File /SchedulerBackend.scala/
/
* A backend interface for scheduling systems that allows plugging in
* different ones under TaskSchedulerImpl. We assume a Mesos-like model where
* the application gets resource offers as machines become available and can
* launch tasks on them.
/
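
As far as I can tell, the offer-based model in that last comment boils down to
something like the toy sketch below. The function `assign_offers` is
hypothetical and is not part of Spark; it assumes every task needs exactly one
core and ignores locality, retries and stragglers.

```python
def assign_offers(offers, pending_tasks):
    """Match pending tasks to offered cores (toy model, one core per task).

    offers: list of (executor_id, free_cores) pairs
    pending_tasks: list of task ids in launch order; consumed in place
    Returns the launches as a list of (task_id, executor_id) pairs.
    """
    launches = []
    for executor_id, free_cores in offers:
        for _ in range(free_cores):
            if not pending_tasks:
                return launches
            launches.append((pending_tasks.pop(0), executor_id))
    return launches


pending = ["t0", "t1", "t2"]
launched = assign_offers([("exec-1", 2), ("exec-2", 2)], pending)
print(launched)
# [('t0', 'exec-1'), ('t1', 'exec-1'), ('t2', 'exec-2')]
```

Tasks left in `pending_tasks` after a round would simply wait for the next
round of offers.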

And this from Spark docs:

/"Scheduling Within an Application
Inside a given Spark application (SparkContext instance), multiple parallel
jobs can run simultaneously if they were submitted from separate threads. By
“job”, in this section, we mean a Spark action (e.g. save, collect) and any
tasks that need to run to evaluate that action. Spark’s scheduler is fully
thread-safe and supports this use case to enable applications that serve
multiple requests (e.g. queries for multiple users).
By default, Spark’s scheduler runs jobs in FIFO fashion. Each job is divided
into “stages” (e.g. map and reduce phases), and the first job gets priority
on all available resources while its stages have tasks to launch, then the
second job gets priority, etc. If the jobs at the head of the queue don’t
need to use the whole cluster, later jobs can start to run right away, but
if the jobs at the head of the queue are large, then later jobs may be
delayed significantly.
Starting in Spark 0.8, it is also possible to configure fair sharing between
jobs. Under fair sharing, Spark assigns tasks between jobs in a “round
robin” fashion, so that all jobs get a roughly equal share of cluster
resources. This means that short jobs submitted while a long job is running
can start receiving resources right away and still get good response times,
without waiting for the long job to finish. This mode is best for