Re: Is the Spark fair scheduler for Kubernetes?

2022-04-11 Thread Martin Grigorov
Hi,

On Mon, Apr 11, 2022 at 7:43 AM Jason Jun  wrote:

> the official doc, https://spark.apache.org/docs/latest/job-scheduling.html,
> doesn't mention whether it works on a Kubernetes cluster.
>

You could use the Volcano scheduler for more advanced setups on Kubernetes.
Here is an article explaining how to make use of the new integration
between Spark and Volcano in Spark 3.3 (not yet released at the time of writing):
https://martin-grigorov.medium.com/native-integration-between-apache-spark-and-volcano-kubernetes-scheduler-488f54dbbab3
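
For reference, a rough sketch of what enabling Volcano could look like when
building the SparkSession (property names are the ones described in the
article and the 3.3 docs; double-check them once 3.3 is released, and note
the master URL below is just a placeholder):

  import org.apache.spark.sql.SparkSession

  // Sketch only: run Spark on Kubernetes with Volcano as the pod scheduler (Spark 3.3+).
  val spark = SparkSession.builder()
    .master("k8s://https://kubernetes.example.com:6443") // placeholder API server
    .config("spark.kubernetes.scheduler.name", "volcano")
    .config("spark.kubernetes.driver.pod.featureSteps",
      "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .config("spark.kubernetes.executor.pod.featureSteps",
      "org.apache.spark.deploy.k8s.features.VolcanoFeatureStep")
    .getOrCreate()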

Regards,
Martin


>
> Can anyone quickly answer this?
>
> TIA.
> Jason
>


Is the Spark fair scheduler for Kubernetes?

2022-04-10 Thread Jason Jun
the official doc, https://spark.apache.org/docs/latest/job-scheduling.html,
doesn't mention whether it works on a Kubernetes cluster.

Can anyone quickly answer this?

TIA.
Jason


Spark FAIR Scheduler vs FIFO Scheduler

2018-06-18 Thread Alessandro Liparoti
Good morning,

I have a conceptual question. In an application I am working on, when I
write some results to HDFS (*action 1*), I use only ~30 executors out of 200,
and I would like to improve resource utilization in this case.
I am aware that repartitioning the df to 200 before action 1 would produce
200 tasks and full executor utilization, but for several reasons that is not
what I want to do.
What I would like to do is use the other ~170 executors to work on the
actions (jobs) coming after action 1. The normal case would be that *action
2* starts after action 1 (FIFO), but here I want them to start at the same
time, using the idle executors.

My question is: is this achievable with the FAIR scheduler approach, and if
so, how?

As I read it, the fair scheduler needs a pool of jobs and then schedules
their tasks in a round-robin fashion. If I submit action 1 and action 2 at
the same time (multi-threading) to a fair pool, which of the following
happens?

   1. at every moment, all (or almost all) executors are used in parallel
   (30 for action 1, the rest for action 2)
   2. for a certain small amount of time X, 30 executors are used for
   action 1, then for another time X the other executors are used for action
   2, then again X units of time for action 1, and so on...

Of the two, option 1 would actually improve cluster utilization, while option
2 would only let both jobs advance at the same time. Can someone who
has knowledge of the FAIR scheduler help me understand how it works?
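
For concreteness, this is roughly what I have in mind (a minimal sketch,
assuming spark.scheduler.mode=FAIR and two pools "pool1" and "pool2" defined
in a fairscheduler.xml; pool names and the output path are illustrative):

  import org.apache.spark.sql.{DataFrame, SparkSession}
  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  def runBothActions(spark: SparkSession, df1: DataFrame, df2: DataFrame): Unit = {
    val action1 = Future {
      // the pool is a thread-local property, so set it in the submitting thread
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
      df1.write.mode("overwrite").parquet("hdfs:///tmp/action1_output") // action 1
    }
    val action2 = Future {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
      df2.count() // action 2
    }
    Await.result(Future.sequence(Seq(action1, action2)), Duration.Inf)
  }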

Thanks,
*Alessandro Liparoti*


Re: Fair scheduler pool leak

2018-04-09 Thread Imran Rashid
If I understand what you're trying to do correctly, I think you really just
want one pool, but you want to change the mode *within* the pool to be FAIR
as well

https://spark.apache.org/docs/latest/job-scheduling.html#configuring-pool-properties

You'd still need to change the conf file to set up that pool, but that
should be fairly straightforward. Another approach to what you're asking
might be to expose the scheduler configuration as command-line confs as
well, which seems reasonable and simple.
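
Concretely, that conf-file route would look roughly like this (a sketch; the
pool name and file path are illustrative):

  // fairscheduler.xml with a single pool whose *internal* mode is FAIR, e.g.
  //   <pool name="mypool">
  //     <schedulingMode>FAIR</schedulingMode>
  //     <weight>1</weight>
  //     <minShare>0</minShare>
  //   </pool>
  // wrapped in the usual <allocations> root element.
  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    .set("spark.scheduler.mode", "FAIR")
    .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
  // jobs submitted from a thread then pick the pool via
  //   sc.setLocalProperty("spark.scheduler.pool", "mypool")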

On Sat, Apr 7, 2018 at 5:55 PM, Matthias Boehm <mboe...@gmail.com> wrote:

> well, the point was "in a programmatic way without the need for
> additional configuration files which is a hassle for a library" -
> anyway, I appreciate your comments.
>
> Regards,
> Matthias
>
> On Sat, Apr 7, 2018 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
> >> Providing a way to set the mode of the default scheduler would be
> awesome.
> >
> >
> > That's trivial: Just use the pool configuration XML file and define a
> pool
> > named "default" with the characteristics that you want (including
> > schedulingMode FAIR).
> >
> > You only get the default construction of the pool named "default" if you
> > don't define your own "default".
> >
> > On Sat, Apr 7, 2018 at 2:32 PM, Matthias Boehm <mboe...@gmail.com>
> wrote:
> >>
> >> No, these pools are not created per job but per parfor worker and
> >> thus, used to execute many jobs. For all scripts with a single
> >> top-level parfor this is equivalent to static initialization. However,
> >> yes we create these pools dynamically on demand to avoid unnecessary
> >> initialization and handle scenarios of nested parfor.
> >>
> >> At the end of the day, we just want to configure fair scheduling in a
> >> programmatic way without the need for additional configuration files
> >> which is a hassle for a library that is meant to work out-of-the-box.
> >> Simply setting 'spark.scheduler.mode' to FAIR does not do the trick
> >> because we end up with a single default fair scheduler pool in FIFO
> >> mode, which is equivalent to FIFO. Providing a way to set the mode of
> >> the default scheduler would be awesome.
> >>
> >> Regarding why fair scheduling showed generally better performance for
> >> out-of-core datasets, I don't have a good answer. My guess was
> >> isolated job scheduling and better locality of in-memory partitions.
> >>
> >> Regards,
> >> Matthias
> >>
> >> On Sat, Apr 7, 2018 at 8:50 AM, Mark Hamstra <m...@clearstorydata.com>
> >> wrote:
> >> > Sorry, but I'm still not understanding this use case. Are you somehow
> >> > creating additional scheduling pools dynamically as Jobs execute? If
> so,
> >> > that is a very unusual thing to do. Scheduling pools are intended to
> be
> >> > statically configured -- initialized, living and dying with the
> >> > Application.
> >> >
> >> > On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm <mboe...@gmail.com>
> >> > wrote:
> >> >>
> >> >> Thanks for the clarification Imran - that helped. I was mistakenly
> >> >> assuming that these pools are removed via weak references, as the
> >> >> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
> >> >> the time being, we'll just work around it, but I'll file a
> >> >> nice-to-have improvement JIRA. Also, you're right, we do indeed see these
> >> >> warnings but they're usually hidden when running with ERROR or INFO
> >> >> (due to overwhelming output) log levels.
> >> >>
> >> >> Just to give the context: We use these scheduler pools in SystemML's
> >> >> parallel for loop construct (parfor), which allows combining data-
> and
> >> >> task-parallel computation. If the data fits into the remote memory
> >> >> budget, the optimizer may decide to execute the entire loop as a
> >> >> single spark job (with groups of iterations mapped to spark tasks).
> If
> >> >> the data is too large and non-partitionable, the parfor loop is
> >> >> executed as a multi-threaded operator in the driver and each worker
> >> >> might spawn several data-parallel spark jobs in the context of the
> >> >> worker's scheduler pool, for operations that don't fit into the
> >> >> driver.
> >> >>

Re: Fair scheduler pool leak

2018-04-07 Thread Matthias Boehm
No, these pools are not created per job but per parfor worker and
thus, used to execute many jobs. For all scripts with a single
top-level parfor this is equivalent to static initialization. However,
yes we create these pools dynamically on demand to avoid unnecessary
initialization and handle scenarios of nested parfor.

At the end of the day, we just want to configure fair scheduling in a
programmatic way without the need for additional configuration files
which is a hassle for a library that is meant to work out-of-the-box.
Simply setting 'spark.scheduler.mode' to FAIR does not do the trick
because we end up with a single default fair scheduler pool in FIFO
mode, which is equivalent to FIFO. Providing a way to set the mode of
the default scheduler would be awesome.

Regarding why fair scheduling showed generally better performance for
out-of-core datasets, I don't have a good answer. My guess was
isolated job scheduling and better locality of in-memory partitions.

Regards,
Matthias

On Sat, Apr 7, 2018 at 8:50 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Sorry, but I'm still not understanding this use case. Are you somehow
> creating additional scheduling pools dynamically as Jobs execute? If so,
> that is a very unusual thing to do. Scheduling pools are intended to be
> statically configured -- initialized, living and dying with the Application.
>
> On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm <mboe...@gmail.com> wrote:
>>
>> Thanks for the clarification Imran - that helped. I was mistakenly
>> assuming that these pools are removed via weak references, as the
>> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
>> the time being, we'll just work around it, but I'll file a
>> nice-to-have improvement JIRA. Also, you're right, we do indeed see these
>> warnings but they're usually hidden when running with ERROR or INFO
>> (due to overwhelming output) log levels.
>>
>> Just to give the context: We use these scheduler pools in SystemML's
>> parallel for loop construct (parfor), which allows combining data- and
>> task-parallel computation. If the data fits into the remote memory
>> budget, the optimizer may decide to execute the entire loop as a
>> single spark job (with groups of iterations mapped to spark tasks). If
>> the data is too large and non-partitionable, the parfor loop is
>> executed as a multi-threaded operator in the driver and each worker
>> might spawn several data-parallel spark jobs in the context of the
>> worker's scheduler pool, for operations that don't fit into the
>> driver.
>>
>> We decided to use these fair scheduler pools (w/ fair scheduling
>> across pools, FIFO per pool) instead of the default FIFO scheduler
>> because it gave us better and more robust performance back in the
>> Spark 1.x line. This was especially true for concurrent jobs over
>> shared input data (e.g., for hyper parameter tuning) and when the data
>> size exceeded aggregate memory. The only downside was that we had to
>> guard against scenarios where concurrent jobs would lazily pull a
>> shared RDD into cache because that led to thread contention on the
>> executors' block managers and spurious replicated in-memory
>> partitions.
>>
>> Regards,
>> Matthias
>>
>> On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid <iras...@cloudera.com> wrote:
>> > Hi Matthias,
>> >
> >> > This doesn't look possible now.  It may be worth filing an improvement
>> > jira
>> > for.
>> >
>> > But I'm trying to understand what you're trying to do a little better.
>> > So
> >> > you intentionally have each thread create a new unique pool when it
>> > submits
>> > a job?  So that pool will just get the default pool configuration, and
>> > you
>> > will see lots of these messages in your logs?
>> >
>> >
>> > https://github.com/apache/spark/blob/6ade5cbb498f6c6ea38779b97f2325d5cf5013f2/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L196-L200
>> >
>> > What is the use case for creating pools this way?
>> >
>> > Also if I understand correctly, it doesn't even matter if the thread
>> > dies --
>> > that pool will still stay around, as the rootPool will retain a
>> > reference to
> >> > it (the pools aren't really tied to specific threads).
>> >
>> > Imran
>> >
>> > On Thu, Apr 5, 2018 at 9:46 PM, Matthias Boehm <mboe...@gmail.com>
>> > wrote:
>> >>
>> >> Hi all,
>> >>
>> >> for concurrent Spark jobs spawned from the driver, we us

Re: Fair scheduler pool leak

2018-04-07 Thread Matthias Boehm
well, the point was "in a programmatic way without the need for
additional configuration files which is a hassle for a library" -
anyway, I appreciate your comments.

Regards,
Matthias

On Sat, Apr 7, 2018 at 3:43 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
>> Providing a way to set the mode of the default scheduler would be awesome.
>
>
> That's trivial: Just use the pool configuration XML file and define a pool
> named "default" with the characteristics that you want (including
> schedulingMode FAIR).
>
> You only get the default construction of the pool named "default" if you
> don't define your own "default".
>
> On Sat, Apr 7, 2018 at 2:32 PM, Matthias Boehm <mboe...@gmail.com> wrote:
>>
>> No, these pools are not created per job but per parfor worker and
>> thus, used to execute many jobs. For all scripts with a single
>> top-level parfor this is equivalent to static initialization. However,
>> yes we create these pools dynamically on demand to avoid unnecessary
>> initialization and handle scenarios of nested parfor.
>>
>> At the end of the day, we just want to configure fair scheduling in a
>> programmatic way without the need for additional configuration files
>> which is a hassle for a library that is meant to work out-of-the-box.
>> Simply setting 'spark.scheduler.mode' to FAIR does not do the trick
>> because we end up with a single default fair scheduler pool in FIFO
>> mode, which is equivalent to FIFO. Providing a way to set the mode of
>> the default scheduler would be awesome.
>>
>> Regarding why fair scheduling showed generally better performance for
>> out-of-core datasets, I don't have a good answer. My guess was
>> isolated job scheduling and better locality of in-memory partitions.
>>
>> Regards,
>> Matthias
>>
>> On Sat, Apr 7, 2018 at 8:50 AM, Mark Hamstra <m...@clearstorydata.com>
>> wrote:
>> > Sorry, but I'm still not understanding this use case. Are you somehow
>> > creating additional scheduling pools dynamically as Jobs execute? If so,
>> > that is a very unusual thing to do. Scheduling pools are intended to be
>> > statically configured -- initialized, living and dying with the
>> > Application.
>> >
>> > On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm <mboe...@gmail.com>
>> > wrote:
>> >>
>> >> Thanks for the clarification Imran - that helped. I was mistakenly
>> >> assuming that these pools are removed via weak references, as the
>> >> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
>> >> the time being, we'll just work around it, but I'll file a
>> >> nice-to-have improvement JIRA. Also, you're right, we do indeed see these
>> >> warnings but they're usually hidden when running with ERROR or INFO
>> >> (due to overwhelming output) log levels.
>> >>
>> >> Just to give the context: We use these scheduler pools in SystemML's
>> >> parallel for loop construct (parfor), which allows combining data- and
>> >> task-parallel computation. If the data fits into the remote memory
>> >> budget, the optimizer may decide to execute the entire loop as a
>> >> single spark job (with groups of iterations mapped to spark tasks). If
>> >> the data is too large and non-partitionable, the parfor loop is
>> >> executed as a multi-threaded operator in the driver and each worker
>> >> might spawn several data-parallel spark jobs in the context of the
>> >> worker's scheduler pool, for operations that don't fit into the
>> >> driver.
>> >>
>> >> We decided to use these fair scheduler pools (w/ fair scheduling
>> >> across pools, FIFO per pool) instead of the default FIFO scheduler
>> >> because it gave us better and more robust performance back in the
>> >> Spark 1.x line. This was especially true for concurrent jobs over
>> >> shared input data (e.g., for hyper parameter tuning) and when the data
>> >> size exceeded aggregate memory. The only downside was that we had to
>> >> guard against scenarios where concurrent jobs would lazily pull a
>> >> shared RDD into cache because that led to thread contention on the
>> >> executors' block managers and spurious replicated in-memory
>> >> partitions.
>> >>
>> >> Regards,
>> >> Matthias
>> >>
>> >> On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid <iras...@cloudera.com>
>> >> wrote:

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
>
> Providing a way to set the mode of the default scheduler would be awesome.


That's trivial: Just use the pool configuration XML file and define a pool
named "default" with the characteristics that you want (including
schedulingMode FAIR).

You only get the default construction of the pool named "default" if you
don't define your own "default".
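
A minimal sketch of such a file (weight and minShare shown at their defaults):

  <?xml version="1.0"?>
  <allocations>
    <pool name="default">
      <schedulingMode>FAIR</schedulingMode>
      <weight>1</weight>
      <minShare>0</minShare>
    </pool>
  </allocations>

and point Spark at it with spark.scheduler.allocation.file.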

On Sat, Apr 7, 2018 at 2:32 PM, Matthias Boehm <mboe...@gmail.com> wrote:

> No, these pools are not created per job but per parfor worker and
> thus, used to execute many jobs. For all scripts with a single
> top-level parfor this is equivalent to static initialization. However,
> yes we create these pools dynamically on demand to avoid unnecessary
> initialization and handle scenarios of nested parfor.
>
> At the end of the day, we just want to configure fair scheduling in a
> programmatic way without the need for additional configuration files
> which is a hassle for a library that is meant to work out-of-the-box.
> Simply setting 'spark.scheduler.mode' to FAIR does not do the trick
> because we end up with a single default fair scheduler pool in FIFO
> mode, which is equivalent to FIFO. Providing a way to set the mode of
> the default scheduler would be awesome.
>
> Regarding why fair scheduling showed generally better performance for
> out-of-core datasets, I don't have a good answer. My guess was
> isolated job scheduling and better locality of in-memory partitions.
>
> Regards,
> Matthias
>
> On Sat, Apr 7, 2018 at 8:50 AM, Mark Hamstra <m...@clearstorydata.com>
> wrote:
> > Sorry, but I'm still not understanding this use case. Are you somehow
> > creating additional scheduling pools dynamically as Jobs execute? If so,
> > that is a very unusual thing to do. Scheduling pools are intended to be
> > statically configured -- initialized, living and dying with the
> Application.
> >
> > On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm <mboe...@gmail.com>
> wrote:
> >>
> >> Thanks for the clarification Imran - that helped. I was mistakenly
> >> assuming that these pools are removed via weak references, as the
> >> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
> >> the time being, we'll just work around it, but I'll file a
> >> nice-to-have improvement JIRA. Also, you're right, we do indeed see these
> >> warnings but they're usually hidden when running with ERROR or INFO
> >> (due to overwhelming output) log levels.
> >>
> >> Just to give the context: We use these scheduler pools in SystemML's
> >> parallel for loop construct (parfor), which allows combining data- and
> >> task-parallel computation. If the data fits into the remote memory
> >> budget, the optimizer may decide to execute the entire loop as a
> >> single spark job (with groups of iterations mapped to spark tasks). If
> >> the data is too large and non-partitionable, the parfor loop is
> >> executed as a multi-threaded operator in the driver and each worker
> >> might spawn several data-parallel spark jobs in the context of the
> >> worker's scheduler pool, for operations that don't fit into the
> >> driver.
> >>
> >> We decided to use these fair scheduler pools (w/ fair scheduling
> >> across pools, FIFO per pool) instead of the default FIFO scheduler
> >> because it gave us better and more robust performance back in the
> >> Spark 1.x line. This was especially true for concurrent jobs over
> >> shared input data (e.g., for hyper parameter tuning) and when the data
> >> size exceeded aggregate memory. The only downside was that we had to
> >> guard against scenarios where concurrent jobs would lazily pull a
> >> shared RDD into cache because that led to thread contention on the
> >> executors' block managers and spurious replicated in-memory
> >> partitions.
> >>
> >> Regards,
> >> Matthias
> >>
> >> On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid <iras...@cloudera.com>
> wrote:
> >> > Hi Matthias,
> >> >
> >> > This doesn't look possible now.  It may be worth filing an
> improvement
> >> > jira
> >> > for.
> >> >
> >> > But I'm trying to understand what you're trying to do a little better.
> >> > So
> >> > you intentionally have each thread create a new unique pool when it
> >> > submits
> >> > a job?  So that pool will just get the default pool configuration, and
> >> > you
> >> > will see lots of these messages in your logs?

Re: Fair scheduler pool leak

2018-04-07 Thread Mark Hamstra
Sorry, but I'm still not understanding this use case. Are you somehow
creating additional scheduling pools dynamically as Jobs execute? If so,
that is a very unusual thing to do. Scheduling pools are intended to be
statically configured -- initialized, living and dying with the
Application.

On Sat, Apr 7, 2018 at 12:33 AM, Matthias Boehm <mboe...@gmail.com> wrote:

> Thanks for the clarification Imran - that helped. I was mistakenly
> assuming that these pools are removed via weak references, as the
> ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
> the time being, we'll just work around it, but I'll file a
> nice-to-have improvement JIRA. Also, you're right, we do indeed see these
> warnings but they're usually hidden when running with ERROR or INFO
> (due to overwhelming output) log levels.
>
> Just to give the context: We use these scheduler pools in SystemML's
> parallel for loop construct (parfor), which allows combining data- and
> task-parallel computation. If the data fits into the remote memory
> budget, the optimizer may decide to execute the entire loop as a
> single spark job (with groups of iterations mapped to spark tasks). If
> the data is too large and non-partitionable, the parfor loop is
> executed as a multi-threaded operator in the driver and each worker
> might spawn several data-parallel spark jobs in the context of the
> worker's scheduler pool, for operations that don't fit into the
> driver.
>
> We decided to use these fair scheduler pools (w/ fair scheduling
> across pools, FIFO per pool) instead of the default FIFO scheduler
> because it gave us better and more robust performance back in the
> Spark 1.x line. This was especially true for concurrent jobs over
> shared input data (e.g., for hyper parameter tuning) and when the data
> size exceeded aggregate memory. The only downside was that we had to
> guard against scenarios where concurrent jobs would lazily pull a
> shared RDD into cache because that led to thread contention on the
> executors' block managers and spurious replicated in-memory
> partitions.
>
> Regards,
> Matthias
>
> On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid <iras...@cloudera.com> wrote:
> > Hi Matthias,
> >
> > This doesn't look possible now.  It may be worth filing an improvement
> jira
> > for.
> >
> > But I'm trying to understand what you're trying to do a little better.
> So
> > you intentionally have each thread create a new unique pool when it
> submits
> > a job?  So that pool will just get the default pool configuration, and
> you
> > will see lots of these messages in your logs?
> >
> > https://github.com/apache/spark/blob/6ade5cbb498f6c6ea38779b97f2325
> d5cf5013f2/core/src/main/scala/org/apache/spark/
> scheduler/SchedulableBuilder.scala#L196-L200
> >
> > What is the use case for creating pools this way?
> >
> > Also if I understand correctly, it doesn't even matter if the thread
> dies --
> > that pool will still stay around, as the rootPool will retain a
> reference to
> it (the pools aren't really tied to specific threads).
> >
> > Imran
> >
> > On Thu, Apr 5, 2018 at 9:46 PM, Matthias Boehm <mboe...@gmail.com>
> wrote:
> >>
> >> Hi all,
> >>
> >> for concurrent Spark jobs spawned from the driver, we use Spark's fair
> >> scheduler pools, which are set and unset in a thread-local manner by
> >> each worker thread. Typically (for rather long jobs), this works very
> >> well. Unfortunately, in an application with lots of very short
> >> parallel sections, we see 1000s of these pools remaining in the Spark
> >> UI, which indicates some kind of leak. Each worker cleans up its local
> >> property by setting it to null, but not all pools are properly
> >> removed. I've checked and reproduced this behavior with Spark 2.1-2.3.
> >>
> >> Now my question: Is there a way to explicitly remove these pools,
> >> either globally, or locally while the thread is still alive?
> >>
> >> Regards,
> >> Matthias
> >>


Re: Fair scheduler pool leak

2018-04-07 Thread Matthias Boehm
Thanks for the clarification Imran - that helped. I was mistakenly
assuming that these pools are removed via weak references, as the
ContextCleaner does for RDDs, broadcasts, and accumulators, etc. For
the time being, we'll just work around it, but I'll file a
nice-to-have improvement JIRA. Also, you're right, we do indeed see these
warnings but they're usually hidden when running with ERROR or INFO
(due to overwhelming output) log levels.

Just to give the context: We use these scheduler pools in SystemML's
parallel for loop construct (parfor), which allows combining data- and
task-parallel computation. If the data fits into the remote memory
budget, the optimizer may decide to execute the entire loop as a
single spark job (with groups of iterations mapped to spark tasks). If
the data is too large and non-partitionable, the parfor loop is
executed as a multi-threaded operator in the driver and each worker
might spawn several data-parallel spark jobs in the context of the
worker's scheduler pool, for operations that don't fit into the
driver.

We decided to use these fair scheduler pools (w/ fair scheduling
across pools, FIFO per pool) instead of the default FIFO scheduler
because it gave us better and more robust performance back in the
Spark 1.x line. This was especially true for concurrent jobs over
shared input data (e.g., for hyper parameter tuning) and when the data
size exceeded aggregate memory. The only downside was that we had to
guard against scenarios where concurrent jobs would lazily pull a
shared RDD into cache because that led to thread contention on the
executors' block managers and spurious replicated in-memory
partitions.

Regards,
Matthias

On Fri, Apr 6, 2018 at 8:08 AM, Imran Rashid <iras...@cloudera.com> wrote:
> Hi Matthias,
>
> This doesn't look possible now.  It may be worth filing an improvement jira
> for.
>
> But I'm trying to understand what you're trying to do a little better.  So
> you intentionally have each thread create a new unique pool when it submits
> a job?  So that pool will just get the default pool configuration, and you
> will see lots of these messages in your logs?
>
> https://github.com/apache/spark/blob/6ade5cbb498f6c6ea38779b97f2325d5cf5013f2/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L196-L200
>
> What is the use case for creating pools this way?
>
> Also if I understand correctly, it doesn't even matter if the thread dies --
> that pool will still stay around, as the rootPool will retain a reference to
> it (the pools aren't really tied to specific threads).
>
> Imran
>
> On Thu, Apr 5, 2018 at 9:46 PM, Matthias Boehm <mboe...@gmail.com> wrote:
>>
>> Hi all,
>>
>> for concurrent Spark jobs spawned from the driver, we use Spark's fair
>> scheduler pools, which are set and unset in a thread-local manner by
>> each worker thread. Typically (for rather long jobs), this works very
>> well. Unfortunately, in an application with lots of very short
>> parallel sections, we see 1000s of these pools remaining in the Spark
>> UI, which indicates some kind of leak. Each worker cleans up its local
>> property by setting it to null, but not all pools are properly
>> removed. I've checked and reproduced this behavior with Spark 2.1-2.3.
>>
>> Now my question: Is there a way to explicitly remove these pools,
>> either globally, or locally while the thread is still alive?
>>
>> Regards,
>> Matthias
>>



Re: Fair scheduler pool leak

2018-04-06 Thread Imran Rashid
Hi Matthias,

This doesn't look possible now.  It may be worth filing an improvement
jira for.

But I'm trying to understand what you're trying to do a little better.  So
you intentionally have each thread create a new unique pool when it
submits a job?  So that pool will just get the default pool configuration,
and you will see lots of these messages in your logs?

https://github.com/apache/spark/blob/6ade5cbb498f6c6ea38779b97f2325d5cf5013f2/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L196-L200

What is the use case for creating pools this way?

Also if I understand correctly, it doesn't even matter if the thread dies
-- that pool will still stay around, as the rootPool will retain a
reference to it (the pools aren't really tied to specific
threads).
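
FWIW, you can see which pools are lingering from the driver with something
like the sketch below (getAllPools is a DeveloperApi, so treat it as
illustrative):

  // rough sketch: print every scheduler pool currently registered with the context
  def dumpPools(sc: org.apache.spark.SparkContext): Unit = {
    sc.getAllPools.foreach { p =>
      println(s"pool=${p.name} mode=${p.schedulingMode} runningTasks=${p.runningTasks}")
    }
  }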

Imran

On Thu, Apr 5, 2018 at 9:46 PM, Matthias Boehm <mboe...@gmail.com> wrote:

> Hi all,
>
> for concurrent Spark jobs spawned from the driver, we use Spark's fair
> scheduler pools, which are set and unset in a thread-local manner by
> each worker thread. Typically (for rather long jobs), this works very
> well. Unfortunately, in an application with lots of very short
> parallel sections, we see 1000s of these pools remaining in the Spark
> UI, which indicates some kind of leak. Each worker cleans up its local
> property by setting it to null, but not all pools are properly
> removed. I've checked and reproduced this behavior with Spark 2.1-2.3.
>
> Now my question: Is there a way to explicitly remove these pools,
> either globally, or locally while the thread is still alive?
>
> Regards,
> Matthias
>


Fair scheduler pool leak

2018-04-05 Thread Matthias Boehm
Hi all,

for concurrent Spark jobs spawned from the driver, we use Spark's fair
scheduler pools, which are set and unset in a thread-local manner by
each worker thread. Typically (for rather long jobs), this works very
well. Unfortunately, in an application with lots of very short
parallel sections, we see 1000s of these pools remaining in the Spark
UI, which indicates some kind of leak. Each worker cleans up its local
property by setting it to null, but not all pools are properly
removed. I've checked and reproduced this behavior with Spark 2.1-2.3.
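
For context, each worker thread essentially does the following (simplified
sketch; the pool name is illustrative):

  // simplified sketch of the per-worker pattern described above
  def runWorker(sc: org.apache.spark.SparkContext, workerId: Int): Unit = {
    sc.setLocalProperty("spark.scheduler.pool", s"parfor-pool-$workerId")
    try {
      // ... submit one or more data-parallel Spark jobs from this thread ...
    } finally {
      // clear the thread-local property; the pool itself still lingers in the UI
      sc.setLocalProperty("spark.scheduler.pool", null)
    }
  }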

Now my question: Is there a way to explicitly remove these pools,
either globally, or locally while the thread is still alive?

Regards,
Matthias




Re: fair scheduler

2014-08-12 Thread fireflyc
@Crystal
You can use Spark on YARN. YARN has a fair scheduler; modify yarn-site.xml to enable it.
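
For example, roughly (the standard Hadoop setting; check your Hadoop
version's docs for the exact value):

  <!-- yarn-site.xml: switch the ResourceManager to the Fair Scheduler -->
  <property>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>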

Sent from my iPad

 On Aug 11, 2014, at 6:49, Matei Zaharia matei.zaha...@gmail.com wrote:
 
 Hi Crystal,
 
 The fair scheduler is only for jobs running concurrently within the same 
 SparkContext (i.e. within an application), not for separate applications on 
 the standalone cluster manager. It has no effect there. To run more of those 
 concurrently, you need to set a cap on how many cores they each grab with 
 spark.cores.max.
 
 Matei
 
 On August 10, 2014 at 12:13:08 PM, 李宜芳 (xuite...@gmail.com) wrote:
 
 Hi  
 
 I am trying to switch from FIFO to FAIR with standalone mode.  
 
 my environment:  
 hadoop 1.2.1  
 spark 0.8.0 using standalone mode
 
 and I modified the code:
 
 ClusterScheduler.scala - System.getProperty("spark.scheduler.mode",
 "FAIR"))
 SchedulerBuilder.scala -  
 val DEFAULT_SCHEDULING_MODE = SchedulingMode.FAIR  
 
 LocalScheduler.scala -
 System.getProperty("spark.scheduler.mode", "FAIR")
 
 spark-env.sh -  
 export SPARK_JAVA_OPTS=-Dspark.scheduler.mode=FAIR  
 export SPARK_JAVA_OPTS= -Dspark.scheduler.mode=FAIR ./run-example  
 org.apache.spark.examples.SparkPi spark://streaming1:7077  
 
 
 but it's not working.
 I want to switch from FIFO to FAIR.
 How can I do it?
 
 Regards  
 Crystal Lee  
 





fair scheduler

2014-08-10 Thread 李宜芳
Hi

I am trying to switch from FIFO to FAIR with standalone mode.

my environment:
hadoop 1.2.1
spark 0.8.0 using standalone mode

and I modified the code:

ClusterScheduler.scala  - System.getProperty("spark.scheduler.mode",
"FAIR"))
SchedulerBuilder.scala  -
val DEFAULT_SCHEDULING_MODE = SchedulingMode.FAIR

LocalScheduler.scala -
System.getProperty("spark.scheduler.mode", "FAIR")

spark-env.sh -
export SPARK_JAVA_OPTS=-Dspark.scheduler.mode=FAIR
export SPARK_JAVA_OPTS= -Dspark.scheduler.mode=FAIR ./run-example
org.apache.spark.examples.SparkPi spark://streaming1:7077


but it's not working.
I want to switch from FIFO to FAIR.
How can I do it?

Regards
Crystal Lee