Re: Spark job only starts tasks on a single node

2017-12-07 Thread Ji Yan
This used to work. The only thing that has changed is that the Mesos installed
in my Spark executor image is a different version from before. My Spark executor
runs in a container, and that image has Mesos installed. The version of
that Mesos is actually different from the version of the Mesos master. Not sure
if that is the problem though. I am trying to bring the old Mesos version back
into the Spark executor image. Does anyone know whether the Mesos agent and
master running different versions could lead to this problem?

On Thu, Dec 7, 2017 at 11:34 AM, Art Rand <art.r...@gmail.com> wrote:

> Sounds a little like the driver got one offer when it was using zero
> resources, and then it's not getting any more. How many frameworks (and which)
> are running on the cluster? The Mesos Master log should say which
> frameworks are getting offers, and should help diagnose the problem.
>
> A
>
> On Thu, Dec 7, 2017 at 10:18 AM, Susan X. Huynh <xhu...@mesosphere.io>
> wrote:
>
>> Sounds strange. Maybe it has to do with the job itself? What kind of job
>> is it? Have you gotten it to run on more than one node before? What's in
>> the spark-submit command?
>>
>> Susan
>>
>> On Wed, Dec 6, 2017 at 11:21 AM, Ji Yan <ji...@drive.ai> wrote:
>>
>>> I am sure that the other agents have plenty of resources, but I
>>> don't know why Spark only scheduled executors on a single node, up to
>>> that node's capacity (it is a different node every time I run, btw).
>>>
>>> I checked the DEBUG log from the Spark driver and didn't see any mention of
>>> declines. But from the log, it looks like it has only accepted one offer from
>>> Mesos.
>>>
>>> Also, it looks like no special role is required on the Spark side!
>>>
>>> On Wed, Dec 6, 2017 at 5:57 AM, Art Rand <art.r...@gmail.com> wrote:
>>>
>>>> Hello Ji,
>>>>
>>>> Spark will launch Executors round-robin on offers, so when the
>>>> resources on an agent get broken into multiple resource offers it's
>>>> possible that many Executors get placed on a single agent. However, from
>>>> your description, it's not clear why your other agents do not get Executors
>>>> scheduled on them. It's possible that the offers from your other agents are
>>>> insufficient in some way. The Mesos MASTER log should show offers being
>>>> declined by your Spark Driver; do you see that? If you have DEBUG level
>>>> logging in your Spark driver you should also see offers being declined
>>>> <https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
>>>> there. Finally, if your Spark framework isn't receiving any resource offers,
>>>> it could be because of the roles you have established on your agents or
>>>> quota set on other frameworks; have you set up any of that? Hope this
>>>> helps!
>>>>
>>>> Art
>>>>
>>>> On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am running Spark 2.0 on Mesos 1.1. I was trying to split up my job
>>>>> onto several nodes. I try to set the number of executors by the formula
>>>>> (spark.cores.max / spark.executor.cores). The behavior I saw was that Spark
>>>>> will try to fill up one Mesos node with as many executors as it can, then it
>>>>> stops going to other Mesos nodes even though it has not yet scheduled
>>>>> all the executors I asked for! This is super weird!
>>>>>
>>>>> Did anyone notice this behavior before? Any help appreciated!
>>>>>
>>>>> Ji
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Susan X. Huynh
>> Software engineer, Data Agility
>> xhu...@mesosphere.com
>>
>
>



Re: Spark job only starts tasks on a single node

2017-12-06 Thread Ji Yan
I am sure that the other agents have plenty of resources, but I
don't know why Spark only scheduled executors on a single node, up to
that node's capacity (it is a different node every time I run, btw).

I checked the DEBUG log from the Spark driver and didn't see any mention of
declines. But from the log, it looks like it has only accepted one offer from
Mesos.

Also, it looks like no special role is required on the Spark side!

On Wed, Dec 6, 2017 at 5:57 AM, Art Rand <art.r...@gmail.com> wrote:

> Hello Ji,
>
> Spark will launch Executors round-robin on offers, so when the resources
> on an agent get broken into multiple resource offers it's possible that
> many Executors get placed on a single agent. However, from your
> description, it's not clear why your other agents do not get Executors
> scheduled on them. It's possible that the offers from your other agents are
> insufficient in some way. The Mesos MASTER log should show offers being
> declined by your Spark Driver; do you see that? If you have DEBUG level
> logging in your Spark driver you should also see offers being declined
> <https://github.com/apache/spark/blob/193555f79cc73873613674a09a7c371688b6dbc7/resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala#L576>
> there. Finally, if your Spark framework isn't receiving any resource offers,
> it could be because of the roles you have established on your agents or
> quota set on other frameworks; have you set up any of that? Hope this helps!
>
> Art
>
> On Tue, Dec 5, 2017 at 10:45 PM, Ji Yan <ji...@drive.ai> wrote:
>
>> Hi all,
>>
>> I am running Spark 2.0 on Mesos 1.1. I was trying to split up my job onto
>> several nodes. I try to set the number of executors by the formula
>> (spark.cores.max / spark.executor.cores). The behavior I saw was that Spark
>> will try to fill up one Mesos node with as many executors as it can, then it
>> stops going to other Mesos nodes even though it has not yet scheduled
>> all the executors I asked for! This is super weird!
>>
>> Did anyone notice this behavior before? Any help appreciated!
>>
>> Ji
>>
>>
>
>



Spark job only starts tasks on a single node

2017-12-05 Thread Ji Yan
Hi all,

I am running Spark 2.0 on Mesos 1.1. I was trying to split up my job onto
several nodes. I try to set the number of executors by the formula
(spark.cores.max / spark.executor.cores). The behavior I saw was that Spark
will try to fill up one Mesos node with as many executors as it can, then it
stops going to other Mesos nodes even though it has not yet scheduled
all the executors I asked for! This is super weird!

Did anyone notice this behavior before? Any help appreciated!

Ji
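For reference, the executor count being targeted here comes out of spark.cores.max
divided by spark.executor.cores, as described above. A minimal spark-submit sketch of
that setup (the master URL, core counts, memory size, and script name are placeholders,
not values from this thread):

  # Total cores across all executors, cores per executor, and memory per executor.
  # With these placeholder numbers Spark would ask for roughly 32 / 4 = 8 executors.
  spark-submit \
    --master mesos://mesos-master:5050 \
    --conf spark.cores.max=32 \
    --conf spark.executor.cores=4 \
    --conf spark.executor.memory=4g \
    my_job.py

With these numbers Spark needs about eight executors; whether they spread across agents
depends on the offers it receives, which is exactly the behavior discussed in this thread.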



Spark executor on Docker runs as root

2017-02-23 Thread Ji Yan
Dear spark users,

When running Spark on Docker, the Spark executors by default always run as
root. Is there a way to change this to another user?

Thanks
Ji



Will Spark ever run the same task at the same time

2017-02-16 Thread Ji Yan
Dear spark users,

Is there any mechanism in Spark that does not guarantee idempotent
execution? For example, for stragglers, the framework might start another task
assuming the straggler is slow while the straggler is still running. This
would be annoying sometimes, say when the task is writing to a file, because
having the same task running twice at the same time may corrupt the file. From
the documentation page, I know that Spark's speculative execution mode is
turned off by default. Does anyone know of any other mechanism in Spark that
may cause problems in a scenario like this?

Thanks
Ji
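For context, the relaunching of suspected stragglers described above is Spark's
speculative execution, controlled by the spark.speculation properties. A minimal sketch
of making that choice explicit at submit time (the property names are standard Spark
configuration keys; the threshold values shown are only illustrative):

  # Speculation is off by default; this just makes the choice explicit.
  spark-submit \
    --conf spark.speculation=false \
    my_job.py

  # If speculation is wanted, the thresholds for what counts as a straggler can be tuned:
  #   spark.speculation=true
  #   spark.speculation.multiplier=1.5   # how many times slower than the median
  #   spark.speculation.quantile=0.75    # fraction of tasks that must finish first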



Question about best Spark tuning

2017-02-09 Thread Ji Yan
Dear spark users,

From this page, https://spark.apache.org/docs/latest/tuning.html, which
offers a recommendation on setting the level of parallelism:

> Clusters will not be fully utilized unless you set the level of parallelism
> for each operation high enough. Spark automatically sets the number of
> “map” tasks to run on each file according to its size (though you can
> control it through optional parameters to SparkContext.textFile, etc),
> and for distributed “reduce” operations, such as groupByKey and
> reduceByKey, it uses the largest parent RDD’s number of partitions. You
> can pass the level of parallelism as a second argument (see the
> spark.PairRDDFunctions documentation),
> or set the config property spark.default.parallelism to change the
> default. *In general, we recommend 2-3 tasks per CPU core in your cluster*
> .


Do people have a general theory/intuition about why it is a good idea to
have 2-3 tasks running per CPU core?

Thanks
Ji
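For reference, the setting the quoted passage refers to can be applied globally or per
operation. A minimal sketch with placeholder numbers (the usual intuition behind 2-3
tasks per core is that smaller, more numerous tasks let cores that finish early pick up
new work, so a few slow or skewed tasks do not leave the rest of the cluster idle):

  # Suppose the cluster has 16 cores in total; 2-3 tasks per core suggests 32-48 partitions.
  spark-submit \
    --conf spark.default.parallelism=48 \
    my_job.py

  # Per-operation alternative inside the job (the second argument is the partition count):
  #   rdd.reduceByKey(func, 48)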



Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread ji yan
got it, thanks for clarifying!

On Thu, Feb 2, 2017 at 2:57 PM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> Yes, that's expected.  spark.executor.cores sizes a single executor.  It
> doesn't limit the number of executors.  For that, you need spark.cores.max
> (--total-executor-cores).
>
> And rdd.parallelize does not specify the number of executors.  It
> specifies the number of partitions, which relates to the number of tasks,
> not executors.  Unless you're running with dynamic allocation enabled, the
> number of executors for your job is static, and determined at start time.
> It's not influenced by your job itself.
>
>
> On Thu, Feb 2, 2017 at 2:42 PM, Ji Yan <ji...@drive.ai> wrote:
>
>> I tried setting spark.executor.cores per executor, but Spark seems to be
>> spinning up as many executors as possible, up to spark.cores.max or however
>> many CPU cores are available on the cluster, and this may be undesirable
>> because the number of executors in rdd.parallelize(collection, # of
>> partitions) is being overridden
>>
>> On Thu, Feb 2, 2017 at 1:30 PM, Michael Gummelt <mgumm...@mesosphere.io>
>> wrote:
>>
>>> As of Spark 2.0, Mesos mode does support setting cores on the executor
>>> level, but you might need to set the property directly (--conf
>>> spark.executor.cores=).  I've written about this here:
>>> https://docs.mesosphere.com/1.8/usage/service-guides/spark/job-scheduling/.
>>> That doc is for DC/OS, but the configuration is the
>>> same.
>>>
>>> On Thu, Feb 2, 2017 at 1:06 PM, Ji Yan <ji...@drive.ai> wrote:
>>>
>>>> I was mainly confused about why this is the case with memory, while with CPU
>>>> cores, it is not specified at the per-executor level
>>>>
>>>> On Thu, Feb 2, 2017 at 1:02 PM, Michael Gummelt <mgumm...@mesosphere.io
>>>> > wrote:
>>>>
>>>>> It sounds like you've answered your own question, right?
>>>>> --executor-memory means the memory per executor.  If you have no executor
>>>>> w/ 200GB memory, then the driver will accept no offers.
>>>>>
>>>>> On Thu, Feb 2, 2017 at 1:01 PM, Ji Yan <ji...@drive.ai> wrote:
>>>>>
>>>>>> Sorry, to clarify, I was using --executor-memory for memory,
>>>>>> and --total-executor-cores for CPU cores
>>>>>>
>>>>>> On Thu, Feb 2, 2017 at 12:56 PM, Michael Gummelt <
>>>>>> mgumm...@mesosphere.io> wrote:
>>>>>>
>>>>>>> What CLI args are you referring to?  I'm aware of spark-submit's
>>>>>>> arguments (--executor-memory, --total-executor-cores, and 
>>>>>>> --executor-cores)
>>>>>>>
>>>>>>> On Thu, Feb 2, 2017 at 12:41 PM, Ji Yan <ji...@drive.ai> wrote:
>>>>>>>
>>>>>>>> I have done an experiment on this today. It shows that only CPUs are
>>>>>>>> tolerant of insufficient cluster size when a job starts. On my
>>>>>>>> cluster, I have 180 GB of memory and 64 cores. When I run spark-submit
>>>>>>>> (on Mesos) with --cpu_cores set to 1000, the job starts up with 64
>>>>>>>> cores. But when I set --memory to 200 GB, the job fails to start with
>>>>>>>> "Initial job has not accepted any resources; check your cluster UI to
>>>>>>>> ensure that workers are registered and have sufficient resources"
>>>>>>>>
>>>>>>>> Also it is confusing to me that --cpu_cores specifies the number of
>>>>>>>> CPU cores across all executors, but --memory specifies per-executor
>>>>>>>> memory requirement.
>>>>>>>>
>>>>>>>> On Mon, Jan 30, 2017 at 11:34 AM, Michael Gummelt <
>>>>>>>> mgumm...@mesosphere.io> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jan 30, 2017 at 9:47 AM, Ji Yan <ji...@drive.ai> wrote:
>>>>>>>>>
>>>>>>>>>> Tasks begin scheduling as soon as the first executor comes up
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks all for the clarification. Is this the de

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
I tried setting spark.executor.cores per executor, but Spark seems to be
spinning up as many executors as possible, up to spark.cores.max or however
many CPU cores are available on the cluster, and this may be undesirable
because the number of executors in rdd.parallelize(collection, # of
partitions) is being overridden

On Thu, Feb 2, 2017 at 1:30 PM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> As of Spark 2.0, Mesos mode does support setting cores on the executor
> level, but you might need to set the property directly (--conf
> spark.executor.cores=).  I've written about this here:
> https://docs.mesosphere.com/1.8/usage/service-guides/spark/job-scheduling/.
> That doc is for DC/OS, but the configuration is the same.
>
> On Thu, Feb 2, 2017 at 1:06 PM, Ji Yan <ji...@drive.ai> wrote:
>
>> I was mainly confused about why this is the case with memory, while with CPU
>> cores, it is not specified at the per-executor level
>>
>> On Thu, Feb 2, 2017 at 1:02 PM, Michael Gummelt <mgumm...@mesosphere.io>
>> wrote:
>>
>>> It sounds like you've answered your own question, right?
>>> --executor-memory means the memory per executor.  If you have no executor
>>> w/ 200GB memory, then the driver will accept no offers.
>>>
>>> On Thu, Feb 2, 2017 at 1:01 PM, Ji Yan <ji...@drive.ai> wrote:
>>>
>>>> Sorry, to clarify, I was using --executor-memory for memory,
>>>> and --total-executor-cores for CPU cores
>>>>
>>>> On Thu, Feb 2, 2017 at 12:56 PM, Michael Gummelt <
>>>> mgumm...@mesosphere.io> wrote:
>>>>
>>>>> What CLI args are you referring to?  I'm aware of spark-submit's
>>>>> arguments (--executor-memory, --total-executor-cores, and 
>>>>> --executor-cores)
>>>>>
>>>>> On Thu, Feb 2, 2017 at 12:41 PM, Ji Yan <ji...@drive.ai> wrote:
>>>>>
>>>>>> I have done an experiment on this today. It shows that only CPUs are
>>>>>> tolerant of insufficient cluster size when a job starts. On my cluster, I
>>>>>> have 180 GB of memory and 64 cores. When I run spark-submit (on Mesos)
>>>>>> with --cpu_cores set to 1000, the job starts up with 64 cores. But when I
>>>>>> set --memory to 200 GB, the job fails to start with "Initial job has
>>>>>> not accepted any resources; check your cluster UI to ensure that workers
>>>>>> are registered and have sufficient resources"
>>>>>>
>>>>>> Also it is confusing to me that --cpu_cores specifies the number of
>>>>>> CPU cores across all executors, but --memory specifies per-executor
>>>>>> memory requirement.
>>>>>>
>>>>>> On Mon, Jan 30, 2017 at 11:34 AM, Michael Gummelt <
>>>>>> mgumm...@mesosphere.io> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 30, 2017 at 9:47 AM, Ji Yan <ji...@drive.ai> wrote:
>>>>>>>
>>>>>>>> Tasks begin scheduling as soon as the first executor comes up
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks all for the clarification. Is this the default behavior of
>>>>>>>> Spark on Mesos today? I think this is what we are looking for because
>>>>>>>> sometimes a job can take up lots of resources and later jobs cannot get
>>>>>>>> all the resources they ask for. If a Spark job starts with only a
>>>>>>>> subset of resources that it asks for, does it know to expand its 
>>>>>>>> resources
>>>>>>>> later when more resources become available?
>>>>>>>>
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Launch each executor with at least 1GB RAM, but if mesos offers 2GB
>>>>>>>>> at some moment, then launch an executor with 2GB RAM
>>>>>>>>
>>>>>>>>
>>>>>>>> This is less useful in our use case. But I am also quite interested
>>>>>>>> in cases in which this could be helpful. I think this will also help 
>>>>>>>> with
>>>>>>>> overall resource utilization on the cluster if, when another job

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
I was mainly confused about why this is the case with memory, while with CPU cores,
it is not specified at the per-executor level.

On Thu, Feb 2, 2017 at 1:02 PM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> It sounds like you've answered your own question, right?
> --executor-memory means the memory per executor.  If you have no executor
> w/ 200GB memory, then the driver will accept no offers.
>
> On Thu, Feb 2, 2017 at 1:01 PM, Ji Yan <ji...@drive.ai> wrote:
>
>> Sorry, to clarify, I was using --executor-memory for memory,
>> and --total-executor-cores for CPU cores
>>
>> On Thu, Feb 2, 2017 at 12:56 PM, Michael Gummelt <mgumm...@mesosphere.io>
>> wrote:
>>
>>> What CLI args are you referring to?  I'm aware of spark-submit's
>>> arguments (--executor-memory, --total-executor-cores, and --executor-cores)
>>>
>>> On Thu, Feb 2, 2017 at 12:41 PM, Ji Yan <ji...@drive.ai> wrote:
>>>
>>>> I have done an experiment on this today. It shows that only CPUs are
>>>> tolerant of insufficient cluster size when a job starts. On my cluster, I
>>>> have 180 GB of memory and 64 cores. When I run spark-submit (on Mesos)
>>>> with --cpu_cores set to 1000, the job starts up with 64 cores. But when I
>>>> set --memory to 200 GB, the job fails to start with "Initial job has
>>>> not accepted any resources; check your cluster UI to ensure that workers
>>>> are registered and have sufficient resources"
>>>>
>>>> Also it is confusing to me that --cpu_cores specifies the number of CPU
>>>> cores across all executors, but --memory specifies per-executor memory
>>>> requirement.
>>>>
>>>> On Mon, Jan 30, 2017 at 11:34 AM, Michael Gummelt <
>>>> mgumm...@mesosphere.io> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Mon, Jan 30, 2017 at 9:47 AM, Ji Yan <ji...@drive.ai> wrote:
>>>>>
>>>>>> Tasks begin scheduling as soon as the first executor comes up
>>>>>>
>>>>>>
>>>>>> Thanks all for the clarification. Is this the default behavior of
>>>>>> Spark on Mesos today? I think this is what we are looking for because
>>>>>> sometimes a job can take up lots of resources and later jobs cannot get
>>>>>> all the resources they ask for. If a Spark job starts with only a
>>>>>> subset of resources that it asks for, does it know to expand its 
>>>>>> resources
>>>>>> later when more resources become available?
>>>>>>
>>>>>
>>>>> Yes.
>>>>>
>>>>>
>>>>>>
>>>>>> Launch each executor with at least 1GB RAM, but if mesos offers 2GB
>>>>>>> at some moment, then launch an executor with 2GB RAM
>>>>>>
>>>>>>
>>>>>> This is less useful in our use case. But I am also quite interested
>>>>>> in cases in which this could be helpful. I think this will also help with
>>>>>> overall resource utilization on the cluster if, when another job starts up
>>>>>> that has a hard requirement on resources, the extra resources to the 
>>>>>> first
>>>>>> job can be flexibly re-allocated to the second job.
>>>>>>
>>>>>> On Sat, Jan 28, 2017 at 2:32 PM, Michael Gummelt <
>>>>>> mgumm...@mesosphere.io> wrote:
>>>>>>
>>>>>>> We've talked about that, but it hasn't become a priority because we
>>>>>>> haven't had a driving use case.  If anyone has a good argument for
>>>>>>> "variable" resource allocation like this, please let me know.
>>>>>>>
>>>>>>> On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin <linshuai2...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> An alternative behavior is to launch the job with the best resource
>>>>>>>>> offer Mesos is able to give
>>>>>>>>
>>>>>>>>
>>>>>>>> Michael has just made an excellent explanation about dynamic
>>>>>>>> allocation support in mesos. But IIUC, what you want to achieve is
>>>>>>>> something like (using RAM as an example) : "Launch each executor with 
>>>>>>>

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
Sorry, to clarify, I was using --executor-memory for memory,
and --total-executor-cores for CPU cores.

On Thu, Feb 2, 2017 at 12:56 PM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> What CLI args are you referring to?  I'm aware of spark-submit's
> arguments (--executor-memory, --total-executor-cores, and --executor-cores)
>
> On Thu, Feb 2, 2017 at 12:41 PM, Ji Yan <ji...@drive.ai> wrote:
>
>> I have done an experiment on this today. It shows that only CPUs are
>> tolerant of insufficient cluster size when a job starts. On my cluster, I
>> have 180 GB of memory and 64 cores. When I run spark-submit (on Mesos)
>> with --cpu_cores set to 1000, the job starts up with 64 cores. But when I
>> set --memory to 200 GB, the job fails to start with "Initial job has not
>> accepted any resources; check your cluster UI to ensure that workers are
>> registered and have sufficient resources"
>>
>> Also it is confusing to me that --cpu_cores specifies the number of CPU
>> cores across all executors, but --memory specifies per-executor memory
>> requirement.
>>
>> On Mon, Jan 30, 2017 at 11:34 AM, Michael Gummelt <mgumm...@mesosphere.io
>> > wrote:
>>
>>>
>>>
>>> On Mon, Jan 30, 2017 at 9:47 AM, Ji Yan <ji...@drive.ai> wrote:
>>>
>>>> Tasks begin scheduling as soon as the first executor comes up
>>>>
>>>>
>>>> Thanks all for the clarification. Is this the default behavior of Spark
>>>> on Mesos today? I think this is what we are looking for because sometimes a
>>>> job can take up lots of resources and later jobs cannot get all the
>>>> resources they ask for. If a Spark job starts with only a subset of
>>>> resources that it asks for, does it know to expand its resources later when
>>>> more resources become available?
>>>>
>>>
>>> Yes.
>>>
>>>
>>>>
>>>> Launch each executor with at least 1GB RAM, but if mesos offers 2GB at
>>>>> some moment, then launch an executor with 2GB RAM
>>>>
>>>>
>>>> This is less useful in our use case. But I am also quite interested in
>>>> cases in which this could be helpful. I think this will also help with
>>>> overall resource utilization on the cluster if, when another job starts up
>>>> that has a hard requirement on resources, the extra resources to the first
>>>> job can be flexibly re-allocated to the second job.
>>>>
>>>> On Sat, Jan 28, 2017 at 2:32 PM, Michael Gummelt <
>>>> mgumm...@mesosphere.io> wrote:
>>>>
>>>>> We've talked about that, but it hasn't become a priority because we
>>>>> haven't had a driving use case.  If anyone has a good argument for
>>>>> "variable" resource allocation like this, please let me know.
>>>>>
>>>>> On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin <linshuai2...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> An alternative behavior is to launch the job with the best resource
>>>>>>> offer Mesos is able to give
>>>>>>
>>>>>>
>>>>>> Michael has just made an excellent explanation about dynamic
>>>>>> allocation support in mesos. But IIUC, what you want to achieve is
>>>>>> something like (using RAM as an example) : "Launch each executor with at
>>>>>> least 1GB RAM, but if mesos offers 2GB at some moment, then launch an
>>>>>> executor with 2GB RAM".
>>>>>>
>>>>>> I wonder what's the benefit of that? To reduce the "resource
>>>>>> fragmentation"?
>>>>>>
>>>>>> Anyway, that is not supported at this moment. In all the supported
>>>>>> cluster managers of spark (mesos, yarn, standalone, and the up-to-coming
>>>>>> spark on kubernetes), you have to specify the cores and memory of each
>>>>>> executor.
>>>>>>
>>>>>> It may not be supported in the future, because only mesos has the
>>>>>> concepts of offers because of its two-level scheduling model.
>>>>>>
>>>>>>
>>>>>> On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan <ji...@drive.ai> wrote:
>>>>>>
>>>>>>> Dear Spark Users,
>>>>>>>
>>>>>>> Currently is there a way to dynamically allocate res

Re: Dynamic resource allocation to Spark on Mesos

2017-02-02 Thread Ji Yan
I have done an experiment on this today. It shows that only CPUs are
tolerant of insufficient cluster size when a job starts. On my cluster, I
have 180 GB of memory and 64 cores. When I run spark-submit (on Mesos)
with --cpu_cores set to 1000, the job starts up with 64 cores. But when I
set --memory to 200 GB, the job fails to start with "Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources"

Also it is confusing to me that --cpu_cores specifies the number of CPU
cores across all executors, but --memory specifies per-executor memory
requirement.
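For reference, the standard spark-submit flags draw the per-executor versus total
distinction explicitly, and --cpu_cores and --memory are not among them (as discussed
elsewhere in this thread). A minimal sketch with placeholder values, none taken from
this cluster:

  # --executor-memory       (spark.executor.memory) : memory PER executor
  # --total-executor-cores  (spark.cores.max)       : CPU cores across ALL executors
  # --executor-cores        (spark.executor.cores)  : CPU cores PER executor
  spark-submit \
    --master mesos://mesos-master:5050 \
    --executor-memory 4g \
    --total-executor-cores 64 \
    my_job.py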

On Mon, Jan 30, 2017 at 11:34 AM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

>
>
> On Mon, Jan 30, 2017 at 9:47 AM, Ji Yan <ji...@drive.ai> wrote:
>
>> Tasks begin scheduling as soon as the first executor comes up
>>
>>
>> Thanks all for the clarification. Is this the default behavior of Spark
>> on Mesos today? I think this is what we are looking for because sometimes a
>> job can take up lots of resources and later jobs cannot get all the
>> resources they ask for. If a Spark job starts with only a subset of
>> resources that it asks for, does it know to expand its resources later when
>> more resources become available?
>>
>
> Yes.
>
>
>>
>> Launch each executor with at least 1GB RAM, but if mesos offers 2GB at
>>> some moment, then launch an executor with 2GB RAM
>>
>>
>> This is less useful in our use case. But I am also quite interested in
>> cases in which this could be helpful. I think this will also help with
>> overall resource utilization on the cluster if, when another job starts up
>> that has a hard requirement on resources, the extra resources to the first
>> job can be flexibly re-allocated to the second job.
>>
>> On Sat, Jan 28, 2017 at 2:32 PM, Michael Gummelt <mgumm...@mesosphere.io>
>> wrote:
>>
>>> We've talked about that, but it hasn't become a priority because we
>>> haven't had a driving use case.  If anyone has a good argument for
>>> "variable" resource allocation like this, please let me know.
>>>
>>> On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin <linshuai2...@gmail.com>
>>> wrote:
>>>
>>>> An alternative behavior is to launch the job with the best resource
>>>>> offer Mesos is able to give
>>>>
>>>>
>>>> Michael has just made an excellent explanation about dynamic allocation
>>>> support in mesos. But IIUC, what you want to achieve is something like
>>>> (using RAM as an example) : "Launch each executor with at least 1GB RAM,
>>>> but if mesos offers 2GB at some moment, then launch an executor with 2GB
>>>> RAM".
>>>>
>>>> I wonder what's the benefit of that? To reduce the "resource fragmentation"?
>>>>
>>>> Anyway, that is not supported at this moment. In all the supported
>>>> cluster managers of spark (mesos, yarn, standalone, and the up-to-coming
>>>> spark on kubernetes), you have to specify the cores and memory of each
>>>> executor.
>>>>
>>>> It may not be supported in the future, because only mesos has the
>>>> concepts of offers because of its two-level scheduling model.
>>>>
>>>>
>>>> On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan <ji...@drive.ai> wrote:
>>>>
>>>>> Dear Spark Users,
>>>>>
>>>>> Currently, is there a way to dynamically allocate resources to Spark on
>>>>> Mesos? Within Spark we can specify the CPU cores and memory before running a
>>>>> job. The way I understand it, the Spark job will not run if the CPU/memory
>>>>> requirement is not met. This may lead to a decrease in overall utilization of
>>>>> the cluster. An alternative behavior is to launch the job with the best
>>>>> resource offer Mesos is able to give. Is this possible with the current
>>>>> implementation?
>>>>>
>>>>> Thanks
>>>>> Ji
>>>>>

Re: Dynamic resource allocation to Spark on Mesos

2017-01-30 Thread Ji Yan
>
> Tasks begin scheduling as soon as the first executor comes up


Thanks all for the clarification. Is this the default behavior of Spark on
Mesos today? I think this is what we are looking for because sometimes a
job can take up lots of resources and later jobs cannot get all the
resources they ask for. If a Spark job starts with only a subset of
resources that it asks for, does it know to expand its resources later when
more resources become available?

Launch each executor with at least 1GB RAM, but if mesos offers 2GB at some
> moment, then launch an executor with 2GB RAM


This is less useful in our use case. But I am also quite interested in
cases in which this could be helpful. I think this will also help with
overall resource utilization on the cluster if, when another job starts up
that has a hard requirement on resources, the extra resources given to the first
job can be flexibly re-allocated to the second job.

On Sat, Jan 28, 2017 at 2:32 PM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> We've talked about that, but it hasn't become a priority because we
> haven't had a driving use case.  If anyone has a good argument for
> "variable" resource allocation like this, please let me know.
>
> On Sat, Jan 28, 2017 at 9:17 AM, Shuai Lin <linshuai2...@gmail.com> wrote:
>
>> An alternative behavior is to launch the job with the best resource offer
>>> Mesos is able to give
>>
>>
>> Michael has just made an excellent explanation about dynamic allocation
>> support in mesos. But IIUC, what you want to achieve is something like
>> (using RAM as an example) : "Launch each executor with at least 1GB RAM,
>> but if mesos offers 2GB at some moment, then launch an executor with 2GB
>> RAM".
>>
>> I wonder what's the benefit of that? To reduce the "resource fragmentation"?
>>
>> Anyway, that is not supported at this moment. In all the supported
>> cluster managers of spark (mesos, yarn, standalone, and the up-to-coming
>> spark on kubernetes), you have to specify the cores and memory of each
>> executor.
>>
>> It may not be supported in the future, because only mesos has the
>> concepts of offers because of its two-level scheduling model.
>>
>>
>> On Sat, Jan 28, 2017 at 1:35 AM, Ji Yan <ji...@drive.ai> wrote:
>>
>>> Dear Spark Users,
>>>
>>> Currently, is there a way to dynamically allocate resources to Spark on
>>> Mesos? Within Spark we can specify the CPU cores and memory before running a
>>> job. The way I understand it, the Spark job will not run if the CPU/memory
>>> requirement is not met. This may lead to a decrease in overall utilization of
>>> the cluster. An alternative behavior is to launch the job with the best
>>> resource offer Mesos is able to give. Is this possible with the current
>>> implementation?
>>>
>>> Thanks
>>> Ji
>>>
>>>
>>
>>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>



Dynamic resource allocation to Spark on Mesos

2017-01-27 Thread Ji Yan
Dear Spark Users,

Currently, is there a way to dynamically allocate resources to Spark on
Mesos? Within Spark we can specify the CPU cores and memory before running a
job. The way I understand it, the Spark job will not run if the CPU/memory
requirement is not met. This may lead to a decrease in overall utilization of
the cluster. An alternative behavior would be to launch the job with the best
resource offer Mesos is able to give. Is this possible with the current
implementation?

Thanks
Ji
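For reference, Spark does ship a dynamic allocation mode for Mesos, which is the
mechanism the replies in this thread discuss; it grows and shrinks the executor count
rather than resizing individual executors. A minimal sketch of enabling it (standard
Spark configuration keys; the executor bounds are placeholders, and the shuffle-service
script name is recalled from Spark's sbin directory, so verify it against your version):

  # Prerequisite: the Mesos external shuffle service must be running on each agent,
  # e.g. via sbin/start-mesos-shuffle-service.sh from the Spark distribution.
  spark-submit \
    --master mesos://mesos-master:5050 \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    --conf spark.dynamicAllocation.minExecutors=1 \
    --conf spark.dynamicAllocation.maxExecutors=20 \
    my_job.py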



Force mesos to provide GPUs to Spark

2017-01-20 Thread Ji Yan
Dear Spark Users,

With the latest versions of Spark and Mesos with GPU support, is there a way
to guarantee a Spark job a specified number of GPUs? Currently the Spark
job sets "spark.mesos.gpus.max" to ask for GPU resources; however, this is
an upper bound, which means that Spark will accept a Mesos offer even if no
GPUs are available in the offer. Can we make this explicit so as to guarantee
GPU resource offers like other resources, i.e. CPU/memory?

Thanks
Ji
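For reference, the property in question is set like any other Spark configuration key;
a minimal sketch (the GPU count and master URL are placeholders, and, as noted above,
the setting acts as an upper bound rather than a guarantee):

  # spark.mesos.gpus.max caps the GPUs Spark will accept for the job;
  # it does not force Mesos to offer that many.
  spark-submit \
    --master mesos://mesos-master:5050 \
    --conf spark.mesos.gpus.max=2 \
    my_job.py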



Re: launch spark on mesos within a docker container

2016-12-30 Thread Ji Yan
Thanks Timothy,

Setting these four environment variables as you suggested got Spark
running:

LIBPROCESS_ADVERTISE_IP=
LIBPROCESS_ADVERTISE_PORT=40286
LIBPROCESS_IP=0.0.0.0
LIBPROCESS_PORT=40286
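For anyone hitting the same issue, a rough sketch of the kind of container launch these
settings imply; only the environment variable names and port numbers come from this
thread, while the HOST_IP value, image name, and docker flags are illustrative
assumptions:

  # HOST_IP: an address the Mesos master and agents can reach the driver container on.
  HOST_IP=192.0.2.10   # placeholder
  docker run --rm -it \
    -e LIBPROCESS_IP=0.0.0.0 \
    -e LIBPROCESS_PORT=40286 \
    -e LIBPROCESS_ADVERTISE_IP="$HOST_IP" \
    -e LIBPROCESS_ADVERTISE_PORT=40286 \
    -p 40284:40284 -p 40285:40285 -p 40286:40286 \
    my-spark-driver-image:latest

The -p mappings cover the libprocess port plus the spark.driver.port and
spark.blockManager.port values pinned in the spark-submit quoted further down.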

After that, it seems that Spark cannot accept any offer from Mesos. If I
run the same script outside the Docker container, Spark can get resources
and the Spark job runs successfully to the end.

Here is the mesos master log for running the Spark job inside the Docker
container

I1230 14:29:55.710973  9557 master.cpp:2500] Subscribing framework eval.py
with checkpointing disabled and capabilities [ GPU_RESOURCES ]

I1230 14:29:55.712379  9567 hierarchical.cpp:271] Added framework
993198d1-7393-4656-9f75-4f22702609d0-0251

I1230 14:29:55.713717  9550 master.cpp:5709] Sending 1 offers to framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286

I1230 14:29:55.829774  9549 master.cpp:3951] Processing DECLINE call for
offers: [ 993198d1-7393-4656-9f75-4f22702609d0-O1384 ] for framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286

I1230 14:30:01.055359  9569 http.cpp:381] HTTP GET for /master/state from
172.16.8.140:49406 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95
Safari/537.36'

I1230 14:30:01.457598  9553 master.cpp:5709] Sending 1 offers to framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286

I1230 14:30:01.463732  9542 master.cpp:3951] Processing DECLINE call for
offers: [ 993198d1-7393-4656-9f75-4f22702609d0-O1385 ] for framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286

I1230 14:30:02.300915  9562 http.cpp:381] HTTP GET for /master/state from
172.16.1.58:62629 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95
Safari/537.36'

I1230 14:30:03.847647  9553 http.cpp:381] HTTP GET for /master/state from
172.16.8.140:49406 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95
Safari/537.36'

I1230 14:30:04.431270  9551 http.cpp:381] HTTP GET for /master/state from
172.16.1.58:62629 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95
Safari/537.36'

I1230 14:30:07.465801  9549 master.cpp:5709] Sending 1 offers to framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286

I1230 14:30:07.470860  9542 master.cpp:3951] Processing DECLINE call for
offers: [ 993198d1-7393-4656-9f75-4f22702609d0-O1386 ] for framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286

I1230 14:30:11.077518  9572 http.cpp:381] HTTP GET for /master/state from
172.16.8.140:59764 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95
Safari/537.36'

I1230 14:30:12.387562  9560 http.cpp:381] HTTP GET for /master/state from
172.16.1.58:62629 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X
10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95
Safari/537.36'

I1230 14:30:12.473937  9572 master.cpp:5709] Sending 1 offers to framework
993198d1-7393-4656-9f75-4f22702609d0-0251 (eval.py) at
scheduler-9300fd07-7cf5-4341-84c9-4f1930e8c145@172.16.1.101:40286


On Fri, Dec 30, 2016 at 1:35 PM, Timothy Chen <tnac...@gmail.com> wrote:

> Hi Ji,
>
> One way to make it fixed is to set LIBPROCESS_PORT environment variable on
> the executor when it is launched.
>
> Tim
>
>
> On Dec 30, 2016, at 1:23 PM, Ji Yan <ji...@drive.ai> wrote:
>
> Dear Spark Users,
>
> We are trying to launch Spark on Mesos from within a Docker container. We
> have found that since the Spark executors need to talk back to the Spark
> driver, there is a need to do a lot of port mapping to make that happen. We
> seem to have mapped the ports based on what we could find on the
> Spark configuration documentation page.
>
> spark-2.1.0-bin-spark-2.1/bin/spark-submit \
>>   --conf 'spark.driver.host'= \
>>   --conf 'spark.blockManager.port'='40285' \
>>   --conf 'spark.driver.bindAddress'='0.0.0.0' \
>>   --conf 'spark.driver.port'='40284' \
>>   --conf 'spark.mesos.executor.docker.volumes'='spark-2.1.0-bin-spark-2.1:/spark-2.1.0-bin-spark-2.1' \
>>   --conf 'spark.mesos.gpus.max'='2' \
>>   --conf 'spark.mesos.containerizer'='docker' \
>>   --conf 'spark.mesos.executor.docker.image'='docker.drive.ai/spark_gpu_experiment:latest' \

launch spark on mesos within a docker container

2016-12-30 Thread Ji Yan
Dear Spark Users,

We are trying to launch Spark on Mesos from within a Docker container. We
have found that since the Spark executors need to talk back to the Spark
driver, there is a need to do a lot of port mapping to make that happen. We
seem to have mapped the ports based on what we could find on the
Spark configuration documentation page.

spark-2.1.0-bin-spark-2.1/bin/spark-submit \
>   --conf 'spark.driver.host'= \
>   --conf 'spark.blockManager.port'='40285' \
>   --conf 'spark.driver.bindAddress'='0.0.0.0' \
>   --conf 'spark.driver.port'='40284' \
>   --conf 'spark.mesos.executor.docker.volumes'='spark-2.1.0-bin-spark-2.1:/spark-2.1.0-bin-spark-2.1' \
>   --conf 'spark.mesos.gpus.max'='2' \
>   --conf 'spark.mesos.containerizer'='docker' \
>   --conf 'spark.mesos.executor.docker.image'='docker.drive.ai/spark_gpu_experiment:latest' \
>   --master 'mesos://mesos_master_dev:5050' \
>   -v eval.py


When we launched Spark this way, from the Mesos master log it seems that
the Mesos master is trying to make the offer back to the framework at port
33978, which turns out to be a dynamic port. The job failed at this point
because it looks like the offer cannot reach back to the container. In
order to expose that port in the container, we'll need to make it fixed
first; does anyone know how to make that port fixed in the Spark
configuration? Any other advice on how to launch Spark on Mesos from within
a Docker container is greatly appreciated.

I1230 12:53:54.758297  9571 master.cpp:2424] Received SUBSCRIBE call
for framework 'eval.py' at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
I1230 12:53:54.758608  9571 master.cpp:2500] Subscribing framework
eval.py with checkpointing disabled and capabilities [ GPU_RESOURCES ]
I1230 12:53:54.760036  9569 hierarchical.cpp:271] Added framework
993198d1-7393-4656-9f75-4f22702609d0-0233
I1230 12:53:54.761533  9549 master.cpp:5709] Sending 1 offers to framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@:33978
E1230 12:53:57.757814  9573 process.cpp:2105] Failed to shutdown
socket with fd 22: Transport endpoint is not connected
I1230 12:53:57.758314  9543 master.cpp:1284] Framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
disconnected
I1230 12:53:57.758378  9543 master.cpp:2725] Disconnecting framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
I1230 12:53:57.758411  9543 master.cpp:2749] Deactivating framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
I1230 12:53:57.758582  9548 hierarchical.cpp:382] Deactivated
framework 993198d1-7393-4656-9f75-4f22702609d0-0233
W1230 12:53:57.758915  9543 master.hpp:2113] Master attempted to send
message to disconnected framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
I1230 12:53:57.759140  9543 master.cpp:1297] Giving framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978 0ns
to failover
I1230 12:53:57.760573  9561 master.cpp:5561] Framework failover
timeout, removing framework 993198d1-7393-4656-9f75-4f22702609d0-0233
(eval.py) at scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
I1230 12:53:57.760648  9561 master.cpp:6296] Removing framework
993198d1-7393-4656-9f75-4f22702609d0-0233 (eval.py) at
scheduler-8a94bc86-c2b3-4c7d-bee7-cfddc8e9a8da@172.17.0.12:33978
I1230 12:53:57.761493  9571 hierarchical.cpp:333] Removed framework
993198d1-7393-4656-9f75-4f22702609d0-0233



Re: Spark/Mesos with GPU support

2016-12-30 Thread Ji Yan
Thanks Michael. Tim and I have touched base, and thankfully the issue has
already been resolved.

On Fri, Dec 30, 2016 at 9:20 AM, Michael Gummelt <mgumm...@mesosphere.io>
wrote:

> I've cc'd Tim and Kevin, who worked on GPU support.
>
> On Wed, Dec 28, 2016 at 11:22 AM, Ji Yan <ji...@drive.ai> wrote:
>
>> Dear Spark Users,
>>
>> Has anyone had successful experience running Spark on Mesos with GPU
>> support? We have a Mesos cluster that can see and offer nvidia GPU
>> resources. With Spark, it seems that the GPU support with Mesos (
>> https://github.com/apache/spark/pull/14644) has only recently been
>> merged into Spark master, which is not in the 2.0.2 release yet. We have a
>> custom-built Spark from 2.1-rc5 which is confirmed to have the above
>> change. However, when we try to run any code from Spark on this Mesos setup,
>> the Spark program hangs and keeps saying
>>
>> “WARN TaskSchedulerImpl: Initial job has not accepted any resources;
>> check your cluster UI to ensure that workers are registered and have
>> sufficient resources”
>>
>> We are pretty sure that the cluster has enough resources as there is
>> nothing running on it. If we disable the GPU support in the configuration and
>> restart Mesos and retry the same program, it works.
>>
>> Any comment/advice on this greatly appreciated
>>
>> Thanks,
>> Ji
>>
>>
>>
>
>
>
> --
> Michael Gummelt
> Software Engineer
> Mesosphere
>



Spark/Mesos with GPU support

2016-12-28 Thread Ji Yan
Dear Spark Users,

Has anyone had successful experience running Spark on Mesos with GPU support?
We have a Mesos cluster that can see and offer Nvidia GPU resources. With
Spark, it seems that the GPU support with Mesos
(https://github.com/apache/spark/pull/14644) has only recently been merged
into Spark master, which is not in the 2.0.2 release yet. We have a
custom-built Spark from 2.1-rc5 which is confirmed to have the above change.
However, when we try to run any code from Spark on this Mesos setup, the Spark
program hangs and keeps saying

“WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your
cluster UI to ensure that workers are registered and have sufficient resources”

We are pretty sure that the cluster has enough resources as there is nothing
running on it. If we disable the GPU support in the configuration and restart
Mesos and retry the same program, it works.

Any comment/advice on this greatly appreciated

Thanks,
Ji

