Re: Limit pyspark.daemon threads

2016-06-17 Thread agateaaa
There is only one executor on each worker. I see one pyspark.daemon, but when the streaming job starts a batch I see that it spawns 4 other pyspark.daemon processes. After the batch completes, the 4 pyspark.daemon processes die and there is only one left. I think this behavior was introduced by

Re: Limit pyspark.daemon threads

2016-06-15 Thread Jeff Zhang
>>> I am seeing this issue too with pyspark (using Spark 1.6.1). I have set spark.executor.cores to 1, but whenever a streaming batch starts processing data, I see python -m pyspark.daemon processes increase gradually to about 5 (increasing CPU% on a box about 4-5 times, each

Re: Limit pyspark.daemon threads

2016-06-15 Thread Sudhir Babu Pothineni
Hi Ken, It may also be related to Grid Engine job scheduling. If it is 16 cores (virtual cores?), Grid Engine allocates 16 slots; if you use 'max' scheduling, it will send 16 processes sequentially to the same machine, and on top of that each Spark job has its own executors. Limit the number of jobs

Re: Limit pyspark.daemon threads

2016-06-15 Thread agateaaa
> Thx Gene! But my concern is with CPU usage, not memory. I want to see if there is any way to control the number of pyspark.daemon processes that get spawned. We

RE: Limit pyspark.daemon threads

2016-06-15 Thread David Newberger
Sent: Wednesday, June 15, 2016 4:39 PM To: Gene Pang Cc: Sven Krasser; Carlile, Ken; user Subject: Re: Limit pyspark.daemon threads > Thx Gene! But my concern is with CPU usage, not memory. I want to see if there is any way to control the number of pyspark.daemon processes that get

Re: Limit pyspark.daemon threads

2016-06-15 Thread agateaaa
Thx Gene! But my concern is with CPU usage, not memory. I want to see if there is any way to control the number of pyspark.daemon processes that get spawned. We have some restrictions on the number of CPUs we can use on a node, and the number of pyspark.daemon processes that get created doesn't seem to honor
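For reference, a minimal sketch (not from the thread) of how the settings discussed here might be applied from PySpark; the app name and values are illustrative, and on a standalone cluster spark.cores.max is the usual way to cap an application's total cores:

```python
# Minimal sketch: capping the cores a PySpark application may use, which
# indirectly bounds how many pyspark.daemon worker processes run at once.
# Values are illustrative only.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("limit-pyspark-daemons")
        .set("spark.executor.cores", "1")   # worker threads per executor (JVM side)
        .set("spark.cores.max", "4"))       # total cores the app may take (standalone mode)

sc = SparkContext(conf=conf)
# Each executor worker thread talks to a Python worker forked by pyspark.daemon,
# so fewer executor cores generally means fewer concurrent Python workers.
```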

Re: Limit pyspark.daemon threads

2016-06-15 Thread Gene Pang
As Sven mentioned, you can use Alluxio to store RDDs in off-heap memory, and you can then share that RDD across different jobs. If you would like to run Spark on Alluxio, this documentation can help: http://www.alluxio.org/documentation/master/en/Running-Spark-on-Alluxio.html Thanks, Gene On
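As a rough illustration of Gene's suggestion, here is a sketch of sharing data between jobs through an Alluxio path (the master host, port, and path below are hypothetical); it treats alluxio:// as a Hadoop-compatible filesystem, which requires the Alluxio client jar on the Spark classpath per the linked documentation:

```python
# Sketch of sharing an RDD's data between jobs via Alluxio by path.
# Host, port, and path are hypothetical placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="producer")
rdd = sc.parallelize(range(100))
rdd.saveAsTextFile("alluxio://alluxio-master:19998/shared/numbers")

# A separate job (even a different SparkContext) can then read it back:
shared = sc.textFile("alluxio://alluxio-master:19998/shared/numbers")
print(shared.count())
```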

Re: Limit pyspark.daemon threads

2016-06-14 Thread agateaaa
Hi, I am seeing this issue too with pyspark (using Spark 1.6.1). I have set spark.executor.cores to 1, but whenever a streaming batch starts processing data, I see python -m pyspark.daemon processes increase gradually to about 5 (increasing CPU% on a box about 4-5 times, each

Re: Limit pyspark.daemon threads

2016-03-26 Thread Sven Krasser
Hey Ken, 1. You're correct, cached RDDs live on the JVM heap. (There's an off-heap storage option using Alluxio, formerly Tachyon, with which I have no experience, however.) 2. The worker memory setting is unfortunately not a hard maximum. What happens is that during aggregation the Python daemon

Re: Limit pyspark.daemon threads

2016-03-26 Thread Carlile, Ken
This is extremely helpful! I’ll have to talk to my users about how the python memory limit should be adjusted and what their expectations are. I’m fairly certain we bumped it up in the dark past when jobs were failing because of insufficient memory for the python processes.  So

Re: Limit pyspark.daemon threads

2016-03-26 Thread Sven Krasser
My understanding is that the spark.executor.cores setting controls the number of worker threads in the executor in the JVM. Each worker thread then communicates with a pyspark daemon process (these are not threads) to stream data into Python. There should be one daemon process per worker thread
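To see this mapping on a node, a small sketch (my own, not from the thread) that counts pyspark.daemon processes while a batch runs; it assumes a Linux host with pgrep installed:

```python
# Rough monitoring aid, not part of any Spark API: count the Python daemon
# processes on a worker node to compare against spark.executor.cores.
import subprocess

def pyspark_daemon_count():
    try:
        out = subprocess.check_output(["pgrep", "-fc", "pyspark.daemon"])
        return int(out.strip())
    except subprocess.CalledProcessError:
        return 0  # pgrep exits non-zero when nothing matches

print("pyspark.daemon processes:", pyspark_daemon_count())
```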

Re: Limit pyspark.daemon threads

2016-03-26 Thread Carlile, Ken
Thanks, Sven! I know that I’ve messed up the memory allocation, but I’m trying not to think too much about that (because I’ve advertised it to my users as “90GB for Spark works!” and that’s how it displays in the Spark UI, totally ignoring the python processes). So I’ll need to deal

Re: Limit pyspark.daemon threads

2016-03-25 Thread Sven Krasser
Hey Ken, I also frequently see more pyspark daemons than the configured concurrency; often it's a low multiple. (There was an issue pre-1.3.0 that caused this to be quite a bit higher, so make sure you at least have a recent version; see SPARK-5395.) Each pyspark daemon tries to stay below the

Re: Limit pyspark.daemon threads

2016-03-25 Thread Carlile, Ken
Further data on this. I’m watching another job right now where there are 16 pyspark.daemon threads, all of which are trying to get a full core (remember, this is a 16-core machine). Unfortunately, the java process actually running the spark worker is trying to take several cores of its

Re: Limit pyspark.daemon threads

2016-03-21 Thread Carlile, Ken
No further input on this? I discovered today that the pyspark.daemon thread count was actually 48, which makes a little more sense (at least it’s a multiple of 16), and it seems to be happening at the reduce and collect portions of the code. —Ken On Mar 17, 2016, at 10:51 AM,

Re: Limit pyspark.daemon threads

2016-03-20 Thread Ted Yu
I took a look at docs/configuration.md. Though I didn't find an answer for your first question, I think the following pertains to your second question: spark.python.worker.memory 512m Amount of memory to use per python worker process during aggregation, in the same format as JVM
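A short sketch of setting that property per application from PySpark; the 1g value is purely illustrative, and per the quoted doc this bounds aggregation memory per Python worker (spilling to disk above it) rather than acting as a hard cap:

```python
# Sketch: raising the per-Python-worker aggregation memory quoted above from
# docs/configuration.md. The 1g value is illustrative only.
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("python-worker-memory")
        .set("spark.python.worker.memory", "1g"))

sc = SparkContext(conf=conf)
```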

Re: Limit pyspark.daemon threads

2016-03-18 Thread Carlile, Ken
Thanks! I found that part just after I sent the email… whoops. I’m guessing that’s not an issue for my users, since it’s been set that way for a couple of years now.  The thread count is definitely an issue, though, since if enough nodes go down, they can’t schedule their spark

Limit pyspark.daemon threads

2016-03-18 Thread Carlile, Ken
Hello, We have an HPC cluster that we run Spark jobs on using standalone mode and a number of scripts I’ve built up to dynamically schedule and start spark clusters within the Grid Engine framework. Nodes in the cluster have 16 cores and 128GB of RAM. My users use pyspark heavily. We’ve

Re: Limit pyspark.daemon threads

2016-03-18 Thread Sea
> I took a look at docs/configuration.md. Though I didn't find an answer for your first question, I think the following pertains to your second question: spark.python.worker.memory 512m A