Hello Jey,

Thank you for answering. I have found that there are about 6 or 7 'daemon.py' processes on one worker node. Does each core get its own 'daemon.py' process? How is the number of 'daemon.py' processes on a worker node decided? I have also found many Spark-related java processes on a worker node; if the java process on a worker node is only responsible for communication, why does Spark need so many java processes?

Overall, I think the main problem with my program is memory allocation. More specifically, in spark-env.sh there are two options, SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS, and I can also set spark.executor.memory in SPARK_JAVA_OPTS. If I have 68g of memory on a worker node, how should I distribute memory among these options? At present, I use the default values for SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS and set spark.executor.memory to 20g. It seems that Spark counts RDD storage against spark.executor.memory, and I find that each 'daemon.py' also consumes about 7g of memory. After my program has been running for a while, it uses up all the memory on a worker node and the master node reports connection errors. (I have 5 worker nodes, each with 8 cores.)

So I am a little confused about what each of the three options is responsible for and how to distribute memory among them. Any suggestion will be appreciated. Thanks!
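For reference, my spark-env.sh on the worker nodes currently looks roughly like the sketch below. Only spark.executor.memory=20g is something I set explicitly; the commented lines are just placeholders showing where the daemon settings would go if I changed them from their defaults.

    # spark-env.sh (each worker node has 68g of RAM and 8 cores)

    # Left at their defaults -- as I understand it, these size the JVMs of the
    # standalone master/worker daemons themselves, not the executors:
    # export SPARK_DAEMON_MEMORY=...
    # export SPARK_DAEMON_JAVA_OPTS=...

    # Executor memory is the only value I set, passed as a Java system property:
    export SPARK_JAVA_OPTS="-Dspark.executor.memory=20g"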
Best,
Shangyu

2013/10/8 Jey Kottalam <j...@cs.berkeley.edu>

> Hi Shangyu,
>
> The daemon.py python process is the actual PySpark worker process, and
> is launched by the Spark worker when running Python jobs. So, when
> using PySpark, the "real computation" is handled by a python process
> (via daemon.py), not a java process.
>
> Hope that helps,
> -Jey
>
> On Mon, Oct 7, 2013 at 9:50 PM, Shangyu Luo <lsy...@gmail.com> wrote:
> > Hello!
> > I am using Spark 0.7.3 with the Python API. Recently, when I ran some
> > Spark programs on a cluster, I found that some processes invoked by
> > spark-0.7.3/python/pyspark/daemon.py would hold the CPU for a long
> > time and consume a lot of memory (e.g., 5g per process). It seemed
> > that the java process, which was invoked by
> > java -cp :/usr/lib/spark-0.7.3/conf:/usr/lib/spark-0.7.3/core/target/scala-2.9.3/classes ...,
> > was 'competing' with daemon.py for CPU resources. From my
> > understanding, the java process should be responsible for the 'real'
> > computation in Spark.
> > So I am wondering what work daemon.py does. Is it normal for it
> > to consume a lot of CPU and memory?
> > Thanks!
> >
> > Best,
> > Shangyu Luo
> > --
> > Shangyu, Luo
> > Department of Computer Science
> > Rice University

--
Shangyu, Luo
Department of Computer Science
Rice University
-- Not Just Think About It, But Do It!
-- Success is never final.
-- Losers always whine about their best