Spark allows you to configure the resources for the worker process. If I remember correctly, you can use SPARK_DAEMON_MEMORY to control the memory allocated to the worker process.
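As an illustration, the worker daemon's heap can be set in conf/spark-env.sh. This is only a sketch; the 2g value is an example, not a recommendation:

```shell
# conf/spark-env.sh (standalone mode)
# SPARK_DAEMON_MEMORY sets the heap for the standalone daemons
# (master and worker processes), not for executors.
export SPARK_DAEMON_MEMORY=2g
```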
#1 below is more appropriate if you will be running just one application at a time. However, a 32GB heap is still too large; depending on the garbage collector, you may see long pauses.

Mohammed
Author: Big Data Analytics with Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini [mailto:captainfr...@gmail.com]
Sent: Wednesday, May 4, 2016 7:40 AM
To: user
Subject: Re: Spark standalone workers, executors and JVMs

Hi Mohammed,

Thanks for your reply. I agree with you; however, a single application can use multiple executors as well, so I am still not clear which option is best. Let me give a concrete example. Say I am running only a single application, and assume again that I have 192GB of memory and 24 cores on each node. Which of the following two options is better, and why?

1. Run 6 workers with 32GB each and 1 executor per worker (i.e. set SPARK_WORKER_INSTANCES=6 and leave spark.executor.cores at its default, which in standalone mode assigns all available cores to the executor).
2. Run 1 worker with 192GB of memory and 6 executors per worker (i.e. SPARK_WORKER_INSTANCES=1, spark.executor.cores=5, spark.executor.memory=32GB).

One more question: I understand that workers and executors are separate processes. How many resources does the worker process itself actually use, and how do I set them? As far as I understand, the worker does not need many resources, since it only spawns executors. Is that correct?

Thanks,
Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini

On Mon, May 2, 2016 at 7:47 PM, Mohammed Guller <moham...@glassbeam.com> wrote:

The workers and executors run as separate JVM processes in standalone mode. Whether to run multiple workers on a single machine depends on how you will be using the cluster. If you run multiple Spark applications simultaneously, each application gets its own executors.
So, for example, if you allocate 8GB to each application, you can run 192/8 = 24 Spark applications simultaneously (assuming you also have a large number of cores). Each executor has only an 8GB heap, so GC should not be an issue.

Alternatively, if you know that only a few applications will be running simultaneously on that cluster, running multiple workers on each machine will allow you to avoid the GC issues associated with allocating a large heap to a single JVM process. This option lets you run multiple executors for an application on a single machine, with each executor configured with an optimal amount of memory.

Mohammed
Author: Big Data Analytics with Spark<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Simone Franzini [mailto:captainfr...@gmail.com]
Sent: Monday, May 2, 2016 9:27 AM
To: user
Subject: Fwd: Spark standalone workers, executors and JVMs

I am still a little confused about workers, executors and JVMs in standalone mode. Are worker processes and executors independent JVMs, or do executors run within the worker JVM? I have some memory-rich nodes (192GB) and I would like to avoid deploying massive JVMs due to well-known performance issues (GC and such). As of Spark 1.4 it is possible to deploy either multiple workers (SPARK_WORKER_INSTANCES + SPARK_WORKER_CORES) or multiple executors per worker (--executor-cores). Which option is preferable, and why?

Thanks,
Simone Franzini, PhD
http://www.linkedin.com/in/simonefranzini
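The two options discussed in this thread could be sketched as standalone-mode configuration roughly as follows. This mirrors the numbers above (192GB of memory, 24 cores per node) and is only an illustrative sketch, not a recommendation:

```shell
# Option 1: six workers per node, one executor per worker.
# conf/spark-env.sh on each node:
export SPARK_WORKER_INSTANCES=6
export SPARK_WORKER_MEMORY=32g
export SPARK_WORKER_CORES=4     # 24 cores / 6 workers

# Option 2: one worker per node, several executors per worker.
# conf/spark-env.sh on each node:
export SPARK_WORKER_INSTANCES=1
export SPARK_WORKER_MEMORY=192g
export SPARK_WORKER_CORES=24

# For option 2, size the executors at submit time so that
# several of them fit inside the single worker:
# spark-submit --conf spark.executor.cores=5 \
#              --conf spark.executor.memory=32g ...
```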