RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
Can anyone share any thoughts related to my questions? Thanks. From: java8...@hotmail.com To: user@spark.apache.org Subject: Help me understand the partition, parallelism in Spark Date: Wed, 25 Feb 2015 21:58:55 -0500 Hi, Sparkers: I come from the Hadoop MapReduce world, and am trying to understand …

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Imran Rashid
Hi Yong, mostly correct except for: "Since we are doing reduceByKey, shuffling will happen. Data will be shuffled into 1000 partitions, as we have 1000 unique keys." No, you will not get 1000 partitions. Spark has to decide how many partitions to use before it even knows how many …
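
A minimal sketch of the point Imran is making (not from the thread itself; the key count and numbers are made up): the post-shuffle partition count for reduceByKey comes from the configured parallelism or an explicit numPartitions argument, never from the number of distinct keys.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the shuffle partition count is not derived from the number of keys.
val conf = new SparkConf().setAppName("reduceByKey-partitions").setMaster("local[4]")
val sc = new SparkContext(conf)

// 1000 distinct keys, but that alone does not give 1000 output partitions.
val pairs = sc.parallelize(1 to 100000, 8).map(i => (i % 1000, 1))

val byDefault = pairs.reduceByKey(_ + _)
println(byDefault.partitions.length)    // follows spark.default.parallelism (or the parent RDD's partitioning)

val byExplicit = pairs.reduceByKey(_ + _, 200)
println(byExplicit.partitions.length)   // 200, because we asked for it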

RE: Help me understand the partition, parallelism in Spark

2015-02-26 Thread java8964
… lower memory usage vs. speed. Hope my understanding is correct. Thanks, Yong. Date: Thu, 26 Feb 2015 17:03:20 -0500 Subject: Re: Help me understand the partition, parallelism in Spark From: yana.kadiy...@gmail.com To: iras...@cloudera.com CC: java8...@hotmail.com; user@spark.apache.org Imran, I have …

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Yana Kadiyska
Imran, I have also observed the phenomenon of reducing the cores helping with OOM. I wanted to ask this (hopefully without straying off topic): we can specify the number of cores and the executor memory. But we don't get to specify _how_ the cores are spread among executors. Is it possible that …
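
For context, a hedged sketch of the knobs being discussed (the values are placeholders, not recommendations): you can cap the number of concurrent tasks per executor JVM and the heap each executor gets; fewer concurrent tasks sharing one heap is why dropping cores can relieve OOM.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values; tune for your cluster.
val conf = new SparkConf()
  .setAppName("core-memory-tuning")
  .set("spark.executor.memory", "4g")   // heap per executor JVM
  .set("spark.executor.cores", "2")     // concurrent tasks per executor (YARN)
  .set("spark.cores.max", "16")         // total cores for the app (standalone/Mesos)

val sc = new SparkContext(conf)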

Re: Help me understand the partition, parallelism in Spark

2015-02-26 Thread Zhan Zhang
Here is my understanding. When running on top of YARN, the number of cores means the number of tasks that can run concurrently in one executor, and all of those cores live in the same JVM. Parallelism typically controls the balance of tasks. For example, if you have 200 cores but only 50 partitions, there will be 150 …
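
An illustrative sketch of that mismatch (the input path and numbers are hypothetical): with 50 partitions, at most 50 tasks can run at once, so raising the partition count is what puts the remaining cores to work.

// Assumes an existing SparkContext `sc`; path and numbers are placeholders.
val lines = sc.textFile("hdfs:///some/input")
val coarse = lines.map(l => (l.take(1), 1)).reduceByKey(_ + _, 50)
println(coarse.partitions.length)   // 50: at most 50 tasks in flight, 150 of 200 cores idle

// Either repartition an existing RDD...
val finer = coarse.repartition(200)
// ...or request more partitions up front so the shuffle produces them directly.
val direct = lines.map(l => (l.take(1), 1)).reduceByKey(_ + _, 200)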

Help me understand the partition, parallelism in Spark

2015-02-25 Thread java8964
Hi, Sparkers: I come from the Hadoop MapReduce world and am trying to understand some of Spark's internals. From the web and this list, I keep seeing people talk about increasing the parallelism if you get an OOM error. I have tried to read the documentation as much as possible to understand the RDD …
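
For readers following along, a hedged sketch of what "increase the parallelism" usually means in practice (the values and path below are placeholders): more, smaller partitions, so each task holds less data in memory at a time.

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder values and path.
val conf = new SparkConf()
  .setAppName("more-parallelism")
  .set("spark.default.parallelism", "400")          // default partition count for shuffles

val sc = new SparkContext(conf)

val lines = sc.textFile("hdfs:///some/input", 400)  // ask for more input splits up front
val counts = lines.flatMap(_.split("\\s+"))
  .map(w => (w, 1))
  .reduceByKey(_ + _, 400)                          // and more partitions on the shuffle side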