Can anyone share any thoughts on my questions?
Thanks
From: java8...@hotmail.com
To: user@spark.apache.org
Subject: Help me understand the partition, parallelism in Spark
Date: Wed, 25 Feb 2015 21:58:55 -0500
Hi, Sparkers:
I come from the Hadoop MapReduce world and am trying to understand
Hi Yong,
mostly correct, except for:
- Since we are doing reduceByKey, shuffling will happen. Data will be
shuffled into 1000 partitions, as we have 1000 unique keys.
No, you will not get 1000 partitions. Spark has to decide how many
partitions to use before it even knows how many unique keys there are.
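To make that concrete, here is a rough sketch (the input path and numbers are made up, and it assumes a spark-shell sc):

    import org.apache.spark.SparkContext._ // pair-RDD implicits on older Spark

    // Word-count-style pairs; the path is illustrative.
    val pairs = sc.textFile("hdfs:///some/input")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))

    // With no explicit argument, reduceByKey takes its partition count
    // from the upstream RDD (or spark.default.parallelism, if set).
    // It never counts the unique keys first.
    val counts = pairs.reduceByKey(_ + _)

    // To actually get 1000 reduce-side partitions, you have to ask:
    val counts1000 = pairs.reduceByKey(_ + _, numPartitions = 1000)

So 1000 partitions only happen if you pass that number in, or if your default parallelism happens to be 1000, regardless of how many unique keys show up.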
lower memory usage vs. speed. Hope my understanding is correct.
Thanks
Yong
Date: Thu, 26 Feb 2015 17:03:20 -0500
Subject: Re: Help me understand the partition, parallelism in Spark
From: yana.kadiy...@gmail.com
To: iras...@cloudera.com
CC: java8...@hotmail.com; user@spark.apache.org
Imran, I have also observed the phenomenon of reducing the cores helping
with OOM. I wanted to ask this (hopefully without straying off topic): we
can specify the number of cores and the executor memory, but we don't get
to specify _how_ the cores are spread among executors.
Is it possible that the way the cores get packed into executors is what
matters for OOM, since tasks running in the same executor share a single
JVM heap?
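For reference, a minimal sketch of the knobs I am aware of on YARN (the app name and values are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only. Cores per executor and the number of
    // executors are set separately; together they determine how the
    // total cores are spread across JVMs.
    val conf = new SparkConf()
      .setAppName("core-layout-sketch")       // hypothetical app name
      .set("spark.executor.instances", "10")  // 10 executor JVMs (YARN)
      .set("spark.executor.cores", "4")       // 4 concurrent tasks each
      .set("spark.executor.memory", "8g")     // one 8g heap per executor
    val sc = new SparkContext(conf)

With the same 40 total cores you could instead ask for 20 executors with 2 cores each: each concurrent task then shares its heap with one neighbor instead of three.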
Here is my understanding.
When running on top of YARN, the number of cores is the number of tasks
that can run concurrently in one executor, and all of those tasks run in
the same JVM.
Parallelism typically controls the balance of tasks. For example, if you
have 200 cores but only 50 partitions, 150 of the cores will sit idle.
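A quick sketch of that mismatch and one way to fix it (numbers are illustrative, assuming a spark-shell sc):

    // 50 partitions on a 200-core cluster: at most 50 tasks run at a
    // time, so 150 cores sit idle.
    val coarse = sc.parallelize(1 to 1000000, numSlices = 50)

    // Repartitioning to a small multiple of the core count gives every
    // core work (the tuning docs suggest roughly 2-3 tasks per core).
    val balanced = coarse.repartition(400)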
Hi, Sparkers:
I come from the Hadoop MapReduce world and am trying to understand some of
Spark's internals. From the web and this list, I keep seeing people talk
about increasing the parallelism if you get an OOM error. I tried to read
as much of the documentation as possible to understand the RDD