How to change the parallelism level of input DStreams
Dear list,

A quick question about Spark Streaming. Say I have this stage set up in my Spark Streaming cluster:

batched TCP stream ==> map(expensive computation) ==> reduceByKey

I know I can set the number of tasks for reduceByKey, but I didn't find a place to specify the parallelism for the input DStream (the sequence of RDDs generated from the TCP stream). Do I need to explicitly call repartition() to split the input RDDs into many partitions? If so, what mechanism is used to split them: a fully random repartition of each (K, V) pair (effectively a shuffle), or something more like a rebalance? And what is the default parallelism level for an input stream?

Thank you so much,
-Mo
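The pipeline Mo describes can be sketched as follows. This is a minimal, untested sketch: the host/port, batch interval, partition count, and `expensiveComputation` are all placeholders, not anything from the thread. Note that `repartition()` redistributes the records of each batch's RDD via a full shuffle; it is not a lightweight rebalance.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object RepartitionSketch {
  // Placeholder for the expensive computation mentioned in the question.
  def expensiveComputation(s: String): Int = s.length

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RepartitionSketch")
    val ssc  = new StreamingContext(conf, Seconds(1)) // 1s batches (placeholder)

    // Receiver-based TCP input; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Explicitly spread each batch's RDD across more partitions *before*
    // the expensive map, so the map itself runs with higher parallelism.
    val spread = lines.repartition(8)

    val result = spread
      .map(line => (line, expensiveComputation(line))) // expensive step
      .reduceByKey(_ + _, 8)                           // numTasks for the reduce

    result.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```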
Parallelism level
Hi all,

I have put this line in my spark-env.sh: -Dspark.default.parallelism=20. Is this parallelism level correct? The machine's processor is a dual core.

Thanks

--
Privacy notice: http://www.unibs.it/node/8155
Re: Parallelism level
If you're running on one machine with 2 cores, I believe all you can get out of it is 2 concurrent tasks at any one time, so setting your default parallelism to 20 won't help.

On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:
Re: Parallelism level
What do you advise, Nicholas?

On 4/4/14, 19:05, Nicholas Chammas wrote:
Re: Parallelism level
If you want more parallelism, you need more cores. So use a machine with more cores, or use a cluster of machines; spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the easiest way to do this. If you're stuck on a single machine with 2 cores, then set your default parallelism to 2. Setting it to a higher number won't do anything helpful.

On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia <e.costaalf...@unibs.it> wrote:
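One way to follow this advice without hard-coding a number in spark-env.sh is to size the default parallelism from the cores the JVM actually sees. A minimal sketch (the app name is a placeholder; on the dual-core machine in question this sets the value to 2):

```scala
import org.apache.spark.SparkConf

object ParallelismSketch {
  def main(args: Array[String]): Unit = {
    // availableProcessors() returns 2 on a dual-core box.
    val cores = Runtime.getRuntime.availableProcessors()

    val conf = new SparkConf()
      .setAppName("ParallelismSketch") // placeholder name
      .set("spark.default.parallelism", cores.toString)
  }
}
```

Setting it programmatically on the SparkConf also avoids depending on how spark-env.sh passes JVM options through to the driver and executors.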