How to change the parallelism level of input DStreams

2014-04-09 Thread Dong Mo

Dear list,

A quick question about spark streaming:

Say I have this stage set up in my Spark Streaming cluster:

batched TCP stream ==> map(expensive computation) ==> reduceByKey

I know I can set the number of tasks for reduceByKey.

But I didn't find a place to specify the parallelism for the input
DStream (the sequence of RDDs generated from the TCP stream). Do I need to
explicitly call repartition() to split the input RDDs into many
partitions? If so, what mechanism is used to split them: a random full
repartition of every (K,V) pair (effectively a shuffle), or something more
like a rebalance?
And what is the default parallelism level for the input stream?

Thank you so much
-Mo
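
For reference, a minimal sketch of the two knobs being asked about, assuming
a socketTextStream source; expensiveComputation, the host/port, and the
partition count 8 are placeholders, not anything from the original message:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Hypothetical stand-in for the real per-record work.
    def expensiveComputation(line: String): Int = line.length

    val conf = new SparkConf()
      .setAppName("ParallelismSketch")
      .setMaster("local[*]")                            // placeholder master for a local run
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines  = ssc.socketTextStream("host", 9999)     // input DStream from the TCP source
    val spread = lines.repartition(8)                   // splits each batch's RDD into 8 partitions
    val counts = spread
      .map(line => (line, expensiveComputation(line)))
      .reduceByKey(_ + _, 8)                            // explicit task count for the reduce step

    counts.print()
    ssc.start()
    ssc.awaitTermination()

The second argument to reduceByKey is the numTasks knob mentioned above;
repartition(8) is the explicit split of the input stream the question asks about.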


Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

Hi all,

I have put this line in my spark-env.sh:
-Dspark.default.parallelism=20

Is this parallelism level correct?
The machine's processor is a dual-core.

Thanks



Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you're running on one machine with 2 cores, I believe all you can get
out of it are 2 concurrent tasks at any one time. So setting your default
parallelism to 20 won't help.


On Fri, Apr 4, 2014 at 11:41 AM, Eduardo Costa Alfaia 
e.costaalf...@unibs.it wrote:

 Hi all,

 I have put this line in my spark-env.sh:
 -Dspark.default.parallelism=20

 Is this parallelism level correct?
 The machine's processor is a dual-core.

 Thanks




Re: Parallelism level

2014-04-04 Thread Eduardo Costa Alfaia

What do you advise, Nicholas?

On 4/4/14, 19:05, Nicholas Chammas wrote:
If you're running on one machine with 2 cores, I believe all you can 
get out of it are 2 concurrent tasks at any one time. So setting your 
default parallelism to 20 won't help.


Re: Parallelism level

2014-04-04 Thread Nicholas Chammas
If you want more parallelism, you need more cores. So, use a machine with
more cores, or use a cluster of machines.
spark-ec2 (https://spark.apache.org/docs/latest/ec2-scripts.html) is the
easiest way to do this.

If you're stuck on a single machine with 2 cores, then set your default
parallelism to 2. Setting it to a higher number won't do anything helpful.
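
For what it's worth, a minimal sketch of pinning both values to the two
available cores, assuming the programmatic SparkConf route rather than
spark-env.sh; the app name is a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("ParallelismSketch")
      .setMaster("local[2]")                      // dual-core machine: at most 2 concurrent tasks
      .set("spark.default.parallelism", "2")      // default partition count for shuffle operations
    val sc = new SparkContext(conf)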


On Fri, Apr 4, 2014 at 2:47 PM, Eduardo Costa Alfaia e.costaalf...@unibs.it
 wrote:

 What do you advise, Nicholas?
