Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-23 Thread Mike Trienis
Yup, and since I have only one core per executor, that explains why only one
executor was utilized. I'll need to investigate which EC2 instance type is
going to be the best fit.
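
For what it's worth, here is a minimal sketch of how more cores might be granted per executor once a bigger instance type is chosen. The keys below are standard Spark configuration properties, but whether they are honoured depends on the Spark version and cluster manager, and the values are purely illustrative:

import org.apache.spark.SparkConf

// Illustrative values only -- the right numbers depend on the chosen EC2 instance type.
val conf = new SparkConf()
  .set("spark.executor.cores", "2")  // cores granted to each executor JVM
  .set("spark.cores.max", "6")       // cap on the total cores the application may claim (standalone)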

Thanks Evo.

On Fri, May 22, 2015 at 3:47 PM, Evo Eftimov evo.efti...@isecc.com wrote:

 A receiver occupies a CPU core; an executor is simply a JVM instance and
 as such it can be granted any number of cores and any amount of RAM.

 So check how many cores you have per executor


 Sent from Samsung Mobile


  Original message 
 From: Mike Trienis
 Date: 2015/05/22 21:51 (GMT+00:00)
 To: user@spark.apache.org
 Subject: Re: Spark Streaming: all tasks running on one executor (Kinesis +
 Mongodb)

 I guess each receiver occupies an executor, so there was only one executor
 available for processing the job.

 On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com
 wrote:

 Hi All,

 I have a cluster of four nodes (three workers and one master, with one core
 each) which consumes data from Kinesis at 15 second intervals using two
 streams (i.e. receivers). The job simply grabs the latest batch and pushes
 it to MongoDB. I believe that the problem is that all tasks are executed on
 a single worker node and never distributed to the others. This is true even
 after I set the number of concurrentJobs to 3. Overall, I would really like
 to increase throughput (i.e. more than 500 records / second) and understand
 why all executors are not being utilized.

 Here are some parameters I have set:

    - spark.streaming.blockInterval   200
    - spark.locality.wait             500
    - spark.streaming.concurrentJobs  3

 This is the code that's actually doing the writing:

 def write(rdd: RDD[Data], time: Time): Unit = {
   val result = doSomething(rdd, time)
   // Insert every record of every partition into MongoDB
   result.foreachPartition { partition =>
     partition.foreach(record => connection.insert(record))
   }
 }

 def doSomething(rdd: RDD[Data], time: Time): RDD[MyObject] = {
   rdd.flatMap(MyObject)
 }

 Any ideas as to how to improve the throughput?

 Thanks, Mike.





Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
Hi All,

I have a cluster of four nodes (three workers and one master, with one core
each) which consumes data from Kinesis at 15 second intervals using two
streams (i.e. receivers). The job simply grabs the latest batch and pushes
it to MongoDB. I believe that the problem is that all tasks are executed on
a single worker node and never distributed to the others. This is true even
after I set the number of concurrentJobs to 3. Overall, I would really like
to increase throughput (i.e. more than 500 records / second) and understand
why all executors are not being utilized.

Here are some parameters I have set:

   - spark.streaming.blockInterval   200
   - spark.locality.wait             500
   - spark.streaming.concurrentJobs  3
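
For illustration, a minimal sketch of how these might be applied programmatically (they could equally go in spark-defaults.conf or be passed with --conf; the values are just the ones listed above):

import org.apache.spark.SparkConf

// Sketch only: the same settings applied on a SparkConf before the streaming context is built
val conf = new SparkConf()
  .set("spark.streaming.blockInterval", "200")  // one block roughly every 200 ms per receiver
  .set("spark.locality.wait", "500")            // relax task locality after waiting 500 ms
  .set("spark.streaming.concurrentJobs", "3")   // allow up to 3 output jobs to run concurrently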

This is the code that's actually doing the writing:

def write(rdd: RDD[Data], time: Time): Unit = {
  val result = doSomething(rdd, time)
  // Insert every record of every partition into MongoDB
  result.foreachPartition { partition =>
    partition.foreach(record => connection.insert(record))
  }
}

def doSomething(rdd: RDD[Data], time: Time): RDD[MyObject] = {
  rdd.flatMap(MyObject)
}
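
One detail worth spelling out in the snippet above: connection has to be usable on the executors, not just on the driver. A hedged variant of the write loop, assuming the client can be created on the executors through a hypothetical createConnection() helper (not from the original code):

// Sketch only: open one MongoDB client per partition instead of sharing a driver-side object.
// createConnection() is a hypothetical helper standing in for whatever client setup is used.
result.foreachPartition { partition =>
  val connection = createConnection()
  try {
    partition.foreach(record => connection.insert(record))
  } finally {
    connection.close()  // release the client once the partition is done
  }
}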

Any ideas as to how to improve the throughput?

Thanks, Mike.


Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Mike Trienis
I guess each receiver occupies an executor, so there was only one executor
available for processing the job.

On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com
wrote:

 Hi All,

 I have a cluster of four nodes (three workers and one master, with one core
 each) which consumes data from Kinesis at 15 second intervals using two
 streams (i.e. receivers). The job simply grabs the latest batch and pushes
 it to MongoDB. I believe that the problem is that all tasks are executed on
 a single worker node and never distributed to the others. This is true even
 after I set the number of concurrentJobs to 3. Overall, I would really like
 to increase throughput (i.e. more than 500 records / second) and understand
 why all executors are not being utilized.

 Here are some parameters I have set:

    - spark.streaming.blockInterval   200
    - spark.locality.wait             500
    - spark.streaming.concurrentJobs  3

 This is the code that's actually doing the writing:

 def write(rdd: RDD[Data], time: Time): Unit = {
   val result = doSomething(rdd, time)
   // Insert every record of every partition into MongoDB
   result.foreachPartition { partition =>
     partition.foreach(record => connection.insert(record))
   }
 }

 def doSomething(rdd: RDD[Data], time: Time): RDD[MyObject] = {
   rdd.flatMap(MyObject)
 }

 Any ideas as to how to improve the throughput?

 Thanks, Mike.



Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

2015-05-22 Thread Evo Eftimov
A receiver occupies a CPU core; an executor is simply a JVM instance and as
such it can be granted any number of cores and any amount of RAM.

So check how many cores you have per executor
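
To make the arithmetic concrete for the cluster described in this thread (three one-core workers and two Kinesis receivers), a quick back-of-the-envelope sketch:

// Back-of-the-envelope accounting based on the cluster described in this thread
val totalCores    = 3                          // three workers, one core each
val receiverCores = 2                          // two Kinesis receivers, each pins one core
val freeCores     = totalCores - receiverCores // = 1 core left to process the batches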


Sent from Samsung Mobile

 Original message 
From: Mike Trienis mike.trie...@orcsol.com
Date: 2015/05/22 21:51 (GMT+00:00)
To: user@spark.apache.org
Subject: Re: Spark Streaming: all tasks running on one executor (Kinesis + Mongodb)

I guess each receiver occupies an executor, so there was only one executor
available for processing the job.

On Fri, May 22, 2015 at 1:24 PM, Mike Trienis mike.trie...@orcsol.com wrote:
Hi All,

I have a cluster of four nodes (three workers and one master, with one core each)
which consumes data from Kinesis at 15 second intervals using two streams (i.e. 
receivers). The job simply grabs the latest batch and pushes it to MongoDB. I 
believe that the problem is that all tasks are executed on a single worker node 
and never distributed to the others. This is true even after I set the number 
of concurrentJobs to 3. Overall, I would really like to increase throughput 
(i.e. more than 500 records / second) and understand why all executors are not 
being utilized. 

Here are some parameters I have set: 
spark.streaming.blockInterval     200
spark.locality.wait               500
spark.streaming.concurrentJobs    3
This is the code that's actually doing the writing:

def write(rdd: RDD[Data], time: Time): Unit = {
    val result = doSomething(rdd, time)
    // Insert every record of every partition into MongoDB
    result.foreachPartition { partition =>
        partition.foreach(record => connection.insert(record))
    }
}

def doSomething(rdd: RDD[Data], time: Time): RDD[MyObject] = {
    rdd.flatMap(MyObject)
}

Any ideas as to how to improve the throughput?

Thanks, Mike.