Re: throughput in the web console?

2015-03-03 Thread Saiph Kappa
Sorry I made a mistake. Please ignore my question.

On Tue, Mar 3, 2015 at 2:47 AM, Saiph Kappa saiph.ka...@gmail.com wrote:

 I performed repartitioning and everything went fine with respect to the
 number of CPU cores being used (and respective times). However, I noticed
 something very strange: inside a map operation I was doing a very simple
 calculation and always using the same dataset (small enough to be entirely
 processed in the same batch); then I iterated the RDDs and calculated the
 mean: foreachRDD(rdd => println("MEAN: " + rdd.mean())). I noticed that
 for different numbers of partitions (for instance, 4 and 8), the result of
 the mean is different. Why does this happen?

 On Thu, Feb 26, 2015 at 7:03 PM, Tathagata Das t...@databricks.com
 wrote:

 If you have one receiver, and you are doing only map-like operations, then
 the processing will primarily happen on one machine. To use all the machines,
 either receive in parallel with multiple receivers, or spread out the
 computation by explicitly repartitioning the received streams
 (DStream.repartition) with sufficient partitions to load balance across
 more machines.

 TD

 On Thu, Feb 26, 2015 at 9:52 AM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 One more question: while processing the exact same batch I noticed that
 giving more CPUs to the worker does not decrease the duration of the batch.
 I tried this with 4 and 8 CPUs. I did notice that with only 1 CPU the
 duration increased, but apart from that the values were pretty similar,
 whether I was using 4, 6, or 8 CPUs.

 On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 By setting spark.eventLog.enabled to true it is possible to see the
 application UI after the application has finished its execution; however,
 the Streaming tab is no longer visible.

 For measuring the duration of batches in the code I am doing something
 like this:
 wordCharValues.foreachRDD(rdd => {
   val startTick = System.currentTimeMillis()
   val result = rdd.take(1)
   val timeDiff = System.currentTimeMillis() - startTick
 })

 But my question is: is it possible to see the rate/throughput
 (records/sec) when I have a stream to process log files that appear in a
 folder?



 On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
 wrote:

 Yes. # tuples processed in a batch = sum of all the tuples received by
 all the receivers.

 In the screenshot, there was a batch with 69.9K records, and there was a
 batch which took 1 s 473 ms. These two batches may or may not be the same
 batch.

 TD

 On Wed, Feb 25, 2015 at 10:11 AM, Josh J joshjd...@gmail.com wrote:

 If I'm using the kafka receiver, can I assume the number of records
 processed in the batch is the sum of the number of records processed by the
 kafka receiver?

 So in the screen shot attached the max rate of tuples processed in a
 batch is 42.7K + 27.2K = 69.9K tuples processed in a batch with a max
 processing time of 1 second 473 ms?

 On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das 
 ak...@sigmoidanalytics.com wrote:

 By throughput you mean Number of events processed etc?

 [image: Inline image 1]

 The Streaming tab already has these statistics.



 Thanks
 Best Regards

 On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das 
 ak...@sigmoidanalytics.com wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?














Re: throughput in the web console?

2015-03-02 Thread Saiph Kappa
I performed repartitioning and everything went fine with respect to the
number of CPU cores being used (and respective times). However, I noticed
something very strange: inside a map operation I was doing a very simple
calculation and always using the same dataset (small enough to be entirely
processed in the same batch); then I iterated the RDDs and calculated the
mean: foreachRDD(rdd => println("MEAN: " + rdd.mean())). I noticed that
for different numbers of partitions (for instance, 4 and 8), the result of
the mean is different. Why does this happen?

On Thu, Feb 26, 2015 at 7:03 PM, Tathagata Das t...@databricks.com wrote:

 If you have one receiver, and you are doing only map-like operations, then
 the processing will primarily happen on one machine. To use all the machines,
 either receive in parallel with multiple receivers, or spread out the
 computation by explicitly repartitioning the received streams
 (DStream.repartition) with sufficient partitions to load balance across
 more machines.

 TD

 On Thu, Feb 26, 2015 at 9:52 AM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 One more question: while processing the exact same batch I noticed that
 giving more CPUs to the worker does not decrease the duration of the batch.
 I tried this with 4 and 8 CPUs. I did notice that with only 1 CPU the
 duration increased, but apart from that the values were pretty similar,
 whether I was using 4, 6, or 8 CPUs.

 On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 By setting spark.eventLog.enabled to true it is possible to see the
 application UI after the application has finished its execution; however,
 the Streaming tab is no longer visible.

 For measuring the duration of batches in the code I am doing something
 like this:
 wordCharValues.foreachRDD(rdd => {
   val startTick = System.currentTimeMillis()
   val result = rdd.take(1)
   val timeDiff = System.currentTimeMillis() - startTick
 })

 But my question is: is it possible to see the rate/throughput
 (records/sec) when I have a stream to process log files that appear in a
 folder?



 On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
 wrote:

 Yes. # tuples processed in a batch = sum of all the tuples received by
 all the receivers.

 In the screenshot, there was a batch with 69.9K records, and there was a
 batch which took 1 s 473 ms. These two batches may or may not be the same
 batch.

 TD

 On Wed, Feb 25, 2015 at 10:11 AM, Josh J joshjd...@gmail.com wrote:

 If I'm using the kafka receiver, can I assume the number of records
 processed in the batch is the sum of the number of records processed by the
 kafka receiver?

 So in the screen shot attached the max rate of tuples processed in a
 batch is 42.7K + 27.2K = 69.9K tuples processed in a batch with a max
 processing time of 1 second 473 ms?

 On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

 By throughput you mean Number of events processed etc?

 [image: Inline image 1]

 The Streaming tab already has these statistics.



 Thanks
 Best Regards

 On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das 
 ak...@sigmoidanalytics.com wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?













Re: throughput in the web console?

2015-02-26 Thread Saiph Kappa
One more question: while processing the exact same batch I noticed that
giving more CPUs to the worker does not decrease the duration of the batch.
I tried this with 4 and 8 CPUs. I did notice that with only 1 CPU the
duration increased, but apart from that the values were pretty similar,
whether I was using 4, 6, or 8 CPUs.

On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa saiph.ka...@gmail.com wrote:

 By setting spark.eventLog.enabled to true it is possible to see the
 application UI after the application has finished its execution; however,
 the Streaming tab is no longer visible.

 For measuring the duration of batches in the code I am doing something
 like this:
 wordCharValues.foreachRDD(rdd => {
   val startTick = System.currentTimeMillis()
   val result = rdd.take(1)
   val timeDiff = System.currentTimeMillis() - startTick
 })

 But my question is: is it possible to see the rate/throughput
 (records/sec) when I have a stream to process log files that appear in a
 folder?



 On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
 wrote:

 Yes. # tuples processed in a batch = sum of all the tuples received by
 all the receivers.

 In the screenshot, there was a batch with 69.9K records, and there was a
 batch which took 1 s 473 ms. These two batches may or may not be the same
 batch.

 TD

 On Wed, Feb 25, 2015 at 10:11 AM, Josh J joshjd...@gmail.com wrote:

 If I'm using the kafka receiver, can I assume the number of records
 processed in the batch is the sum of the number of records processed by the
 kafka receiver?

 So in the screen shot attached the max rate of tuples processed in a
 batch is 42.7K + 27.2K = 69.9K tuples processed in a batch with a max
 processing time of 1 second 473 ms?

 On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 By throughput you mean Number of events processed etc?

 [image: Inline image 1]

 The Streaming tab already has these statistics.



 Thanks
 Best Regards

 On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
  wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?











Re: throughput in the web console?

2015-02-26 Thread Saiph Kappa
By setting spark.eventLog.enabled to true it is possible to see the
application UI after the application has finished its execution; however,
the Streaming tab is no longer visible.

For measuring the duration of batches in the code I am doing something like
this:
wordCharValues.foreachRDD(rdd => {
  val startTick = System.currentTimeMillis()
  val result = rdd.take(1)
  val timeDiff = System.currentTimeMillis() - startTick
})

But my question is: is it possible to see the rate/throughput (records/sec)
when I have a stream to process log files that appear in a folder?
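
One way to approximate this from user code (a rough sketch, not from the thread:
it assumes the DStream is still called wordCharValues and that the batch interval
is known, here taken to be 2 seconds) is to count the records in each batch and
divide by the batch duration:

val batchIntervalSecs = 2L  // assumption: must match the StreamingContext batch duration
wordCharValues.foreachRDD { rdd =>
  val count = rdd.count()                        // records in this batch
  val rate = count.toDouble / batchIntervalSecs  // approximate records/sec
  println(s"Processed $count records (~$rate records/sec)")
}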



On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com wrote:

 Yes. # tuples processed in a batch = sum of all the tuples received by all
 the receivers.

 In the screenshot, there was a batch with 69.9K records, and there was a
 batch which took 1 s 473 ms. These two batches may or may not be the same
 batch.

 TD

 On Wed, Feb 25, 2015 at 10:11 AM, Josh J joshjd...@gmail.com wrote:

 If I'm using the kafka receiver, can I assume the number of records
 processed in the batch is the sum of the number of records processed by the
 kafka receiver?

 So in the screen shot attached the max rate of tuples processed in a
 batch is 42.7K + 27.2K = 69.9K tuples processed in a batch with a max
 processing time of 1 second 473 ms?

 On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 By throughput you mean Number of events processed etc?

 [image: Inline image 1]

 The Streaming tab already has these statistics.



 Thanks
 Best Regards

 On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?










Re: throughput in the web console?

2015-02-26 Thread Tathagata Das
If you have one receiver, and you are doing only map-like operations, then
the processing will primarily happen on one machine. To use all the machines,
either receive in parallel with multiple receivers, or spread out the
computation by explicitly repartitioning the received streams
(DStream.repartition) with sufficient partitions to load balance across
more machines.
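
A minimal sketch of both options (illustrative only; the StreamingContext setup,
host/port, receiver count, and partition count below are assumptions, not taken
from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("throughput-test")
val ssc = new StreamingContext(conf, Seconds(2))

// Receive in parallel with multiple receivers, then union them into one stream.
val numReceivers = 2
val streams = (1 to numReceivers).map(_ => ssc.socketTextStream("localhost", 9999))
val unioned = ssc.union(streams)

// Or spread out the computation by repartitioning the received stream,
// e.g. to roughly match the total number of cores available.
val repartitioned = unioned.repartition(8)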

TD

On Thu, Feb 26, 2015 at 9:52 AM, Saiph Kappa saiph.ka...@gmail.com wrote:

 One more question: while processing the exact same batch I noticed that
 giving more CPUs to the worker does not decrease the duration of the batch.
 I tried this with 4 and 8 CPUs. I did notice that with only 1 CPU the
 duration increased, but apart from that the values were pretty similar,
 whether I was using 4, 6, or 8 CPUs.

 On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa saiph.ka...@gmail.com
 wrote:

 By setting spark.eventLog.enabled to true it is possible to see the
 application UI after the application has finished its execution; however,
 the Streaming tab is no longer visible.

 For measuring the duration of batches in the code I am doing something
 like this:
 wordCharValues.foreachRDD(rdd => {
   val startTick = System.currentTimeMillis()
   val result = rdd.take(1)
   val timeDiff = System.currentTimeMillis() - startTick
 })

 But my question is: is it possible to see the rate/throughput
 (records/sec) when I have a stream to process log files that appear in a
 folder?



 On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das t...@databricks.com
 wrote:

 Yes. # tuples processed in a batch = sum of all the tuples received by
 all the receivers.

 In the screenshot, there was a batch with 69.9K records, and there was a
 batch which took 1 s 473 ms. These two batches may or may not be the same
 batch.

 TD

 On Wed, Feb 25, 2015 at 10:11 AM, Josh J joshjd...@gmail.com wrote:

 If I'm using the kafka receiver, can I assume the number of records
 processed in the batch is the sum of the number of records processed by the
 kafka receiver?

 So in the screen shot attached the max rate of tuples processed in a
 batch is 42.7K + 27.2K = 69.9K tuples processed in a batch with a max
 processing time of 1 second 473 ms?

 On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 By throughput you mean Number of events processed etc?

 [image: Inline image 1]

 The Streaming tab already has these statistics.



 Thanks
 Best Regards

 On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das 
 ak...@sigmoidanalytics.com wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?












Re: throughput in the web console?

2015-02-25 Thread Josh J
Let me ask it this way: what would be the easiest way to display the
throughput in the web console? Would I need to create a new tab and add the
metrics? Any good or simple examples showing how this can be done?

On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Did you have a look at


 https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener

 And for Streaming:


 https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener



 Thanks
 Best Regards

 On Tue, Feb 24, 2015 at 10:29 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 I plan to run a parameter search varying the number of cores, epoch, and
 parallelism. The web console provides a way to archive the previous runs,
 though is there a way to view the throughput in the console, rather than
 logging the throughput separately to the log files and correlating the log
 files with the web console processing times?

 Thanks,
 Josh





Re: throughput in the web console?

2015-02-25 Thread Akhil Das
For SparkStreaming applications, there is already a tab called Streaming
which displays the basic statistics.

Thanks
Best Regards

On Wed, Feb 25, 2015 at 8:55 PM, Josh J joshjd...@gmail.com wrote:

 Let me ask it this way: what would be the easiest way to display the
 throughput in the web console? Would I need to create a new tab and add the
 metrics? Any good or simple examples showing how this can be done?

 On Wed, Feb 25, 2015 at 12:07 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Did you have a look at


 https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener

 And for Streaming:


 https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener



 Thanks
 Best Regards

 On Tue, Feb 24, 2015 at 10:29 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 I plan to run a parameter search varying the number of cores, epoch, and
 parallelism. The web console provides a way to archive the previous runs,
 though is there a way to view the throughput in the console, rather than
 logging the throughput separately to the log files and correlating the log
 files with the web console processing times?

 Thanks,
 Josh






Re: throughput in the web console?

2015-02-25 Thread Josh J
On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 For SparkStreaming applications, there is already a tab called Streaming
 which displays the basic statistics.


Would I just need to extend this tab to add the throughput?


Re: throughput in the web console?

2015-02-25 Thread Akhil Das
By throughput you mean Number of events processed etc?

[image: Inline image 1]

The Streaming tab already has these statistics.



Thanks
Best Regards

On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?



Re: throughput in the web console?

2015-02-25 Thread Otis Gospodnetic
Hi Josh,

SPM will show you this info. I see you use Kafka, too, whose numerous metrics
you can also see in SPM side by side with your Spark metrics. Sounds like
trends are what you are after, so I hope this helps. See http://sematext.com/spm

Otis

 

 On Feb 24, 2015, at 11:59, Josh J joshjd...@gmail.com wrote:
 
 Hi,
 
 I plan to run a parameter search varying the number of cores, epoch, and 
 parallelism. The web console provides a way to archive the previous runs, 
 though is there a way to view the throughput in the console, rather than
 logging the throughput separately to the log files and correlating the log
 files with the web console processing times?
 
 Thanks,
 Josh




Re: throughput in the web console?

2015-02-25 Thread Tathagata Das
Yes. # tuples processed in a batch = sum of all the tuples received by all
the receivers.

In the screenshot, there was a batch with 69.9K records, and there was a batch
which took 1 s 473 ms. These two batches may or may not be the same batch.

TD

On Wed, Feb 25, 2015 at 10:11 AM, Josh J joshjd...@gmail.com wrote:

 If I'm using the kafka receiver, can I assume the number of records
 processed in the batch is the sum of the number of records processed by the
 kafka receiver?

 So in the screen shot attached the max rate of tuples processed in a batch
 is 42.7K + 27.2K = 69.9K tuples processed in a batch with a max processing
 time of 1 second 473 ms?

 On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 By throughput you mean Number of events processed etc?

 [image: Inline image 1]

 The Streaming tab already has these statistics.



 Thanks
 Best Regards

 On Wed, Feb 25, 2015 at 9:59 PM, Josh J joshjd...@gmail.com wrote:


 On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 For SparkStreaming applications, there is already a tab called
 Streaming which displays the basic statistics.


 Would I just need to extend this tab to add the throughput?








Re: throughput in the web console?

2015-02-25 Thread Akhil Das
Did you have a look at

https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.scheduler.SparkListener

And for Streaming:

https://spark.apache.org/docs/1.0.2/api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener
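
As a rough sketch of how such a listener might be wired up (illustrative only:
the exact BatchInfo fields, such as numRecords, vary across Spark versions, so
treat this as a starting point rather than the exact 1.0.2 API):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class ThroughputListener extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val records = info.numRecords                    // total records in the batch (field name varies by version)
    val procMs = info.processingDelay.getOrElse(0L)  // processing time in ms
    val rate = if (procMs > 0) records * 1000.0 / procMs else 0.0
    println(s"Batch: $records records in $procMs ms (~$rate records/sec)")
  }
}

// Register it on the StreamingContext (assumed to be called ssc):
// ssc.addStreamingListener(new ThroughputListener)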



Thanks
Best Regards

On Tue, Feb 24, 2015 at 10:29 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 I plan to run a parameter search varying the number of cores, epoch, and
 parallelism. The web console provides a way to archive the previous runs,
 though is there a way to view the throughput in the console, rather than
 logging the throughput separately to the log files and correlating the log
 files with the web console processing times?

 Thanks,
 Josh



throughput in the web console?

2015-02-24 Thread Josh J
Hi,

I plan to run a parameter search varying the number of cores, epoch, and
parallelism. The web console provides a way to archive the previous runs,
though is there a way to view the throughput in the console, rather than
logging the throughput separately to the log files and correlating the log
files with the web console processing times?

Thanks,
Josh