Sorry, I made a mistake. Please ignore my question.

On Tue, Mar 3, 2015 at 2:47 AM, Saiph Kappa <saiph.ka...@gmail.com> wrote:
> I performed repartitioning and everything went fine with respect to
> the number of CPU cores being used (and the respective times).
> However, I noticed something very strange: inside a map operation I
> was doing a very simple calculation, always on the same dataset
> (small enough to be entirely processed in the same batch); then I
> iterated the RDDs and calculated the mean:
> "foreachRDD(rdd => println("MEAN: " + rdd.mean()))". I noticed that
> for different numbers of partitions (for instance, 4 and 8), the
> result of the mean is different. Why does this happen?
>
> On Thu, Feb 26, 2015 at 7:03 PM, Tathagata Das <t...@databricks.com>
> wrote:
>
>> If you have one receiver, and you are doing only map-like
>> operations, then the processing will primarily happen on one
>> machine. To use all the machines, either receive in parallel with
>> multiple receivers, or spread out the computation by explicitly
>> repartitioning the received streams (DStream.repartition) with
>> sufficient partitions to load-balance across more machines.
>>
>> TD
>>
>> On Thu, Feb 26, 2015 at 9:52 AM, Saiph Kappa <saiph.ka...@gmail.com>
>> wrote:
>>
>>> One more question: while processing the exact same batch, I noticed
>>> that giving more CPUs to the worker does not decrease the duration
>>> of the batch. I tried this with 4 and 8 CPUs. With only 1 CPU the
>>> duration did increase, but apart from that the values were pretty
>>> similar whether I was using 4, 6, or 8 CPUs.
>>>
>>> On Thu, Feb 26, 2015 at 5:35 PM, Saiph Kappa
>>> <saiph.ka...@gmail.com> wrote:
>>>
>>>> By setting spark.eventLog.enabled to true it is possible to see
>>>> the application UI after the application has finished its
>>>> execution; however, the Streaming tab is no longer visible.
>>>>
>>>> For measuring the duration of batches in the code I am doing
>>>> something like this:
>>>>
>>>> wordCharValues.foreachRDD(rdd => {
>>>>   val startTick = System.currentTimeMillis()
>>>>   val result = rdd.take(1)
>>>>   val timeDiff = System.currentTimeMillis() - startTick
>>>> })
>>>>
>>>> But my question is: is it possible to see the rate/throughput
>>>> (records/sec) when I have a stream that processes log files
>>>> appearing in a folder?
>>>>
>>>> On Thu, Feb 26, 2015 at 1:36 AM, Tathagata Das
>>>> <t...@databricks.com> wrote:
>>>>
>>>>> Yes. # tuples processed in a batch = sum of all the tuples
>>>>> received by all the receivers.
>>>>>
>>>>> In the screenshot, there was a batch with 69.9K records, and
>>>>> there was a batch which took 1 s 473 ms. These two batches may
>>>>> or may not be the same batch.
>>>>>
>>>>> TD
>>>>>
>>>>> On Wed, Feb 25, 2015 at 10:11 AM, Josh J <joshjd...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> If I'm using the Kafka receiver, can I assume the number of
>>>>>> records processed in the batch is the sum of the number of
>>>>>> records processed by the Kafka receivers?
>>>>>>
>>>>>> So in the attached screenshot, is the max rate of tuples
>>>>>> processed in a batch 42.7K + 27.2K = 69.9K tuples, with a max
>>>>>> processing time of 1 second 473 ms?
>>>>>>
>>>>>> On Wed, Feb 25, 2015 at 8:48 AM, Akhil Das
>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>
>>>>>>> By throughput, do you mean the number of events processed,
>>>>>>> etc.?
>>>>>>>
>>>>>>> [image: Inline image 1]
>>>>>>>
>>>>>>> The Streaming tab already has these statistics.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Best Regards
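For reference, a minimal, self-contained sketch of the DStream.repartition approach TD suggests above (Scala, Spark 1.x). The socket source, port, app name, and map function are illustrative stand-ins, not the original job:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("RepartitionSketch") // illustrative name
    val ssc = new StreamingContext(conf, Seconds(1))

    // A single receiver places all received blocks on one machine.
    val lines = ssc.socketTextStream("localhost", 9999)

    // Repartition the received stream so downstream map-like work is
    // spread across more cores/machines; pick a partition count that
    // matches the parallelism you want (8 here, as in the thread).
    val spread = lines.repartition(8)

    // Same shape as the example earlier in the thread: map each record
    // to a double, then print the mean of each batch's RDD.
    val values = spread.map(_.length.toDouble)
    values.foreachRDD(rdd => println("MEAN: " + rdd.mean()))

    ssc.start()
    ssc.awaitTermination()

(One plausible explanation for the differing means: changing the partition count changes the order in which floating-point values are summed, and floating-point addition is not associative, so rdd.mean() can differ slightly between 4 and 8 partitions.)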
>>>>>>> On Wed, Feb 25, 2015 at 9:59 PM, Josh J <joshjd...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Wed, Feb 25, 2015 at 7:54 AM, Akhil Das
>>>>>>>> <ak...@sigmoidanalytics.com> wrote:
>>>>>>>>
>>>>>>>>> For Spark Streaming applications, there is already a tab
>>>>>>>>> called "Streaming" which displays the basic statistics.
>>>>>>>>
>>>>>>>> Would I just need to extend this tab to add the throughput?
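On the records/sec question: besides extending the Streaming tab, per-batch throughput can also be computed with a StreamingListener. A minimal sketch, assuming Spark 1.x, where BatchInfo exposes numRecords and processingDelay (the listener name is illustrative):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    class ThroughputListener extends StreamingListener {
      override def onBatchCompleted(
          batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val info = batchCompleted.batchInfo
        val records = info.numRecords                   // records in this batch
        val procMs = info.processingDelay.getOrElse(0L) // processing time in ms
        if (procMs > 0) {
          val rate = records * 1000.0 / procMs
          println(f"Batch ${info.batchTime}: $records records at $rate%.1f records/sec")
        }
      }
    }

    // Registered on an existing StreamingContext (here called ssc):
    // ssc.addStreamingListener(new ThroughputListener)

One caveat: how numRecords is populated depends on the input source and Spark version, so it is worth verifying this for a file-based stream watching a folder.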