Re: [Spark Structured Streaming] Measure metrics from CsvSink for Rate source

2018-06-28 Thread Dhruv Kumar
Hi

Can someone please take a look at the issue below? Any help is deeply appreciated.

--
Dhruv Kumar
PhD Candidate
Department of Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me

> On Jun 22, 2018, at 13:12, Dhruv Kumar wrote:
> 
> I actually tried the File source (reading CSV files as a stream and 
> processing them). The File source seems to generate valid numbers in the 
> metrics log files. I may be wrong, but this looks like an issue with how the 
> Rate source reports metrics to the metrics log files.
>   
> --
> Dhruv Kumar
> PhD Candidate
> Department of Computer Science and Engineering
> University of Minnesota
> www.dhruvkumar.me 
> 
>> On Jun 22, 2018, at 00:35, Dhruv Kumar wrote:
>> 
>> Thanks a lot for your mail, Jungtaek. I added the StreamingQueryListener to 
>> my code (updated code) and was able to see valid inputRowsPerSecond and 
>> processedRowsPerSecond numbers. But it also shows zeros intermittently; here 
>> is the sample output. Could you explain why this is the case? 
>> Unfortunately, the CSV files still show only zeros except for a few non-zero 
>> values. Do you know why this may be happening? (I changed metrics.properties 
>> to report every second instead of every 10 seconds.) 
>> Here is the output of the metrics log file 
>> (run_latest.driver.spark.streaming.aggregates.inputRate-total.csv)
>> t,value
>> 1529645042,0.0
>> 1529645043,0.0
>> 1529645044,0.0
>> 1529645045,NaN
>> 1529645046,88967.97153024911
>> 1529645047,100200.4008016032
>> 1529645048,122100.12210012211
>> 1529645049,0.0
>> 1529645050,0.0
>> 1529645051,0.0
>> 1529645052,0.0
>> 1529645053,0.0
>> 1529645054,0.0
>> 1529645055,0.0
>> 1529645056,0.0
>> 1529645057,0.0
>> 1529645058,0.0
>> 1529645059,0.0
>> 1529645060,0.0
>> 1529645061,0.0
>> 1529645062,0.0
>> 1529645063,0.0
>> 1529645064,0.0
>> 1529645065,0.0
>> 1529645066,0.0
>> 1529645067,0.0
>> 1529645068,0.0
>> 1529645069,0.0
>> 1529645070,0.0
>> 1529645071,0.0
>> 1529645072,93808.63039399624
>> 1529645073,0.0
>> 1529645074,0.0
>> 1529645075,0.0
>> 1529645076,0.0
>> 1529645077,0.0
>> 1529645078,0.0
>> 
>> 
>> 
>> --
>> Dhruv Kumar
>> PhD Candidate
>> Department of Computer Science and Engineering
>> University of Minnesota
>> www.dhruvkumar.me 
>> 
>>> On Jun 21, 2018, at 23:07, Jungtaek Lim wrote:
>>> 
>>> I'm referring to 2.4.0-SNAPSHOT (not sure which commit I'm referring to), 
>>> but it properly returns the input rate.
>>> 
>>> $ tail -F 
>>> /tmp/spark-trial-metric/local-1529640063554.driver.spark.streaming.counts.inputRate-total.csv
>>> t,value
>>> 1529640073,0.0
>>> 1529640083,0.9411272613196695
>>> 1529640093,0.9430996541967934
>>> 1529640103,1.0606060606060606
>>> 1529640113,0.9997000899730081
>>> 
>>> Could you add a streaming query listener and check the values of sources -> 
>>> numInputRows, inputRowsPerSecond, and processedRowsPerSecond? They should 
>>> provide some valid numbers.
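For readers following along: the listener's progress event carries a report that is also available as a JSON string, so the per-source numbers can be pulled out with nothing but the standard library. The sketch below assumes a report shaped like a StreamingQueryProgress JSON document; treat the exact field layout as an assumption, and the sample report as hypothetical:

```python
import json

def source_rates(progress_json):
    """Extract (numInputRows, inputRowsPerSecond, processedRowsPerSecond)
    for each source from a streaming-query progress report in JSON form."""
    progress = json.loads(progress_json)
    return [
        (s.get("numInputRows"),
         s.get("inputRowsPerSecond"),
         s.get("processedRowsPerSecond"))
        for s in progress.get("sources", [])
    ]

# Hypothetical progress report, shaped like what a query listener receives:
sample = '''{
  "id": "run_latest",
  "sources": [{
    "description": "RateSource[rowsPerSecond=100000]",
    "numInputRows": 100000,
    "inputRowsPerSecond": 100200.4,
    "processedRowsPerSecond": 88967.97
  }]
}'''
```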
>>> 
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>> 
>>> On Fri, Jun 22, 2018 at 11:49 AM, Dhruv Kumar wrote:
>>> Hi
>>> 
>>> I was trying to measure the performance metrics for Spark Structured 
>>> Streaming, but I am unable to see any data in the metrics log files. My 
>>> input source is the Rate source, which generates data at a specified 
>>> number of rows per second. Here are the links to my code and my 
>>> metrics.properties file.
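For reference, a CsvSink section in metrics.properties typically looks like the sketch below; the directory path is a placeholder, and the one-second period matches the setting described later in this thread. Note also that, as far as I know, Structured Streaming only publishes its query metrics to the metrics system when `spark.sql.streaming.metricsEnabled` is set to `true` on the Spark session.

```properties
# Write driver/executor metrics as CSV files (directory is a placeholder)
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
```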
>>> 
>>> When I run the above code using spark-submit, I see that the metrics logs 
>>> (for example, run_1.driver.spark.streaming.aggregates.inputRate-total.csv) 
>>> are created under the specified directory, but most of the values are 0. 
>>> Below is a portion of the inputRate-total.csv file:
>>> 1529634585,0.0
>>> 1529634595,0.0
>>> 1529634605,0.0
>>> 1529634615,0.0
>>> 1529634625,0.0
>>> 1529634635,0.0
>>> 1529634645,0.0
>>> 1529634655,0.0
>>> 1529634665,0.0
>>> 1529634675,0.0
>>> 1529634685,0.0
>>> 1529634695,0.0
>>> 1529634705,0.0
>>> 1529634715,0.0
>>> 
>>> Any idea why this might be happening? Happy to share more 
>>> information if that helps.
>>> 
>>> Thanks
>>> --
>>> Dhruv Kumar
>>> PhD Candidate
>>> Department of Computer Science and Engineering
>>> University of Minnesota
>>> www.dhruvkumar.me 
>> 
> 


