Hi, could someone please take a look at the thread below? Any help is deeply appreciated.
--------------------------------------------------
Dhruv Kumar
PhD Candidate
Department of Computer Science and Engineering
University of Minnesota
www.dhruvkumar.me

> On Jun 22, 2018, at 13:12, Dhruv Kumar <gargdhru...@gmail.com> wrote:
>
> I actually tried the File source (reading CSV files as a stream and
> processing them). The File source seems to generate valid numbers in the
> metrics log files. I may be wrong, but it looks like an issue with how the
> Rate source's metrics end up in the metrics log files.
>
>> On Jun 22, 2018, at 00:35, Dhruv Kumar <gargdhru...@gmail.com> wrote:
>>
>> Thanks a lot for your mail, Jungtaek. I added the StreamingQueryListener
>> to my code (updated code:
>> https://gist.github.com/kudhru/e1ce6b3f399c546be5eeb1f590087992) and was
>> able to see valid inputRowsPerSecond and processedRowsPerSecond numbers.
>> But it also shows zeros intermittently. Here is the sample output:
>> https://gist.github.com/kudhru/db2bced789c528464620ae1767597127
>> Could you explain why this is the case?
>> Unfortunately, the CSV files still show mostly zeros, with only a few
>> non-zero values. Do you know why this may be happening? (I changed
>> metrics.properties to print every second instead of every 10 seconds.)
>>
>> Here is the output of the metrics log file
>> (run_latest.driver.spark.streaming.aggregates.inputRate-total.csv):
>>
>> t,value
>> 1529645042,0.0
>> 1529645043,0.0
>> 1529645044,0.0
>> 1529645045,NaN
>> 1529645046,88967.97153024911
>> 1529645047,100200.4008016032
>> 1529645048,122100.12210012211
>> 1529645049,0.0
>> 1529645050,0.0
>> 1529645051,0.0
>> 1529645052,0.0
>> 1529645053,0.0
>> 1529645054,0.0
>> 1529645055,0.0
>> 1529645056,0.0
>> 1529645057,0.0
>> 1529645058,0.0
>> 1529645059,0.0
>> 1529645060,0.0
>> 1529645061,0.0
>> 1529645062,0.0
>> 1529645063,0.0
>> 1529645064,0.0
>> 1529645065,0.0
>> 1529645066,0.0
>> 1529645067,0.0
>> 1529645068,0.0
>> 1529645069,0.0
>> 1529645070,0.0
>> 1529645071,0.0
>> 1529645072,93808.63039399624
>> 1529645073,0.0
>> 1529645074,0.0
>> 1529645075,0.0
>> 1529645076,0.0
>> 1529645077,0.0
>> 1529645078,0.0
>>
>>> On Jun 21, 2018, at 23:07, Jungtaek Lim <kabh...@gmail.com> wrote:
>>>
>>> I'm referring to 2.4.0-SNAPSHOT (not sure which commit), and it properly
>>> returns the input rate.
>>>
>>> $ tail -F /tmp/spark-trial-metric/local-1529640063554.driver.spark.streaming.counts.inputRate-total.csv
>>> t,value
>>> 1529640073,0.0
>>> 1529640083,0.9411272613196695
>>> 1529640093,0.9430996541967934
>>> 1529640103,1.0606060606060606
>>> 1529640113,0.9997000899730081
>>>
>>> Could you add a StreamingQueryListener and check the values of sources ->
>>> numInputRows, inputRowsPerSecond, and processedRowsPerSecond? They should
>>> provide some valid numbers.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Fri, Jun 22, 2018, at 11:49 AM, Dhruv Kumar <gargdhru...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I was trying to measure the performance metrics for Spark Structured
>>> Streaming, but I am unable to see any data in the metrics log files. My
>>> input source is the Rate source
>>> (https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets),
>>> which generates data at a specified number of rows per second. Here are
>>> the links to my code
>>> (https://gist.github.com/kudhru/e1ce6b3f399c546be5eeb1f590087992) and
>>> metrics.properties
>>> (https://gist.github.com/kudhru/5d8a8f4d53c766e9efad4de2ae9b82d6).
>>>
>>> When I run the above code using spark-submit, I see that the metrics logs
>>> (for example, run_1.driver.spark.streaming.aggregates.inputRate-total.csv)
>>> are created under the specified directory, but most of the values are 0.
>>> Below is a portion of the inputRate-total.csv file:
>>>
>>> 1529634585,0.0
>>> 1529634595,0.0
>>> 1529634605,0.0
>>> 1529634615,0.0
>>> 1529634625,0.0
>>> 1529634635,0.0
>>> 1529634645,0.0
>>> 1529634655,0.0
>>> 1529634665,0.0
>>> 1529634675,0.0
>>> 1529634685,0.0
>>> 1529634695,0.0
>>> 1529634705,0.0
>>> 1529634715,0.0
>>>
>>> Any reason why this might be happening? Happy to share more information
>>> if that helps.
>>>
>>> Thanks
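[Editor's note] For reference, the metrics.properties change mentioned in the thread (writing every second instead of every 10 seconds) would look roughly like this for Spark's built-in CsvSink. The actual file is only linked as a gist, so this is a sketch, and the output directory is a placeholder:

```properties
# Write all metrics to CSV files once per second (Spark's default period is 10).
*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink
*.sink.csv.period=1
*.sink.csv.unit=seconds
*.sink.csv.directory=/tmp/spark-metrics
```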
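[Editor's note] A quick way to quantify the intermittent zeros discussed in the thread is to tally zero, NaN, and non-zero samples in the `t,value` CSV files that Spark's CSV metrics sink writes. A minimal sketch in plain Python, using an abridged copy of the inputRate-total.csv snippet from the thread (the parsing assumes the two-column format shown above):

```python
import math

# Abridged sample in the "t,value" format from the thread's
# inputRate-total.csv snippet.
csv_text = """t,value
1529645042,0.0
1529645045,NaN
1529645046,88967.97153024911
1529645047,100200.4008016032
1529645049,0.0
"""

def summarize(text):
    """Count zero, NaN, and non-zero samples in a t,value metrics file."""
    zeros = nans = nonzero = 0
    for line in text.strip().splitlines()[1:]:  # skip the "t,value" header
        _, value = line.split(",")
        v = float(value)
        if math.isnan(v):
            nans += 1
        elif v == 0.0:
            zeros += 1
        else:
            nonzero += 1
    return zeros, nans, nonzero

print(summarize(csv_text))  # → (2, 1, 2)
```

Run against the full file, this makes it easy to see whether non-zero readings cluster around batch completions or are spread evenly.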
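[Editor's note] One plausible reading of the mostly-zero CSV rows (an assumption on my part, not confirmed anywhere in this thread) is a sampling mismatch: the sink polls the input-rate gauge on its own fixed schedule, while the gauge only carries a non-zero value around the moments a batch reports progress. A toy Python simulation of that effect; all timings and rates here are made up:

```python
# Toy model, not Spark code: a gauge that holds a non-zero rate only for the
# second in which a batch reports progress, polled once per second by a sink.
progress_updates = {5: 88967.9, 35: 93808.6}  # second -> reported rows/sec (made up)

# The sink polls the gauge once per second for 40 seconds.
samples = [progress_updates.get(t, 0.0) for t in range(40)]
nonzero = [s for s in samples if s > 0.0]

# Most polls land between progress updates and record 0.0.
print(len(samples), len(nonzero))  # → 40 2
```

Under this model, a 1-second poll against infrequent batch completions yields exactly the pattern in the thread: long runs of 0.0 punctuated by isolated large values.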