Hi Sachin,

Now that Confluent Platform 3.2 is out, we also have some more documentation 
on this here: http://docs.confluent.io/3.2.0/streams/monitoring.html. We added 
a note on how to add other metrics.

Yeah, your calculation of the poll time makes sense. The important metrics are 
the “info”-level ones, which are on by default. However, for stateful 
applications, if you suspect that the state stores might be the bottleneck, 
you might want to collect those metrics too. 
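
In case it helps, here is a minimal sketch of switching the metrics recording 
level to DEBUG so that the state-store (RocksDB) metrics get recorded as well; 
the application id, broker address and topic names below are just placeholders:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStreamBuilder;

public class DebugMetricsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // "info"-level metrics are recorded by default; DEBUG also records the
        // per-state-store metrics (config name as in the monitoring docs above)
        props.put("metrics.recording.level", "DEBUG");

        KStreamBuilder builder = new KStreamBuilder();
        builder.stream("input-topic").to("output-topic"); // placeholder pass-through topology

        KafkaStreams streams = new KafkaStreams(builder, props);
        streams.start();
    }
}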

On the benchmarks, the ones called “processstreamwithstatestore” and “count” 
are the closest to a benchmark of RocksDB with the default configs. The first 
writes each record to RocksDB, while the second performs simple aggregates 
(reads and writes from/to RocksDB). 

We might need to add more benchmarks here; it would be great to get some ideas 
and help from the community, e.g., a pure RocksDB benchmark that doesn’t go 
through Streams at all. 
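
For instance, a rough standalone timing loop using the RocksDB Java API 
(rocksdbjni) directly, outside of Streams, could look like the sketch below; 
the store path, record count and value size are arbitrary placeholders, and it 
makes no attempt to mirror the Streams store configuration:

import java.nio.charset.StandardCharsets;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class PureRocksDbBench {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        Options options = new Options().setCreateIfMissing(true);
        RocksDB db = RocksDB.open(options, "/tmp/rocksdb-bench"); // placeholder path

        int n = 1_000_000;            // arbitrary record count
        byte[] value = new byte[100]; // arbitrary value size

        // time n puts
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            db.put(("key-" + i).getBytes(StandardCharsets.UTF_8), value);
        }
        long putNanos = System.nanoTime() - start;

        // time n gets over the same keys
        start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            db.get(("key-" + i).getBytes(StandardCharsets.UTF_8));
        }
        long getNanos = System.nanoTime() - start;

        System.out.printf("put: %.0f ops/s, get: %.0f ops/s%n",
                n / (putNanos / 1e9), n / (getNanos / 1e9));

        db.close();
    }
}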

Could you open a JIRA on the name issue please? As an “improvement”.

Thanks
Eno



> On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
> 
> Hi,
> I had checked the monitoring docs, but could not figure out which metrics
> are the important ones.
> 
> Also, I am mainly looking at the average time spent between two successive
> poll requests.
> Can I say that the average time between two poll requests is the sum of
> 
> commit + poll + process + punctuate (latency-avg)?
> 
> 
> Also, I checked the benchmark test results but could not find any
> information on RocksDB metrics for fetch and put operations.
> Is there any benchmark for these, or can something be said about their
> performance based on the values in my previous mail?
> 
> 
> Lastly, can we get some help with names like
> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 and have
> a more standard thread name like new-advice-1-StreamThread-1 (as in version
> 0.10.1.1), so we can log these metrics as part of our cron jobs?
> 
> Thanks
> Sachin
> 
> 
> 
> On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> 
>> Hi Sachin,
>> 
>> The new streams metrics are now documented at
>> https://kafka.apache.org/documentation/#kafka_streams_monitoring. Note that
>> not all of them are turned on by default.
>> 
>> We have several benchmarks that run nightly to monitor streams
>> performance. They all stem from the SimpleBenchmark.java benchmark. In
>> addition, their results are published nightly at http://testing.confluent.io
>> (e.g., under the trunk results). E.g., looking at today's results:
>> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html
>> (if you search for "benchmarks.streams") you'll see results from a series
>> of benchmarks, ranging from simply consuming, to simple topologies with a
>> source and sink, to joins and count aggregates. These run on AWS nightly,
>> but you can also run them manually on your setup.
>> 
>> In addition, programmatically the code can check the KafkaStreams.state()
>> and register listeners for when the state changes. For example, the state
>> can change from "running" to "rebalancing".
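>> 
>> For example, a minimal sketch of registering such a listener (this assumes
>> an already-constructed KafkaStreams instance called "streams"; the println
>> calls are just placeholders for whatever alerting or logging you use):
>> 
>> streams.setStateListener(new KafkaStreams.StateListener() {
>>     @Override
>>     public void onChange(final KafkaStreams.State newState,
>>                          final KafkaStreams.State oldState) {
>>         // e.g. flag when the instance drops from RUNNING into REBALANCING
>>         System.out.println("state change: " + oldState + " -> " + newState);
>>     }
>> });
>> System.out.println("current state: " + streams.state());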
>> 
>> It is likely we'll need more metrics moving forward, and it would be great
>> to get feedback from the community.
>> 
>> 
>> Thanks
>> Eno
>> 
>> 
>> 
>> 
>>> On 2 Mar 2017, at 11:54, Sachin Mittal <sjmit...@gmail.com> wrote:
>>> 
>>> Hello All,
>>> I had a few questions regarding the monitoring of a Kafka Streams
>>> application, and which metrics are important to collect in our case.
>>> 
>>> Just a brief overview: we have a single-threaded application (0.10.1.1)
>>> reading from a single-partition topic, and it is working fine.
>>> Then we have the same application (using 0.10.2.0), multi-threaded with 4
>>> threads per machine in a 3-machine cluster setup, reading from the same but
>>> now partitioned topic (12 partitions).
>>> Thus we have each thread processing a single partition, the same case as
>>> the earlier one.
>>> 
>>> The new setup also works fine in steady state, but under load it somehow
>>> triggers frequent rebalances, and then we run into all sorts of issues, like
>>> a stream thread dying due to a CommitFailedException or entering a deadlock
>>> state.
>>> After a while we restart all the instances, then it works fine for a while,
>>> and again we get the same problem, and so it goes on.
>>> 
>>> 1. So, just for monitoring: when the first thread fails, what would be some
>>> important metrics to collect to get a sense of what's going on?
>>> 
>>> 2. Is there any metric that gives the time elapsed between successive poll
>>> requests, so we can monitor that?
>>> 
>>> Also, I did monitor the RocksDB put and fetch times for these two instances,
>>> and here is the output I get:
>>> 0.10.1.1
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-avg-latency-ms
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>> 206431.7497615029
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-avg-latency-ms
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>> 2595394.2746129474
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-qps
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>> 232.86299499317252
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-qps
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
>>> 373.61071016166284
>>> 
>>> The same values for 0.10.2.0:
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-latency-avg
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 1199859.5535022356
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-latency-avg
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 3679340.80748852
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-rate
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 56.134778706069184
>>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-rate
>>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
>>> 136.10721427931827
>>> 
>>> I notice that the results for 0.10.2.0 are much worse than those for 0.10.1.1.
>>> 
>>> I would like to know:
>>> 1. Is there any benchmark on RocksDB for the rate/latency at which it should
>>> be doing put/fetch operations?
>>> 
>>> 2. What could be the cause of the inferior numbers in 0.10.2.0? Is it because
>>> this application is also running three other threads doing the same thing?
>>> 
>>> 3. Also, what's with the name
>>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1?
>>>    I wanted to put this as part of my cron job, so why can't we have a
>>> simpler name like we have in 0.10.1.1, so it is easy to write the script?
>>> 
>>> Thanks
>>> Sachin
>> 
>> 
