Re: Need some help in identifying some important metrics to monitor for streams

Sachin Mittal Sat, 04 Mar 2017 23:15:17 -0800

Yes, setting the client id works. Now we are able to collect the metrics as
part of our cron job.
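
For anyone following along, here is a minimal sketch of the naming that this
setup relies on. It only uses java.util.Properties: the keys "application.id"
and "client.id" are the string values behind StreamsConfig.APPLICATION_ID_CONFIG
and StreamsConfig.CLIENT_ID_CONFIG. The class name, the helper names and the
host/port naming scheme are made up for illustration. The metrics name rule
(clientId + "-" + threadName) and the MBean name form are taken from the
messages quoted below.

```java
import java.util.Properties;

public class StreamsClientId {

    /** Hypothetical scheme: application name plus host and port, joined with hyphens. */
    public static String clientId(String appName, String host, int port) {
        return appName + "-" + host + "-" + port;
    }

    /** Metrics name as described in this thread: clientId + "-" + threadName. */
    public static String metricsName(String clientId, String threadName) {
        return clientId + "-" + threadName;
    }

    /** MBean name in the same form as the JMX queries quoted in this thread. */
    public static String rocksDbWindowMBean(String metricsName) {
        return "kafka.streams:type=stream-rocksdb-window-metrics,client-id=" + metricsName;
    }

    /** Plain properties that would be handed to the Streams configuration. */
    public static Properties streamsProperties(String applicationId, String clientId) {
        Properties props = new Properties();
        props.setProperty("application.id", applicationId); // StreamsConfig.APPLICATION_ID_CONFIG
        props.setProperty("client.id", clientId);           // StreamsConfig.CLIENT_ID_CONFIG
        return props;
    }

    public static void main(String[] args) {
        // "host1" and 9999 are placeholder values, not taken from a real deployment.
        String id = clientId("new-advice", "host1", 9999);
        Properties props = streamsProperties("new-advice", id);
        String name = metricsName(props.getProperty("client.id"), "StreamThread-1");
        System.out.println(rocksDbWindowMBean(name));
    }
}
```

With a fixed client id, the cron script can hard-code the MBean name instead of
having to discover a generated UUID first.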


One additional question I have is about the metrics documented at
http://kafka.apache.org/documentation.html#kafka_streams_monitoring

I am monitoring the commit latency and the process latency.
Commit latency is usually around 1000 ms and process latency is usually
around 1 ms, so process latency is three orders of magnitude lower than
commit latency.

This makes sense because in our commit phase, i.e. forEach, we do some
external DB operations.

I just wanted to understand: if a single poll request fetches n records, do
the above values indicate the time computed for all n records or for just a
single record?

Or is the total average time to process these records, before making
another poll request, equal to n * process latency + commit latency?

Basically, we just want to know how often poll is getting called, to see
how close the interval is to MAX_POLL_INTERVAL_MS_CONFIG.
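
To make the two readings concrete, here is a small sketch that computes the
time between polls under each of them. It is only arithmetic on the numbers
above; which reading matches what Streams actually reports is exactly the open
question. The class and method names are made up, and the batch size n and the
max poll interval used in main are placeholders, not values from our setup.

```java
public class PollIntervalEstimate {

    /** Reading A: process latency is per record, so n records cost n * processMs. */
    public static double perRecordReadingMs(int n, double processMs, double commitMs) {
        return n * processMs + commitMs;
    }

    /** Reading B: process latency already covers all records of one poll. */
    public static double perBatchReadingMs(double processMs, double commitMs) {
        return processMs + commitMs;
    }

    /** Fraction of the max poll interval consumed by one estimated loop iteration. */
    public static double budgetUsed(double estimateMs, long maxPollIntervalMs) {
        return estimateMs / maxPollIntervalMs;
    }

    public static void main(String[] args) {
        int n = 500;                      // placeholder batch size
        double processMs = 1.0;           // observed process latency (approx.)
        double commitMs = 1000.0;         // observed commit latency (approx.)
        long maxPollIntervalMs = 300_000; // placeholder for max.poll.interval.ms

        double a = perRecordReadingMs(n, processMs, commitMs);
        double b = perBatchReadingMs(processMs, commitMs);
        System.out.println("Reading A: " + a + " ms, budget used " + budgetUsed(a, maxPollIntervalMs));
        System.out.println("Reading B: " + b + " ms, budget used " + budgetUsed(b, maxPollIntervalMs));
    }
}
```

Under reading A the estimate grows linearly with the number of records fetched
per poll, so that is the case where the batch size would matter for staying
under the limit.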

Thanks
Sachin


On Sun, Mar 5, 2017 at 11:42 AM, Guozhang Wang <wangg...@gmail.com> wrote:

> That is right, since the client-id is used in the metrics name, which must be
> distinguishable.
>
> https://kafka.apache.org/documentation/#streamsconfigs (I think we can
> improve on the explanation of the client.id config)
>
> A common client-id could contain the machine's host-port; of course, if you
> have more than one Streams instance running on the same machine that won't
> work and you need to consider using more information.
>
> Again, the client-id config is not required, and when not specified Streams
> will use a UUID suffix to achieve uniqueness, but as you observed it is
> less human-readable for monitoring.
>
>
> Guozhang
>
> On Fri, Mar 3, 2017 at 5:18 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
>
> > So if I am running my stream across a cluster of different machines,
> > each machine should have a different client id?
> >
> > On 4 Mar 2017 12:36 a.m., "Guozhang Wang" <wangg...@gmail.com> wrote:
> >
> > > Sachin,
> > >
> > > The reason that you got metrics name as
> > >
> > > new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1
> > >
> > >
> > > is that you did not set the "CLIENT_ID_CONFIG" in your app, so
> > > KafkaStreams has to use a default combination of "appID:
> > > new-part-advice"-"processID: a UUID to guarantee uniqueness across
> > > machines" as its clientId.
> > >
> > >
> > > As for metricsName, it is always set as "clientId + "-" + threadName",
> > > where "StreamThread-1" is your threadName, which is unique WITHIN the JVM;
> > > that is why we still need the globally unique clientId to tell the
> > > threads apart.
> > >
> > > I just checked the source code and this logic was not changed from 0.10.1
> > > to 0.10.2, so I guess you set your clientId as "new-advice-1" as well in
> > > 0.10.1?
> > >
> > >
> > > Guozhang
> > >
> > >
> > >
> > > On Fri, Mar 3, 2017 at 4:02 AM, Eno Thereska <eno.there...@gmail.com>
> > > wrote:
> > >
> > > > Hi Sachin,
> > > >
> > > > Now that the confluent platform 3.2 is out, we also have some more
> > > > documentation on this here:
> > > > http://docs.confluent.io/3.2.0/streams/monitoring.html. We added a note
> > > > on how to add other metrics.
> > > >
> > > > Yeah, your calculation on poll time makes sense. The important metrics
> > > > are the “info” ones that are on by default. However, for stateful
> > > > applications, if you suspect that state stores might be bottlenecking,
> > > > you might want to collect those metrics too.
> > > >
> > > > On the benchmarks, the ones called “processstreamwithstatestore” and
> > > > “count” are the closest to a benchmark of RocksDb with the default
> > > > configs. The first writes each record to RocksDb, while the second
> > > > performs simple aggregates (reads and writes from/to RocksDb).
> > > >
> > > > We might need to add more benchmarks here; it would be great to get some
> > > > ideas and help from the community. E.g., a pure RocksDb benchmark that
> > > > doesn’t go through streams at all.
> > > >
> > > > Could you open a JIRA on the name issue please? As an “improvement”.
> > > >
> > > > Thanks
> > > > Eno
> > > >
> > > >
> > > >
> > > > > On Mar 2, 2017, at 6:00 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
> > > > >
> > > > > Hi,
> > > > > I had checked the monitoring docs, but could not figure out which
> > > > > metrics are the important ones.
> > > > >
> > > > > Also, mainly I am looking at the average time spent between 2
> > > > > successive poll requests.
> > > > > Can I say that the average time between 2 poll requests is the sum of
> > > > >
> > > > > commit + poll + process + punctuate (latency-avg)?
> > > > >
> > > > >
> > > > > Also, I checked the benchmark test results but could not find any
> > > > > information on rocksdb metrics for fetch and put operations.
> > > > > Is there any benchmark for these, or, based on the values in my previous
> > > > > mail, can something be said about its performance?
> > > > >
> > > > >
> > > > > Lastly, can we get some help on names like
> > > > > new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 and
> > > > > have a more standard thread name like new-advice-1-StreamThread-1 (as in
> > > > > version 10.1.1), so we can log these metrics as part of our cron jobs?
> > > > >
> > > > > Thanks
> > > > > Sachin
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Mar 2, 2017 at 9:31 PM, Eno Thereska <eno.there...@gmail.com> wrote:
> > > > >
> > > > >> Hi Sachin,
> > > > >>
> > > > >> The new streams metrics are now documented at
> > > > >> https://kafka.apache.org/documentation/#kafka_streams_monitoring. Note
> > > > >> that not all of them are turned on by default.
> > > > >>
> > > > >> We have several benchmarks that run nightly to monitor streams
> > > > >> performance. They all stem from the SimpleBenchmark.java benchmark. In
> > > > >> addition, their results are published nightly at
> > > > >> http://testing.confluent.io (e.g., under the trunk results). E.g.,
> > > > >> looking at today's results:
> > > > >> http://confluent-kafka-system-test-results.s3-us-west-2.amazonaws.com/2017-03-02--001.1488449554--apache--trunk--ef92bb4/report.html
> > > > >> (if you search for "benchmarks.streams") you'll see results from a
> > > > >> series of benchmarks, ranging from simply consuming, to simple topologies
> > > > >> with a source and sink, to joins and count aggregate. These run on AWS
> > > > >> nightly, but you can also run them manually on your setup.
> > > > >>
> > > > >> In addition, programmatically the code can check the
> > > > >> KafkaStreams.state() and register listeners for when the state changes.
> > > > >> For example, the state can change from "running" to "rebalancing".
> > > > >>
> > > > >> It is likely we'll need more metrics moving forward, and it would be
> > > > >> great to get feedback from the community.
> > > > >>
> > > > >>
> > > > >> Thanks
> > > > >> Eno
> > > > >>
> > > > >>
> > > > >>
> > > > >>
> > > > >>> On 2 Mar 2017, at 11:54, Sachin Mittal <sjmit...@gmail.com> wrote:
> > > > >>>
> > > > >>> Hello All,
> > > > >>> I have a few questions regarding monitoring of a Kafka Streams
> > > > >>> application and which important metrics we should collect in our case.
> > > > >>>
> > > > >>> Just a brief overview: we have a single-threaded application (0.10.1.1)
> > > > >>> reading from a single-partition topic, and it is working fine.
> > > > >>> Then we have the same application (using 0.10.2.0), multi-threaded with 4
> > > > >>> threads per machine on a 3-machine cluster, reading from the same topic,
> > > > >>> but partitioned (12 partitions).
> > > > >>> Thus each thread processes a single partition, the same case as the
> > > > >>> earlier one.
> > > > >>>
> > > > >>> The new setup also works fine in steady state, but under load it somehow
> > > > >>> triggers frequent rebalances, and then we run into all sorts of issues,
> > > > >>> like stream threads dying due to CommitFailedException or entering a
> > > > >>> deadlock state.
> > > > >>> After a while we restart all the instances; then it works fine for a
> > > > >>> while, we get the same problem again, and so it goes on.
> > > > >>>
> > > > >>> 1. So just to monitor, e.g. when the first thread fails, what would be
> > > > >>> some important metrics we should be collecting to get some sense of
> > > > >>> what's going on?
> > > > >>>
> > > > >>> 2. Is there any metric that tells the time elapsed between successive
> > > > >>> poll requests, so we can monitor that?
> > > > >>>
> > > > >>> Also, I did monitor rocksdb put and fetch times for these 2 instances,
> > > > >>> and here is the output I get:
> > > > >>> 0.10.1.1
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-avg-latency-ms
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > > >>> 206431.7497615029
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-avg-latency-ms
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > > >>> 2595394.2746129474
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-put-qps
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > > >>> 232.86299499317252
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1 key-table-fetch-qps
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-advice-1-StreamThread-1:
> > > > >>> 373.61071016166284
> > > > >>>
> > > > >>> The same values for 0.10.2.0:
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-latency-avg
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > > >>> 1199859.5535022356
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-latency-avg
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > > >>> 3679340.80748852
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-put-rate
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > > >>> 56.134778706069184
> > > > >>> $>get -s -b kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1 key-table-fetch-rate
> > > > >>> #mbean = kafka.streams:type=stream-rocksdb-window-metrics,client-id=new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1:
> > > > >>> 136.10721427931827
> > > > >>>
> > > > >>> I notice that the results in 10.2.0 are much worse than those for 10.1.1.
> > > > >>>
> > > > >>> I would like to know:
> > > > >>> 1. Is there any benchmark on rocksdb showing at what rate/latency it
> > > > >>> should be doing put/fetch operations?
> > > > >>>
> > > > >>> 2. What could be the cause of the inferior numbers in 10.2.0? Is it
> > > > >>> because this application is also running three other threads doing the
> > > > >>> same thing?
> > > > >>>
> > > > >>> 3. Also, what's with the name
> > > > >>> new-part-advice-d1094e71-0f59-45e8-98f4-477f9444aa91-StreamThread-1?
> > > > >>> I wanted to put this in my cron job, so why can't we have a simpler
> > > > >>> name like we have in 10.1.1, so it is easy to write the script?
> > > > >>>
> > > > >>> Thanks
> > > > >>> Sachin
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>
>
>
> --
> -- Guozhang
>
