Jake Maes created SAMZA-963:
-------------------------------

             Summary: Add timers to help identify performance issues with KV stores and producers.
                 Key: SAMZA-963
                 URL: https://issues.apache.org/jira/browse/SAMZA-963
             Project: Samza
          Issue Type: Improvement
            Reporter: Jake Maes


We have good timing metrics for many of the primary actions in the event loop:
* Choose
  * Deserialization
  * Poll
* Process
* Window
* Commit

I've noticed a few things while analyzing job performance at LinkedIn:
1. We can usually identify problems in Choose using the sub metrics for 
Deserialization and Poll. I don't think any work needs to be done here.

2. Slowness in Process or Window is usually caused by business logic (e.g. side 
calls to remote DBs), but it can also come from the KV Store itself (e.g. 
"stalls" in the case of RocksDB). 

3. Slowness in Commit can be caused by slowness flushing the stores or 
producers. It can also come from checkpointing. 

#2 would be easier to diagnose if we had timers around all the main KV Store 
operations, including get, put, delete, and the batch operations. Then we could 
isolate KV Store performance from business logic performance. 
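
As a rough sketch of what per-operation timers could look like (plain Java, not 
the actual Samza code; NanoTimer and TimedKeyValueStore are hypothetical names, 
and an in-memory map stands in for the real store):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

/** Hypothetical stand-in for a metrics timer; not Samza's own Timer class. */
class NanoTimer {
    private final AtomicLong count = new AtomicLong();
    private final AtomicLong totalNs = new AtomicLong();

    /** Records one observation of the given duration in nanoseconds. */
    void update(long durationNs) {
        count.incrementAndGet();
        totalNs.addAndGet(durationNs);
    }

    long count() { return count.get(); }
    long totalNs() { return totalNs.get(); }
}

/** Hypothetical wrapper that times get, put and delete separately. */
class TimedKeyValueStore<K, V> {
    // Assumption: a HashMap stands in for the underlying store (e.g. RocksDB).
    private final Map<K, V> store = new HashMap<>();
    final NanoTimer getNs = new NanoTimer();
    final NanoTimer putNs = new NanoTimer();
    final NanoTimer deleteNs = new NanoTimer();

    /** Runs op and records its elapsed time, even if op throws. */
    private static <T> T time(NanoTimer timer, Supplier<T> op) {
        long start = System.nanoTime();
        try {
            return op.get();
        } finally {
            timer.update(System.nanoTime() - start);
        }
    }

    V get(K key) { return time(getNs, () -> store.get(key)); }
    void put(K key, V value) { time(putNs, () -> store.put(key, value)); }
    void delete(K key) { time(deleteNs, () -> store.remove(key)); }
}
```

With one timer per operation, time spent in the store can be subtracted from 
the Process or Window time to see how much is left for business logic.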

#3 would be easier to diagnose if we had timers around all the flushes. 
Specifically, I think we should add a "flush-ns" metric to the 
KeyValueStoreMetrics and update it from each of the stores. Also, I noticed that 
KafkaSystemProducerMetrics has a "flush-ns" metric but none of the 
KafkaSystemProducerMetrics are actually emitted. We should find out why. 
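
To illustrate how per-flush timers would break down Commit time, here is a 
minimal sketch (plain Java; CommitFlushTimings and the "<name>-flush-ns" naming 
are assumptions for illustration, not existing Samza classes or metric names):

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: records how long each named flush takes. */
class CommitFlushTimings {
    private final Map<String, Long> flushNs = new LinkedHashMap<>();

    /** Runs the flush and adds its elapsed nanoseconds under "<name>-flush-ns". */
    void timeFlush(String name, Runnable flush) {
        long start = System.nanoTime();
        try {
            flush.run();
        } finally {
            // Accumulate so that repeated flushes of the same component add up.
            flushNs.merge(name + "-flush-ns", System.nanoTime() - start, Long::sum);
        }
    }

    /** Returns a copy of the accumulated timings, in insertion order. */
    Map<String, Long> snapshot() {
        return new LinkedHashMap<>(flushNs);
    }
}
```

Calling timeFlush once per store and once per producer during a commit would 
show which flush dominates, with the remainder attributable to checkpointing.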

To summarize, this ticket is to add metrics around all KV Store operations and 
fix the KafkaSystemProducerMetrics. 

Related work: SAMZA-449


