Github user zd-project commented on the issue:
https://github.com/apache/storm/pull/2710
New supervisor level metrics:
- [ ] Worker Kill/Restart Statistics
- [x] Kill Count by Category - assignment change/HB too old/Heap space
(memory limit?)
- [x] blob change?
- [ ] Worker Suicide Cnt - category: internal error or Assignment
Change
- [x] Implemented based on the running status of the container's
main process. This does not actually reflect the suicide count, because it
counts normal exits as well.
- [x] Worker idle period
- The metric records how long a machine spends in each state
(as a histogram) and how many times it transitions into or out of a given state.
- [x] Time to Actually Kill worker (from the supervisor identifying the need
to the actual change in the state of the worker) - (This is only an estimate;
its accuracy is affected by SleepTime)
- [x] Time to start worker for topology from reading assignment for the
first time.
- [x] Worker cleanup time
- [x] Supervisor Level Metrics:
- [x] Supervisor restart Count
- Simply reports every time it restarts.
- [x] Blobstore (Request to download time)
- [x] download time of an individual blob (inside localizer):
from the localizer getting the request to the actual HDFS download request finishing
- I assume this to be [the complete process], from
initiating the download to committing it to the local blob cache and informing
the relevant workers
- [x] download rate individual blob (inside localizer)
- This tracks the actual download rate of a blob
retrieval, in MB/s
- [x] supervisor localizer thread blob download - how long
(outside localizer)
- I put this inside the async localizer, as it turned out to
be better suited for the purpose. This tracks the time for a topology blob
download request to be completely processed.
- [x] Blob update is also considered.
- [x] Blobstore Update due to Version change Cnts
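
For readers following along, here is a minimal sketch of the idea behind the "Worker idle period" item above: record how long is spent in each state and count transitions into and out of each state. It is not the code in this PR. The class name `StateDurationTracker`, the `State` enum values, and the injected millisecond clock are all hypothetical, chosen only for illustration, and it uses plain JDK collections rather than any metrics library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.LongSupplier;

// Illustrative only: tracks per-state durations and transition counts.
final class StateDurationTracker {
    // Hypothetical states; not taken from the supervisor's actual code.
    public enum State { EMPTY, WAITING_FOR_BLOBS, RUNNING, KILL }

    private final LongSupplier clockMs; // injected so tests can control time
    private final Map<State, List<Long>> samplesMs = new EnumMap<>(State.class);
    private final Map<State, Integer> entered = new EnumMap<>(State.class);
    private final Map<State, Integer> exited = new EnumMap<>(State.class);
    private State current;
    private long enteredAtMs;

    StateDurationTracker(State initial, LongSupplier clockMs) {
        this.clockMs = clockMs;
        this.current = initial;
        this.enteredAtMs = clockMs.getAsLong();
        entered.merge(initial, 1, Integer::sum);
    }

    // Records the time spent in the current state, then moves to the next one.
    void transitionTo(State next) {
        if (next == current) {
            return; // staying in the same state is not a transition
        }
        long nowMs = clockMs.getAsLong();
        samplesMs.computeIfAbsent(current, s -> new ArrayList<>()).add(nowMs - enteredAtMs);
        exited.merge(current, 1, Integer::sum);
        entered.merge(next, 1, Integer::sum);
        current = next;
        enteredAtMs = nowMs;
    }

    int timesEntered(State s) { return entered.getOrDefault(s, 0); }

    int timesExited(State s) { return exited.getOrDefault(s, 0); }

    // Raw duration samples for a state; a histogram would be built from these.
    List<Long> durationsMs(State s) {
        return samplesMs.getOrDefault(s, Collections.emptyList());
    }

    public static void main(String[] args) {
        long[] fakeNowMs = {0L}; // fake clock for the demo
        StateDurationTracker t = new StateDurationTracker(State.EMPTY, () -> fakeNowMs[0]);
        fakeNowMs[0] = 100L;
        t.transitionTo(State.RUNNING);
        fakeNowMs[0] = 350L;
        t.transitionTo(State.KILL);
        System.out.println("RUNNING durations (ms): " + t.durationsMs(State.RUNNING));
    }
}
```

In a real supervisor the per-state samples would presumably be fed into a histogram from whatever metrics library is in use instead of being kept in an in-memory list; the list here just keeps the sketch self-contained.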
---