Github user zd-project commented on the issue:

    https://github.com/apache/storm/pull/2710

New supervisor level metrics:

- [ ] Worker Kill/Restart Statistics
  - [x] Kill Count by Category - assignment change/HB too old/Heap space (memory limit?)
    - [x] blob change?
  - [ ] Worker Suicide Count - category: internal error or Assignment Change
    - [x] Implemented based on the running status of the container's main process. Does not actually reflect the suicide count because it counts normal exits as well.
  - [x] Worker idle period
    - The metric records how long machines spend in each state (as a histogram) and how many times they transition into or out of a given state.
  - [x] Time to Actually Kill worker (from the supervisor identifying the need to the actual change in the state of the worker)
    - (This is only an estimate; its accuracy is affected by SleepTime.)
  - [x] Time to start worker for topology from reading assignment for the first time.
  - [x] Worker cleanup time
- [x] Supervisor Level Metrics:
  - [x] Supervisor restart Count
    - Simply reports every time it restarts.
- [x] Blobstore (Request to download time)
  - [x] download time individual blob (inside localizer): from the localizer getting the request, through the actual download, to the HDFS request finishing
    - I assume this to be [the complete process] from initiating the download to committing the download to the local blob cache and informing the relevant workers.
  - [x] download rate individual blob (inside localizer)
    - This tracks the actual download rate of a blob retrieval, in MB/s.
  - [x] supervisor localizer thread blob download - how long (outside localizer)
    - I put this inside the async localizer, as it turns out to be better suited for the purpose. This tracks the time for a topology blob download request to be completely processed.
    - [x] Blob update is also considered.
  - [x] Blobstore Update due to Version change Count

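
For illustration only, here is a minimal sketch of how the "Worker idle period" item above (time spent in each state plus transition counts) could be tracked. It is not the code in this PR: the class `StateMetrics`, its `State` enum, and all method names are hypothetical, and timestamps are passed in explicitly so the behavior is deterministic. Raw duration samples stand in for a real histogram, which would presumably come from a metrics library.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tracker: per-state duration samples and transition counts. */
class StateMetrics {
    /** Assumed state names for illustration; the supervisor's real states may differ. */
    enum State { EMPTY, WAITING_FOR_BLOBS, RUNNING, KILL }

    // Duration samples (ms) per state; a real implementation would likely use a histogram.
    private final Map<State, List<Long>> durationsMs = new EnumMap<>(State.class);
    private final Map<State, Integer> transitionsInto = new EnumMap<>(State.class);
    private final Map<State, Integer> transitionsOutOf = new EnumMap<>(State.class);

    private State current;
    private long enteredAtMs;

    StateMetrics(State initial, long nowMs) {
        this.current = initial;
        this.enteredAtMs = nowMs;
    }

    /** Records the time spent in the current state, then moves to the next one. */
    void transition(State next, long nowMs) {
        if (next == current) {
            return; // staying in the same state is not counted as a transition
        }
        durationsMs.computeIfAbsent(current, s -> new ArrayList<>())
                   .add(nowMs - enteredAtMs);
        transitionsOutOf.merge(current, 1, Integer::sum);
        transitionsInto.merge(next, 1, Integer::sum);
        current = next;
        enteredAtMs = nowMs;
    }

    /** Completed stays in the given state, in milliseconds. */
    List<Long> durations(State s) {
        return durationsMs.getOrDefault(s, Collections.emptyList());
    }

    int into(State s) {
        return transitionsInto.getOrDefault(s, 0);
    }

    int outOf(State s) {
        return transitionsOutOf.getOrDefault(s, 0);
    }
}
```

In this sketch a caller would invoke `transition(next, System.currentTimeMillis())` on each state change. Because the supervisor only observes state on its polling loop, durations measured this way inherit the same SleepTime-bounded inaccuracy noted for the kill-time estimate.
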
---