[GitHub] incubator-trafficcontrol pull request #772: Fix Traffic Monitor to startup w...

2017-08-05 Thread rob05c
Github user rob05c commented on a diff in the pull request:


https://github.com/apache/incubator-trafficcontrol/pull/772#discussion_r131534318
  
--- Diff: traffic_monitor_golang/traffic_monitor/threadsafe/polledcaches.go 
---
@@ -103,11 +116,22 @@ func copyCaches(a map[enum.CacheName]struct{}) 
map[enum.CacheName]struct{} {
return b
 }
 
+func copyCachesTime(a map[enum.CacheName]time.Time) 
map[enum.CacheName]time.Time {
+   b := map[enum.CacheName]time.Time{}
+   for k, v := range a {
+   b[k] = v
+   }
+   return b
+}
+
+const PolledBytesPerSecTimeout = time.Second * 60
--- End diff --

Changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-trafficcontrol pull request #772: Fix Traffic Monitor to startup w...

2017-08-03 Thread mtorluemke
Github user mtorluemke commented on a diff in the pull request:


https://github.com/apache/incubator-trafficcontrol/pull/772#discussion_r131261159
  
--- Diff: traffic_monitor_golang/traffic_monitor/threadsafe/polledcaches.go 
---
@@ -103,11 +116,22 @@ func copyCaches(a map[enum.CacheName]struct{}) 
map[enum.CacheName]struct{} {
return b
 }
 
+func copyCachesTime(a map[enum.CacheName]time.Time) 
map[enum.CacheName]time.Time {
+   b := map[enum.CacheName]time.Time{}
+   for k, v := range a {
+   b[k] = v
+   }
+   return b
+}
+
+const PolledBytesPerSecTimeout = time.Second * 60
--- End diff --

"Trouble communicating with a cache" should, by itself, mark the cache as 
unhealthy.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-trafficcontrol pull request #772: Fix Traffic Monitor to startup w...

2017-08-03 Thread rob05c
Github user rob05c commented on a diff in the pull request:


https://github.com/apache/incubator-trafficcontrol/pull/772#discussion_r131190134
  
--- Diff: traffic_monitor_golang/traffic_monitor/threadsafe/polledcaches.go 
---
@@ -103,11 +116,22 @@ func copyCaches(a map[enum.CacheName]struct{}) 
map[enum.CacheName]struct{} {
return b
 }
 
+func copyCachesTime(a map[enum.CacheName]time.Time) 
map[enum.CacheName]time.Time {
+   b := map[enum.CacheName]time.Time{}
+   for k, v := range a {
+   b[k] = v
+   }
+   return b
+}
+
+const PolledBytesPerSecTimeout = time.Second * 60
--- End diff --

It makes me anxious that if we had trouble communicating with a cache which 
was overloaded with traffic for 10s, we'd tell the Router it's online and has 0 
bytes/second and to send it lots of traffic.

It'd take a bit more work to make it a config, that object in the code has 
no access to the config right now. I'm not a big fan of adding configs for 
every possible thing, though, every config option is more ops work and 
training. One doesn't seem like a big deal, but they add up. This should be a 
rare case; it just happens that we've allowed people to put a dysfunctional 
'cache' on our CDN.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-trafficcontrol pull request #772: Fix Traffic Monitor to startup w...

2017-08-03 Thread dneuman64
Github user dneuman64 commented on a diff in the pull request:


https://github.com/apache/incubator-trafficcontrol/pull/772#discussion_r131174165
  
--- Diff: traffic_monitor_golang/traffic_monitor/threadsafe/polledcaches.go 
---
@@ -103,11 +116,22 @@ func copyCaches(a map[enum.CacheName]struct{}) 
map[enum.CacheName]struct{} {
return b
 }
 
+func copyCachesTime(a map[enum.CacheName]time.Time) 
map[enum.CacheName]time.Time {
+   b := map[enum.CacheName]time.Time{}
+   for k, v := range a {
+   b[k] = v
+   }
+   return b
+}
+
+const PolledBytesPerSecTimeout = time.Second * 60
--- End diff --

Can we make it a config?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-trafficcontrol pull request #772: Fix Traffic Monitor to startup w...

2017-08-03 Thread mtorluemke
Github user mtorluemke commented on a diff in the pull request:


https://github.com/apache/incubator-trafficcontrol/pull/772#discussion_r131169393
  
--- Diff: traffic_monitor_golang/traffic_monitor/threadsafe/polledcaches.go 
---
@@ -103,11 +116,22 @@ func copyCaches(a map[enum.CacheName]struct{}) 
map[enum.CacheName]struct{} {
return b
 }
 
+func copyCachesTime(a map[enum.CacheName]time.Time) 
map[enum.CacheName]time.Time {
+   b := map[enum.CacheName]time.Time{}
+   for k, v := range a {
+   b[k] = v
+   }
+   return b
+}
+
+const PolledBytesPerSecTimeout = time.Second * 60
--- End diff --

Good concept -- I would pick something closer to 10 seconds, though.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-trafficcontrol pull request #772: Fix Traffic Monitor to startup w...

2017-08-03 Thread rob05c
GitHub user rob05c opened a pull request:

https://github.com/apache/incubator-trafficcontrol/pull/772

Fix Traffic Monitor to startup when cache bytes don't change

This fixes an issue where TM refused to start if a cache's bytes in
astats never change. TM returns 503 until it's polled every cache
twice, and has a BytesPerSecond calculation. The issue was, if a cache
had hard-coded proc.net.dev which never changed, TM has no way to know
the cache simply hasn't updated its stats, and that we won't get a
different stat on the next poll.

This gives TM a timeout, once it successfully polls a cache once,
if the stats don't change in 60s, we start up anyway. It's not
reasonable for a cache to update stats less frequently than a minute.
While guessing isn't ideal, there's no good alternative. We can't
know the cache simply hasn't updated its stats internally; the
only option is a timeout, or never starting.

Incidentally, this problem occurred in production. We had a bad
cache lying to us with hard-coded astats that never updated, causing
TM to never start.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rob05c/incubator-trafficcontrol 
tm-fixstartupwithbadastats

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-trafficcontrol/pull/772.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #772


commit 4fd0b6d4375f8a78254aff970c9bdb696bc56bc5
Author: Robert Butts 
Date:   2017-08-03T14:29:00Z

Fix TM2 to startup when cache bytes don't change

This fixes an issue where TM refused to start if a cache's bytes in
astats never change. TM returns 503 until it's polled every cache
twice, and has a BytesPerSecond calculation. The issue was, if a cache
had hard-coded proc.net.dev which never changed, TM has no way to know
the cache simply hasn't updated its stats, and that we won't get a
different stat on the next poll.

This gives TM a timeout, once it successfully polls a cache once,
if the stats don't change in 60s, we start up anyway. It's not
reasonable for a cache to update stats less frequently than a minute.
While guessing isn't ideal, there's no good alternative. We can't
know the cache simply hasn't updated its stats internally; the
only option is a timeout, or never starting.

Incidentally, this problem occurred in production. We had a bad
cache lying to us with hard-coded astats that never updated, causing
TM to never start.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---