Re: Backup Cache Group Selection

Eric Friedrich (efriedri) Tue, 09 May 2017 18:00:56 -0700

backupList is not planned because the coordinates approach was sufficient

—Eric


On May 9, 2017, at 6:57 AM, Ori Finkelman 
<o...@qwilt.com<mailto:o...@qwilt.com>> wrote:

Hi again,
I understand now that the "backupList" feature does not exist yet.
What is the status of this feature? Is it planned?

Thanks,
Ori

On Mon, May 8, 2017 at 4:18 PM, Ori Finkelman 
<o...@qwilt.com<mailto:o...@qwilt.com>> wrote:

Hi,
Following up on this one: it seems that neither of the CZF attributes
described in this thread, "coordinates" and "backupList", is documented in
the official docs at
http://trafficcontrol.incubator.apache.org/docs/latest/admin/traffic_ops_using.html#the-coverage-zone-file-and-asn-table

Is there a plan to update the documentation? Should I open a JIRA for it?

Thanks,
Ori

On Thu, Mar 30, 2017 at 8:45 PM, Jeff Elsloo <jeff.els...@gmail.com>
wrote:

Yes, that's correct.
--
Thanks,
Jeff


On Thu, Mar 30, 2017 at 11:20 AM, Eric Friedrich (efriedri)
<efrie...@cisco.com> wrote:
Thanks Jeff-
Could I think of it as the following? Echoing back to be sure I
understand...

If there is a lat/long for a cache group in the CZF, any client that
hits that CG should use the CZF lat/long as the client’s lat/long instead
of using geolocation.

For the purposes of finding the closest cache group, the client’s location
(from the CZF as above, or from the Geolocation provider) will be compared
against the location of the caches as configured in the Traffic Ops CG
record?

—Eric


On Mar 30, 2017, at 1:07 PM, Jeff Elsloo <jeff.els...@gmail.com>
wrote:

It could now be considered the "average" of the location of the
clients within that section of the CZF, however, it should be noted
that the addition of the geo coordinates to the CZF is relatively new.
Previously we never had the ability to specify lat/long on those
cachegroups, and we solely relied on those specified in edgeLocations,
meaning that the matches had to be 1:1. Adding the coordinates allowed
us to cover edge cases and miss scenarios and stick to the CZF
whenever possible. Previously, when we had no coordinates and we had a
hit in the CZF but no corresponding hit within the edgeLocations
(health, assignments, etc), we would fall back to the Geolocation
provider.
--
Thanks,
Jeff


On Thu, Mar 30, 2017 at 5:29 AM, John Shen (weifensh)
<weife...@cisco.com> wrote:
Thanks Jeff and Oren for the discussion. I agree now that lat/long
from CZF is the “average” location of clients, and lat/long from Ops is the
location of a certain Cache Group. So it appears to be reasonable to use
them as source and dest to calculate the distance.

Thanks,
John


On 30/03/2017, 6:55 PM, "Oren Shemesh" <or...@qwilt.com> wrote:

  Jeff, having read this conversation more than once, I believe there is a
  misunderstanding regarding the ability to provide coordinates for cache
  groups both in the CZF and in the TO DB.

  Here is a description that I believe may help in understanding the
  current behaviour:

  The coordinates specified in the CZF for a cache group are not supposed
  to be exactly the same as the coordinates in the TO DB for the same cache
  group. This is because they do not represent the location of the caches
  of the group. They represent the (average) location of the clients found
  in the subnets specified for this cache group.

  This, I believe, explains both the behaviour of the code (why the
  coordinates from the CZF are used for the source, but the coordinates
  from the TO DB are used for the various candidate cache groups) and the
  fact that there is a 'duplication'.

  Is this description true?



  On Wed, Mar 29, 2017 at 7:02 PM, Jeff Elsloo <els...@apache.org>
wrote:

The cachegroup settings in the Traffic Ops GUI end up in the
`edgeLocations` section of the CRConfig. This is the source of truth
for where caches are deployed, logically or physically. We do not
provide a means to generate a CZF in Traffic Ops, so it's up to the
end user to craft one to match what is in Traffic Ops.

There are several cases that need to be accounted for where a hit in
the CZF does match what's in `edgeLocations`, but cannot be served
there due to cache health, delivery service health, or delivery
service assignments. The other edge case is a hit where no
`edgeLocation` exists, which again, must be accounted for. Presumably
we have higher fidelity data in our CZF than we would in our
Geolocation provider and we should use it whenever possible.

Think about this: what if you use the same CZF for two configured
CDNs, but one of the two CDNs only has caches deployed to 50% of the
cache groups defined in the CZF. Would we want to use the Geolocation
provider in the event that our source address matches a cachegroup
that does not have any assigned caches? We would ideally have as much
granularity as possible in the CZF, then use that to inform the
decision about which cachegroup should service the request instead of
falling back to a lower fidelity datasource. This is especially true
in the case of RFC 1918 addresses that might appear in one's CZF.

Thanks,
Jeff


On Wed, Mar 29, 2017 at 9:12 AM, John Shen (weifensh)
<weife...@cisco.com> wrote:
Hi Jeff,

Thank you for the detail. I am wondering why there are two sets of
lat/long, i.e. one in the CZF and the other in the Ops GUI Cache Group
settings. To calculate the closest CG when the matched CG in the CZF is not
available, the source lat/long is from the matched CZF entry, and the dest
lat/long is from the Ops settings, which doesn't seem to be consistent. Is
there any reason why TR has this behavior?

Since there are two sets of lat/long in TR, can we just use the lat/long
from the Ops CG settings for everything to get the closest CG, and not care
about the values set in the CZF? At least this would avoid inconsistent
config for lat/long.

Thanks,
John

---Original---
From: "Jeff Elsloo "<els...@apache.org>
Date: 2017/3/29 22:45:12
To: "dev@trafficcontrol.incubator.apache.org"<dev@trafficcontrol.incubator.apache.org>;
Subject: Re: Backup Cache Group Selection

Yes, it's expected behavior. What you're describing sounds like a
cachegroup in the CZF without any corresponding configuration in
Traffic Ops, or a cachegroup with configuration in Traffic Ops, but
with no available caches (DS assignments, health, etc).

Presuming we have configured geolocation coordinates within the CZF,
we know the lat/long of the cachegroup within the CZF that contains
the source address. We can then order our list of cachegroups by
distance from that lat/long, then select the "next best" cache group by
distance and availability. That will be the actual cachegroup to serve the
request; this prevents a miss on the CZF that would normally be routed to
the Geolocation service selected for the DS.
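
As an illustration of the ordering described above, here is a minimal
Python sketch. It is not the Traffic Router code (which is Java); it assumes
great-circle (haversine) distance as the metric, and the names
haversine_km, next_best_cache_group and the "available" flag are
hypothetical. The coordinates are the lab values quoted elsewhere in this
thread; the us-ga-macon entry is assumed to carry the same coordinates in
Traffic Ops as in the CZF.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, long) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def next_best_cache_group(czf_coords, cache_groups):
    """Return the name of the nearest cache group that has available caches.

    czf_coords   -- (lat, long) of the CZF entry that matched the client
    cache_groups -- {name: {"coords": (lat, long), "available": bool}},
                    standing in for edgeLocations plus health/DS assignment
    """
    ordered = sorted(
        cache_groups.items(),
        key=lambda item: haversine_km(czf_coords, item[1]["coords"]),
    )
    for name, group in ordered:
        if group["available"]:
            return name
    return None  # nothing available: the caller falls back to other routing

cache_groups = {
    # Assumed entry: CZF hit with no caches assigned.
    "us-ga-macon": {"coords": (32.7261, -83.6547), "available": False},
    "us-ga-atlanta": {"coords": (34.0362, -84.3207), "available": True},
    "us-ok-oklahomacity": {"coords": (35.4777, -97.5545), "available": True},
    "us-va-nova": {"coords": (38.7922, -77.2136), "available": True},
    "us-ca-sandiego": {"coords": (32.7205, -117.0838), "available": True},
}

# Source is the CZF lat/long of the hit; destinations are the configured CGs.
selected = next_best_cache_group((32.7261, -83.6547), cache_groups)
print(selected)  # us-ga-atlanta
```

The source point comes from the CZF and the candidate points from the
cache group configuration, which mirrors the source/dest split discussed
in this thread.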

We do have a slight gap around logging, and maybe that's part of the
question. What we see in the log is the selected lat/long, not the
source lat/long of the hit, so we can't easily tell when we're in this
case by simply looking at logs. This could be an area of improvement;
however, we'll need to be careful not to clutter the logs with
unnecessary information. In most cases the hit is the selected
cachegroup, so we need to be careful not to just add "source" and
"actual" coordinates to the log, because they'll be identical in most CZF
hit cases.

Thanks,
Jeff


On Wed, Mar 29, 2017 at 7:02 AM, John Shen (weifensh)
<weife...@cisco.com> wrote:
Hi Jeff,

I have just tried the getClosestCacheLocation() logic. It appears the
CZF-matched lat/long does come from the CZF, but the lat/long of the
“closest” Cache Groups is from the configuration in Ops. This means that,
to calculate the distance between the matched CG and the “closest” CG, the
source lat/long is from the CZF, but the dest lat/long is not from the CZF
but from the CG settings in Ops. Is this expected behavior?

Thanks,
John


On 27/01/2017, 10:51 PM, "Jeff Elsloo" <jeff.els...@gmail.com>
wrote:

  Steve: I don't think the patch is required; however, as Eric found,
  without the patch there could be some gaps depending on the scenario.
  That specific scenario revolved around the "next best cache group" not
  having a DS assigned, or a healthy cache with the DS assigned. In that
  case, despite the hits, you would still end up falling through to the
  geolocation provider. The patch addresses that.

  Eric: The rloc field is set via the Geolocation associated with the
  CacheLocation, which ultimately comes from the edgeLocations section
  of the CRConfig. When a CZF lookup is performed inside TR, a hit
  returns a CacheLocation. When caches aren't available within that
  CacheLocation, getClosestCacheLocation() is called, and that's why you
  see the lat/long of the "next best cache group" instead of the actual
  hit's lat/long.

  If we want to have granularity in this situation, we might need to 1)
  create a new ResultType, such as ResultType.CZ_NEXT (or something),
  and/or 2) massage the log format such that we either have the
  original lat/long and the new lat/long in the rloc field, or create a new
  field to save one or the other, such that we log both lat/longs.

  Thoughts? Whatever we decide should go into TC-90 so we can apply the
  proposed patch and improve the logging.
  --
  Thanks,
  Jeff


  On Fri, Jan 27, 2017 at 7:14 AM, Eric Friedrich (efriedri)
  <efrie...@cisco.com> wrote:
The rloc field (short for request location) usually indicates the
geolocation of the client IP.

But here it looks like rloc is reflecting the location of the CG it
ultimately redirected to (response location?).

I would have expected the rloc field to either
 1) be blank (because we never did a lookup from the geo provider)
      or
 2) contain the coordinates of the cache group the CZF hit on
    (in this case us-ga-macon at 32.7261, -83.6547)

—Eric

On Jan 27, 2017, at 8:28 AM, Steve Malenfant <
smalenf...@gmail.com>
wrote:

Jeff,

CZF properly installed: yes
Network address or not: same behavior

But you nailed the API one. There is no cache assigned to us-ga-macon,
which is exactly what I'm testing.

For my testing in the lab I added the following cache groups and assigned
a few caches to each of them:

- us-ga-atlanta 34.0362 -84.3207
- us-ok-oklahomacity 35.4777 -97.5545
- us-va-nova 38.7922 -77.2136
- us-ca-sandiego 32.7205 -117.0838

API:

{"locationByGeo":{"city":"Macon","countryCode":"US","latitude":"32.7288","postalCode":"31216","countryName":"United States","longitude":"-83.6865"},
 "locationByFederation":"not found","requestIp":"24.252.192.1","locationByCoverageZone":"not found"}

Using the X-MM-Client-IP header, it returned the proper cache based on the
CZ; it correctly sent the request to the cache in us-ga-atlanta:
1485522786.423 qtype=HTTP chi=24.252.192.1 url="http://crs.cox-col-jitp2.cdn1.coxlab.net/" cqhm=GET cqhv=HTTP/1.1 rtype=CZ rloc="34.03,-84.32" rdtl=- rerr="-" rgb="-" pssc=302 ttms=0.260 rurl="http://cdn1cdedge0007.cox-col-jitp2.cdn1.coxlab.net/" rh="-"

I then changed the coordinates in the CZF to match the us-ca-sandiego
group, and now the request is sent to the us-ca-sandiego caches:
1485523546.345 qtype=HTTP chi=24.252.192.1 url="http://crs.cox-col-jitp2.cdn1.coxlab.net/" cqhm=GET cqhv=HTTP/1.1 rtype=CZ rloc="32.72,-117.08" rdtl=- rerr="-" rgb="-" pssc=302 ttms=0.206 rurl="http://cdn1cdedge0001.cox-col-jitp2.cdn1.coxlab.net/" rh="-"

I'm using 1.6.1 + the patch discussed in this email. Not sure if the patch
is necessary, but I'll need to try on an unpatched version.

Do we want to fix API to reflect CZF?

Thanks for your help.

Steve








On Thu, Jan 26, 2017 at 4:47 PM, Jeff Elsloo
<jeff.els...@gmail.com> wrote:

Dave just let me know that in this case you don't have any caches
assigned in us-ga-macon. I'm not sure how the API behaves at that
point; it likely won't follow the same "next best cache group" logic,
as it was designed as a simple lookup tool.

Can you try simulating a request through Traffic Router directly, using
the X-MM-Client-IP header or the fakeClientIpAddress query parameter,
with the example IP of 24.252.192.0? After you do so, check the
coordinates in the log entry and see if the result is a CZ hit.
--
Thanks,
Jeff


On Thu, Jan 26, 2017 at 2:03 PM, Jeff Elsloo
<jeff.els...@gmail.com>
wrote:
Are you 100% sure that the Traffic Router has loaded the updated CZF?
If so, what happens when you use an IP within the /20 instead of the
network address (.0)? I tried using a network address of a /22 on a
1.8 TR and it hit the CZF as expected. Ultimately what you're seeing
is a CZF miss, unrelated to the geo coordinates.
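
For reference, a plain containment check treats the network address and
a host address of the /20 the same way, so either would be expected to
match. The sketch below uses Python's standard ipaddress module; the
czf_match helper and the czf dict are illustrative only (first match wins)
and are not how Traffic Router implements its lookup. The networks are the
ones from the us-ga-macon CZF excerpt quoted in this thread.

```python
import ipaddress

# Networks taken from the us-ga-macon CZF excerpt quoted in this thread.
czf = {"us-ga-macon": ["24.252.192.0/20", "68.1.20.0/22"]}

def czf_match(client_ip, coverage_zones):
    """Return the first cache group whose networks contain client_ip, else None."""
    addr = ipaddress.ip_address(client_ip)
    for group, networks in coverage_zones.items():
        for cidr in networks:
            if addr in ipaddress.ip_network(cidr):
                return group
    return None

print(czf_match("24.252.192.0", czf))  # network address of the /20
print(czf_match("24.252.192.1", czf))  # host inside the /20
print(czf_match("24.253.0.1", czf))    # outside both networks
```

The first two calls return us-ga-macon and the third returns None, which
is why a miss for 24.252.192.0 points at something other than the choice
of address.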

The underlying feature with the coordinates is to select the next best
cache group by proximity where healthy caches have a given delivery
service assigned. In order to test that, you would need to have a CZF
hit in a cache group which doesn't have that particular delivery
service assigned to any caches, or have all caches within that cache
group with that delivery service in an unhealthy state.

--
Thanks,
Jeff


On Wed, Jan 25, 2017 at 1:33 PM, Steve Malenfant
<smalenf...@gmail.com>
wrote:
Jeff,

I've tried this coverage zone file coordinate overwrite... I might be
missing something.

I defined the following :

     "us-ga-macon": {
         "coordinates": {
             "latitude": "32.7261",
             "longitude": "-83.6547"
         },
         "network": [
             "24.252.192.0/20",
             "68.1.20.0/22",


Then issued the following query :

curl http://traffic_router:3333/crs/stats/ip/24.252.192.0

{"locationByGeo":{"city":"Macon","countryCode":"US","latitude":"32.7288","postalCode":"31216","countryName":"United States","longitude":"-83.6865"},
 "locationByFederation":"not found","requestIp":"24.252.192.0","locationByCoverageZone":"not found"}

I was expecting "locationByCoverageZone" to find something...

I tried on 1.6.0 and 1.6.1 (patched with the pastebin above, which I
wasn't sure I was supposed to do).

Would you mind shedding some light on this?

Thanks,

Steve


On Mon, Jan 23, 2017 at 3:05 PM, Jeff Elsloo
<jeff.els...@gmail.com>
wrote:

Yes; the feature went into 1.5.x.
--
Thanks,
Jeff


On Thu, Jan 19, 2017 at 10:37 AM, Steve Malenfant <
smalenf...@gmail.com>
wrote:
I didn't know about this, which is good information. Does that work on
Traffic Router 1.6?

On Mon, Jan 9, 2017 at 12:44 PM, Eric Friedrich (efriedri) <
efrie...@cisco.com> wrote:

Jeff and I had a quick Slack convo, so I’ll add a followup summary here
in case anyone else is interested.

Cache Group location (lat/long) is configured in Traffic Ops today (and is
used for computing distance from the MaxMind Geolocation).

You can also configure the location (lat/long) for a Cache Group in the
CoverageZone file (example below).

When this location is configured (and Jeff’s suggested logic fix from
below is applied) and all caches in the mapped cache group are
unavailable, TR will send a client request to the cache group that is
closest to the original mapped group.

Example CZF w/ cache location
-----
"coverageZones": {
 "edge-cg-1": {
   "network6": [
     ...
   ],
   "network": [
     ...
   ],
   "coordinates": {
     "longitude": "-75.3342",
     "latitude": "42.555"
   }
 },


—Eric


On Jan 5, 2017, at 12:06 PM, Jeff Elsloo
<jeff.els...@gmail.com>
wrote:

If we applied the proposed change, given your scenario we should fall
through to the return statement that calls getClosestCacheLocation().
That method will order all cache groups based on their lat/long and
the lat/long of the cache group we hit on in the CZF. Once the list is
ordered, we iterate through the list until we find a cache group that
has available caches for that DS.

BTW, the stuff on line 536 is likely to produce the exact same result
as the check that precedes it. networkNode.getLoc() will return the
string name of the cache group, so when we find the CacheLocation, it
will be the same as what we had just checked. We could probably get
away with removing that part of the method as it's redundant.
--
Thanks,
Jeff


On Wed, Jan 4, 2017 at 11:54 AM, Eric Friedrich (efriedri)
<efrie...@cisco.com> wrote:
Where would TR look outside the assigned cache group to find the next
closest cache group?

On Jan 4, 2017, at 11:25 AM, Eric Friedrich (efriedri) <
efrie...@cisco.com> wrote:


On Jan 3, 2017, at 5:20 PM, Jeff Elsloo <jeff.els...@gmail.com<mailto:jeff.els...@gmail.com>> wrote:

Hey Eric,

It sounds like the use case you're after is an RFC 1918 client
associated with a cache group whose caches are all unavailable for one
reason or another. Is that correct?

Yes, exactly.


I looked at the code a bit, and I think that we can make a minor
change to achieve the behavior you're looking for as long as you're
able to put your RFC 1918 ranges in the CZF.

Yes, we would want those ranges in the CZF. I can’t think of any other
place they would go.


There's a small logic gap in the existing algorithm around cache
location selection and I think if we fix that (two line change), we
should be better off all around. I think the only time we'd ever want
to go to the geolocation provider is in the event of a miss on the
CZF, so as long as we have a hit there, we should find the cache group
closest to that hit location that has available caches. This would
automatically provide the "backup" cache group concept, and has the
added benefit of doing this selection dynamically based on the state
of the CDN.

Wow, thanks for picking up on this solution. Sounds like a strong
possibility. I like that it can extend dynamically.



See this to get an idea of what I mean:
http://apaste.info/u3PQo
https://github.com/apache/incubator-trafficcontrol/blob/249bd7504eeb7cc43402126f3719017e2475ad33/traffic_router/core/src/main/java/com/comcast/cdn/traffic_control/traffic_router/core/router/TrafficRouter.java#L536

Does this line set cacheLocation to the closest cache group with
active caches on that DS?

What does networkNode.getLoc() actually return?

—Eric



Obviously we'd need to test this to ensure we don't break other
functionality.
--
Thanks,
Jeff


On Tue, Jan 3, 2017 at 10:07 AM, Eric Friedrich
(efriedri)
<efrie...@cisco.com<mailto:efrie...@cisco.com>> wrote:
If all caches in the primary cache group are unavailable, our goal is
to provide a backup routing policy for RFC1918 clients.

When the client IP is a public Internet IP, the current backup policy is
to assign the client to the geographically closest cache (Distance =
MaxMind Geo Lat/Long - configured CG lat/long).

When the client IP is an RFC1918 IP, the client would not have a MaxMind
geo-loc, so it would fall back to the DS geo-miss lat/long. We’d prefer
some more granular control over where these clients are routed to, rather
than a per-DS setting.


So with an RFC1918 client, the lookup process would be (step 3 is the only
addition):
1) Check the CZF for a subnet match (and find a match for an existing
   cache group). Assign the client to that CG.
2) Check the CG for available (online and associated w/ the DS) servers.
   In this particular case, assume the CG has no servers available to
   route the client to.
3) Walk the CZF's list of backup CGs and perform the check from #2 for
   each CG. Use the first server that is found.
4) Assuming no server is found in #3, perform geo-location and find the
   closest cache group. Use a server from the closest CG if one is found.
4a) If geo-location returns null, use the DS’ default geo-miss
    location as the client location.
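
The proposed lookup order can be sketched in Python as follows. This is
only an illustration of the steps listed above, not Traffic Router code:
route_client, the "available" set and the closest_by_geo callback are
hypothetical names, and the CZF fragment reuses the backupList example
from the original proposal in this thread.

```python
import ipaddress

def route_client(client_ip, coverage_zones, available, closest_by_geo):
    """Sketch of the proposed lookup order for one client request.

    coverage_zones -- CZF-like dict: {cg: {"network": [...], "backupList": [...]}}
    available      -- set of cache groups with online servers assigned to the DS
    closest_by_geo -- callable(client_ip) -> cache group or None (steps 4 and 4a)
    Returns (cache_group, how_it_was_chosen).
    """
    addr = ipaddress.ip_address(client_ip)
    for cg, zone in coverage_zones.items():
        networks = zone.get("network", [])
        if any(addr in ipaddress.ip_network(n) for n in networks):
            if cg in available:                        # steps 1 and 2
                return cg, "czf"
            for backup in zone.get("backupList", []):  # step 3
                if backup in available:
                    return backup, "czf-backup"
            break  # CZF hit, but neither the primary nor a backup can serve
    return closest_by_geo(client_ip), "geo"            # steps 4 and 4a

czf = {
    "cache-group-01": {
        "backupList": ["cache-group-02", "cache-group-03"],
        "network": ["192.168.8.0/24", "192.168.9.0/24"],
    }
}

# Hypothetical state: the primary and the first backup have no servers.
result = route_client("192.168.8.17", czf, {"cache-group-03"}, lambda ip: None)
print(result)  # ('cache-group-03', 'czf-backup')
```

Because the backup list lives on the CZF entry rather than on the cache
group, two subnets that map to the same primary group could still list
different backups.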

—Eric


On Dec 26, 2016, at 10:01 AM, Jan van Doorn
<j...@knutsel.com
<mailto:
j...@knutsel.com>> wrote:

Hi Eric,

How does the backup list relate to the RFC1918-is-not-in-geo problem?

To get to a cachegroup you need to get a match in the coverage zone, I
would think?

Rgds,
JvD

On Dec 22, 2016, at 12:28, Eric Friedrich (efriedri) <
efrie...@cisco.com<mailto:efrie...@cisco.com>> wrote:

The current behavior of cache group selection works as follows:
1) Look for a subnet match in the CZF.
2) Use MaxMind/Neustar for GeoLocation based on the client IP. Choose the
   closest cache group.
3) Use the Delivery Service Geo-Miss Lat/Long. Choose the closest cache
   group.


For deployments where IP addressing is primarily private (say, RFC-1918
addresses), client IP Geo Location (#2) is not useful.


We are considering adding another field to the Coverage Zone File that
configures an ordered list of backup cache groups to try if the primary
cache group does not have any available caches.

Example:

"coverageZones": {
"cache-group-01": {
"backupList": ["cache-group-02", "cache-group-03"],
"network6": [
"1234:5678::\/64",
"1234:5679::\/64"],
"network": [
"192.168.8.0\/24",
"192.168.9.0\/24"]
}
}

This configuration could also be part of the per-cache-group
configuration, but that would give less control over which clients
prefer which cache groups. For example, you may have cache groups in LA,
Chicago and NY. If the Chicago cache group fails, you may want some of the
Chicago clients to go to LA and some to go to NY. If the backup CG
configuration is per-CG, we would not be able to control where clients are
allocated.

Looking for opinions and comments on the above proposal; this is still
in the idea stage.

Thanks All!
Eric

















  --

  *Oren Shemesh*
  Qwilt | Work: +972-72-2221637 <072-222-1637>| Mobile:
+972-50-2281168 <050-228-1168> | or...@qwilt.com<mailto:or...@qwilt.com>
  <y...@qwilt.com<mailto:y...@qwilt.com>>







--

*Ori Finkelman*Qwilt | Work: +972-72-2221647 <072-222-1647> | Mobile:
+972-52-3832189 <052-383-2189> | o...@qwilt.com<mailto:o...@qwilt.com>




--

*Ori Finkelman*Qwilt | Work: +972-72-2221647 | Mobile: +972-52-3832189 |
o...@qwilt.com<mailto:o...@qwilt.com>
