Re: Consistent asymetric latency on monitoring?

2009-10-22 Thread Perry Lorier

Rick Ernst wrote:

Resent, since I responded from the wrong address:
---
The basic operation of IP SLA is as surmised; payload with timestamps
and other telemetry data is sent to a 'responder' which manipulates
the payload, including adding its own timestamps, and returns the
altered payload.
  


Yup :) It's the obvious way to do it :)


I had to do a mental walk-through, but I think I see how drift can
cause this. I'm going to generate some artificial data, graph it, and
see if it matches the general waveshape I'm seeing.

I purposefully have the traffic generators ntp syncing against the
responders. I thought that would keep the clocks more closely in sync.
I don't necessarily care if the time is 'right', just that it's the
same. 


This causes major problems.  What you're actually measuring here is how 
well ntp can keep the clock sync'd under asymmetric latency.  ntp is 
trying to make its own measurements of one-way delay, and of clock drift 
as well, without a trusted clock to help it.  As you can see from your 
graphs ntp is not coping[1].


You are far better off having each end sync to a local stratum 1 or 
stratum 2 ntp source, preferably one over a different link to the one 
under test.  If you don't have a local stratum 1/2 time source at each 
end, you might be able to find one over a local exchange or other less 
congested link.  If this is very important to you then you should 
consider running your own stratum 1 clocks at each end, synchronised off 
something like GPS, CDMA or a T1 clock.



What kind of difference should I expect if I sync both
generators and responders against the same source, or not sync the
responder? I'm thinking that having one source with constant drift may
be better than both devices trying to walk/correct the time.
  


Most hardware clocks in PCs/routers/switches etc have pretty atrocious 
amounts of drift if left to free run[2], sometimes in the order of 
seconds or occasionally minutes per week.  To get useful numbers you 
really do need to synchronise them to /something/.  Synchronising them to 
each other causes problems as ntp, I think (I could be wrong), assumes 
mostly symmetrical latency, and if the latency isn't symmetric assumes 
it's because one clock is running fast/slow and will alter the clock's 
speed to account for it.  The great thing about ntp stratum 1 servers is 
that by definition they have more or less the same time no matter where 
they are, so synchronising each end against a local ntp server will be a 
much much better solution.  If possible you should consider peering with 
at least 3, preferably 4(!)[3], other upstream ntp servers.
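
As a rough illustration of the symmetry assumption: the standard NTP on-wire calculation takes four timestamps and halves the round trip to estimate the offset, so any difference between forward and reverse delay shows up as an offset error of half that difference. The sketch below is a simplified model, not ntpd itself; `simulate_exchange` and its parameters are hypothetical names made up for the example.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Standard NTP on-wire calculation.

    t1: client send time    (client clock)
    t2: server receive time (server clock)
    t3: server send time    (server clock)
    t4: client receive time (client clock)
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0
    delay = (t4 - t1) - (t3 - t2)
    return offset, delay


def simulate_exchange(true_offset, fwd_delay, rev_delay, t1=0.0, server_hold=0.001):
    """Hypothetical helper: build the four timestamps for one exchange.

    true_offset is (server clock - client clock); all values are in seconds.
    """
    t2 = t1 + fwd_delay + true_offset      # arrival, read on the server clock
    t3 = t2 + server_hold                  # server replies shortly afterwards
    t4 = (t3 - true_offset) + rev_delay    # arrival, read on the client clock
    return t1, t2, t3, t4


# Clocks are actually identical (true_offset = 0), but the path is asymmetric.
for fwd, rev in ((0.020, 0.020), (0.030, 0.010), (0.010, 0.030)):
    est_offset, est_delay = ntp_offset_and_delay(*simulate_exchange(0.0, fwd, rev))
    print(f"fwd={fwd * 1e3:.0f} ms rev={rev * 1e3:.0f} ms "
          f"estimated offset={est_offset * 1e3:+.1f} ms "
          f"round trip={est_delay * 1e3:.1f} ms")
```

With equal delays the estimate is correct; with unequal delays the estimated offset is wrong by (fwd - rev) / 2 even though the clocks agree, which is the error a client would then try to correct.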


[1]: To be fair it's a hard problem.  Anything that involves time just 
gets more and more complicated the more you look at it.  ntp is extremely 
clever and probably knows more about time than I'd ever want to know, 
but you're making its job hard.


[2]: http://vancouver-webpages.com/time/ / 
http://vancouver-webpages.com/time/ltmhist.png
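
For a sense of scale, a frequency error in parts per million converts to accumulated drift as below. The ppm values are illustrative assumptions for the arithmetic, not figures taken from the page above.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604800


def drift_seconds(ppm, elapsed_seconds):
    """Clock error accumulated by a free-running clock that is `ppm` off frequency."""
    return ppm * 1e-6 * elapsed_seconds


for ppm in (1, 10, 100, 500):  # assumed example values
    per_week = drift_seconds(ppm, SECONDS_PER_WEEK)
    per_minute_ms = drift_seconds(ppm, 60) * 1000.0
    print(f"{ppm:>4} ppm: {per_week:8.2f} s/week, {per_minute_ms:6.2f} ms/minute")
```

The per-minute column is the one that matters for one-way latency: a clock that is only tens of ppm off moves by a millisecond or more within a few minutes, which is the same size as the latencies being measured.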


[3]: 
http://twiki.ntp.org/bin/view/Support/SelectingOffsiteNTPServers#Section_5.3.3.




Re: Consistent asymetric latency on monitoring?

2009-10-22 Thread Roland Dobbins


On Oct 21, 2009, at 11:03 PM, Rick Ernst wrote:

 I thought that would keep the clocks more closely in sync.  I don't  
necessarily care if the time is 'right', just that it's the same.


ntp is a pretty basic operational requirement for any network,  
irrespective of the use of IP SLA, is it not?


---
Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com

Sorry, sometimes I mistake your existential crises for technical
insights.

-- xkcd #625




Re: Consistent asymetric latency on monitoring?

2009-10-22 Thread Rick Ernst
Lots of good info, and a nice mind-dump that gives me a whole host of other
things that need to be looked at... Umm. thanks :)




Consistent asymetric latency on monitoring?

2009-10-21 Thread Rick Ernst
Although the implementation is Cisco-specific, this feels more appropriate
for NANOG.

We've started rolling out a state-wide monitoring system based on Cisco's
IP SLA feature set.  Out of 5 sites deployed so far (different locations,
different providers), we are consistently seeing one-way latency mirror the
opposite direction. As source-destination latency goes up,
destination-source latency goes down and vice versa.

The monitoring team and I have ripped apart the OIDs, IP SLA
configuration, and monitoring system.  We've also built an ad-hoc system to
compare the results.  It's still consistent behavior.  It's not a true
mirror; there is definitely variation between the data collection, but at
the 10,000 foot level, there is an obvious and consistent mirror to the
data.

The network topology is independent service providers all providing backhaul
to a local ethernet exchange.

Has anybody seen this type of behavior? We are solidly convinced that we are
using the proper OIDs and making the proper transformations of the data.
The two remaining causes appear to be either natural behavior of the links
and/or an artifact in the IP SLA mechanism.

Any ideas?


Thanks!


Re: Consistent asymetric latency on monitoring?

2009-10-21 Thread Perry Lorier

Rick Ernst wrote:

Although the implementation is Cisco-specific, this feels more appropriate
for NANOG.

We've started rolling out a state-wide monitoring system based on Cisco's
IP SLA feature set.  Out of 5 sites deployed so far (different locations,
different providers), we are consistently seeing one-way latency mirror the
opposite direction. As source-destination latency goes up,
destination-source latency goes down and vice versa.

The monitoring team and I have ripped apart the OIDs, IP SLA
configuration, and monitoring system.  We've also built an ad-hoc system to
compare the results.  It's still consistent behavior.  It's not a true
mirror; there is definitely variation between the data collection, but at
the 10,000 foot level, there is an obvious and consistent mirror to the
data.

The network topology is independent service providers all providing backhaul
to a local ethernet exchange.

Has anybody seen this type of behavior? We are solidly convinced that we are
using the proper OIDs and making the proper transformations of the data.
The two remaining causes appear to be either natural behavior of the links
and/or an artifact in the IP SLA mechanism.

Any ideas?
  



I have never used Cisco's IP SLA (or even read about it), so take this 
with a sack of salt.


I assume this product works by having a packet with a timestamp sent 
from the source to the destination where it is timestamped again and 
either sent back, or another packet is sent in the other direction.  The 
difference between the two timestamps gives you the latency in that 
direction.


Now, how are your clocks synchronised?  Are they synchronised using NTP, 
or something better (GPS?)?  If one of your clocks is drifting with 
respect to the other then you'll see this effect.  Does your clock drift 
because NTP is failing to keep the clock well synchronised when its 
connection to its parent NTP server is saturated?
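
To make the effect concrete, here is a minimal sketch of one-way latency computed from two timestamps taken on different clocks. It is a simplified model, not Cisco's implementation; `responder_offset_ms` (how far the responder's clock is ahead of the sender's) is a name invented for the example.

```python
def measured_one_way(true_fwd_ms, true_rev_ms, responder_offset_ms):
    """Return (measured_fwd_ms, measured_rev_ms) as the two ends would compute them.

    responder_offset_ms > 0 means the responder's clock is ahead of the sender's.
    """
    # Forward: stamped on the sender's clock, read on the responder's clock,
    # so a fast responder clock makes the trip look longer.
    measured_fwd = true_fwd_ms + responder_offset_ms
    # Reverse: stamped on the responder's clock, read on the sender's clock,
    # so the same offset makes the return trip look shorter.
    measured_rev = true_rev_ms - responder_offset_ms
    return measured_fwd, measured_rev


# The true latency is a constant 10 ms each way; only the clock offset changes.
for offset in (-4.0, 0.0, 4.0):
    fwd, rev = measured_one_way(10.0, 10.0, offset)
    print(f"offset={offset:+.1f} ms  fwd={fwd:.1f} ms  rev={rev:.1f} ms  "
          f"round trip={fwd + rev:.1f} ms")
```

The offset is added in one direction and subtracted in the other, so the two one-way figures mirror each other while the round-trip total stays the same.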





Re: Consistent asymetric latency on monitoring?

2009-10-21 Thread Nathan Ward

On 22/10/2009, at 2:31 PM, Perry Lorier wrote:

I assume this product works by having a packet with a timestamp sent  
from the source to the destination where it is timestamped again and  
either sent back, or another packet is sent in the other direction.   
The difference between the two timestamps gives you the latency in  
that direction.


I believe a packet is sent, and the target router responds with a  
timestamp.


But yeah, timestamps are being compared.

I'm with Perry though - sounds like your clocks are drifting.

--
Nathan Ward



Re: Consistent asymetric latency on monitoring?

2009-10-21 Thread Rick Ernst
Resent, since I responded from the wrong address:
---
The basic operation of IP SLA is as surmised; payload with timestamps
and other telemetry data is sent to a 'responder' which manipulates
the payload, including adding its own timestamps, and returns the
altered payload.

I had to do a mental walk-through, but I think I see how drift can
cause this. I'm going to generate some artificial data, graph it, and
see if it matches the general waveshape I'm seeing.
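
One way the artificial-data experiment could be sketched, under loudly stated assumptions: a constant true latency, a responder clock that drifts linearly and is stepped back to zero at a fixed interval, and a little random jitter. The sawtooth is a crude stand-in chosen for simplicity; it does not model how ntp actually disciplines a clock, and every parameter name and value is hypothetical.

```python
import random
import statistics


def simulate(samples=600, drift_ppm=50.0, resync_every=120,
             true_ms=10.0, jitter_ms=0.2, seed=1):
    """Generate one sample per second of measured forward/reverse latency (ms)."""
    rng = random.Random(seed)
    offset_ms = 0.0                  # responder clock minus sender clock
    fwd, rev = [], []
    for second in range(samples):
        if second % resync_every == 0:
            offset_ms = 0.0          # assumed: clock is stepped back into sync
        else:
            offset_ms += drift_ppm * 1e-6 * 1000.0   # ms of drift per second
        fwd.append(true_ms + rng.uniform(-jitter_ms, jitter_ms) + offset_ms)
        rev.append(true_ms + rng.uniform(-jitter_ms, jitter_ms) - offset_ms)
    return fwd, rev


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


fwd, rev = simulate()
print(f"forward range: {min(fwd):.2f} to {max(fwd):.2f} ms")
print(f"reverse range: {min(rev):.2f} to {max(rev):.2f} ms")
print(f"correlation between directions: {pearson(fwd, rev):.3f}")
```

A strongly negative correlation between the two directions, with a roughly constant sum, is the signature of clock offset rather than of the links themselves. The two lists can be fed to any plotting tool to compare the waveshape against the real graphs.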

I purposefully have the traffic generators ntp syncing against the
responders. I thought that would keep the clocks more closely in sync.
I don't necessarily care if the time is 'right', just that it's the
same. What kind of difference should I expect if I sync both
generators and responders against the same source, or not sync the
responder? I'm thinking that having one source with constant drift may
be better than both devices trying to walk/correct the time.

Thanks for the input!







Re: Consistent asymetric latency on monitoring?

2009-10-21 Thread Mikael Abrahamsson

On Wed, 21 Oct 2009, Rick Ernst wrote:

Has anybody seen this type of behavior? We are solidly convinced that we 
are using the proper OIDs and making the proper transformations of the 
data. The two remaining causes appear to be either natural behavior of 
the links and/or an artifact in the IP SLA mechanism.


I've been using IP SLA for years (right now under 12.4) and I have not 
seen behaviour that mirrors what you see. I often see one-way latency go 
up without the other way doing so.


You should start by looking in show ip sla (monitor) op and see what 
values you see in the router, that might give you more information 
regarding where the problem might be (your polling system or if the IP SLA 
agent is actually reporting what you see).


--
Mikael Abrahamsson    email: swm...@swm.pp.se