On Jul 15, 2013, at 5:18 PM, Andy Litzinger <andy.litzin...@theplatform.com> 
wrote:

>  I'd like to be able to collect enough relevant data to pinpoint the trouble 
> spot as much as possible so I can take it to the ISPs and request a solution. 
>  The blackouts are so quick that it's impossible to log in and get a trace- 
> hence the desire to automate it.
> 
> I can provide more details off list if helpful- I'm trying not to vilify 
> anyone- especially without copious amounts of data points.
> 
> As a side question, what should my expectation be regarding packet loss when 
> sending packets from point A to point B across multiple providers across the 
> internet?  Is 30 seconds to a minute of blackout between two destinations 
> every couple of weeks par for the course?  My directly connected ISPs offer 
> me an SLA, but what should I reasonably expect from them when one of their 
> upstream peers (or a peer of their peers) has issues?  If this turns out to 
> be BGP reconvergence or similar do I have any options?

I think there are a number of tools available to detect if something is 
happening:

1) iperf (test network/bw usage)
2) owamp (one-way ping) - you can use this to detect when reordering or other 
events happen.  This will collect nearly continuous data.  It requires good NTP 
references, or accepting that you may see skewed data.
3) some other UDP/low-latency responder.  I've built something of my own that 
does this; I can provide a pointer if you are interested.  I have graphs of my 
connection at home to someplace remote that crosses 3 carriers.  You can see 
the queuing delay increase throughout the day until peak times and taper off 
at night.  No loss, but the increase is quite visible.
4) some vendor SLA/SAA product.  Cisco and others have SAA responders that work 
on their devices you can configure to collect data.
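
For option 3, here is a minimal sketch of what such a prober could look like. 
This is not my tool, just an illustration.  It assumes a UDP echo responder is 
already listening on the far end and simply sends back whatever it receives. 
The names probe() and find_blackouts() and all the default values are made up 
for the example.

```python
import socket
import time


def probe(host, port, count=10, interval=0.1, timeout=0.5):
    """Send numbered UDP datagrams to an echo responder.

    Returns a list of (send_wall_time, rtt_seconds) tuples; rtt is None
    when no matching reply arrived within `timeout` of the last receive.
    """
    samples = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for seq in range(count):
            payload = str(seq).encode()
            sent_wall = time.time()      # wall clock, for correlating with ISP logs
            start = time.monotonic()     # monotonic clock, for the RTT itself
            rtt = None
            try:
                sock.sendto(payload, (host, port))
                while True:
                    data, _ = sock.recvfrom(64)
                    if data == payload:  # skip late replies to earlier probes
                        rtt = time.monotonic() - start
                        break
            except OSError:              # timeouts and ICMP errors count as loss
                pass
            samples.append((sent_wall, rtt))
            time.sleep(interval)
    return samples


def find_blackouts(samples, min_lost=3):
    """Return (first_lost_time, last_lost_time, lost_count) for every run
    of at least `min_lost` consecutive lost probes."""
    blackouts = []
    run = []
    for sent, rtt in samples:
        if rtt is None:
            run.append(sent)
            continue
        if len(run) >= min_lost:
            blackouts.append((run[0], run[-1], len(run)))
        run = []
    if len(run) >= min_lost:             # run still open at the end of the data
        blackouts.append((run[0], run[-1], len(run)))
    return blackouts
```

Run continuously at a short interval, the timestamps from find_blackouts() give 
you start/end times for each outage to hand to the ISPs.  The same loss 
condition could also be used to kick off a traceroute automatically while the 
blackout is still in progress.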

That being said, I would expect losing the network for 30 seconds once every 
2 weeks to be fairly common.  Someone will be doing network upgrades/work, or 
there will be a hardware/transmission error, etc.

30 seconds sounds a lot like BGP convergence, and on older platforms, e.g. 
6500/sup720, expect about 8k prefixes/second max to be downloaded into the 
TCAM/FIB.  With 400k+ prefixes, it takes a while to pump the tables into the 
forwarding side.
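
As a back-of-the-envelope check on those numbers, here is a small sketch.  It 
assumes the whole table is reinstalled at a constant rate, which is a 
simplification; fib_download_seconds() is just a name made up for the example.

```python
def fib_download_seconds(prefixes, prefixes_per_second):
    """Time to push `prefixes` routes into the FIB at a fixed install rate."""
    return prefixes / prefixes_per_second


# 400k prefixes at roughly 8k prefixes/second
print(fib_download_seconds(400_000, 8_000))  # 50.0
```

So a full reload of the table on that class of hardware lands in the same 
range as the 30 seconds to a minute of blackout being described.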

- Jared
