Re: Hurricane Electric packet loss

2014-07-22 Thread Tim Heckman
Hey Wolfgang,

I believe I may be seeing similar behavior but it's hard for me to
confirm. My network configuration is one that mtr doesn't support, so
I can't get a report when we're having issues. I don't have my transit
provided directly from HE, but rather through a provider who colocates
out of one of their facilities. So I'm not sure I could even directly
reach out to the Hurricane Electric NOC to get help.

We've been seeing odd connectivity issues between HE FMT2 (Linode)
and AWS US-WEST-1 and US-WEST-2. It's a mix of loss and increased
latency, both of which cause hiccups in some of our
WAN-based clusters. There have been times where the issues we've seen
have been attributed to a DoS attack directed toward a Linode
customer, but there have been quite a few networking events that seem
to have no relation to a known attack.

Thanks for reaching out to NANOG with this issue; it may have shed
some light on some of the issues we are seeing.

Cheers!
-Tim

On Tue, Jul 22, 2014 at 2:48 AM, Wolfgang Nagele (AusRegistry)
wolfgang.nag...@ausregistry.com.au wrote:
 Hi,

 We’ve been customers of Hurricane Electric for a number of years now and 
 always been happy with their service.

 In recent months packet loss on some of their major routes has become a very 
 common (every few days) occurrence. Without knowledge of their network I am 
 unsure what’s the cause of it but we’ve seen it on the Tokyo - US routes as 
 well as the London - US routes. It reminds me of the Cogent expansion which 
 was carried out by unsustainable oversubscription which eventually resulted 
 in unusable service for a number of years. Having seen some of the rates that 
 HE has been selling for I can’t help but wonder if they made the same mistake 
 ...

 Here is an example of what’s going on again atm.
 HOST: prolocation01.ring.nlnog.ne  Loss%   Snt   Last   Avg  Best  Wrst StDev
   1.|-- 2a00:d00:ff:136::253        0.0%    11    0.3   0.3   0.3   0.4   0.0
   2.|-- 2a00:d00:1:12::1            0.0%    10    0.7   0.8   0.7   1.1   0.1
   3.|-- hurricane-electric.nikhef   0.0%    10    0.7   3.1   0.7   8.3   2.9
   4.|-- 100ge9-1.core1.lon2.he.ne   0.0%    10    9.8  12.6   8.0  19.2   4.1
   5.|-- 100ge1-1.core1.nyc4.he.ne  10.0%    10   74.7  74.6  73.7  80.8   2.3
   6.|-- 10ge10-3.core1.lax1.he.ne  30.0%    10  133.4 138.0 133.4 145.1   4.8
   7.|-- 10ge1-3.core1.lax2.he.net  20.0%    10  135.7 139.1 133.4 145.1   4.5
   8.|-- 2001:504:13::3b            40.0%    10  143.2 143.1 142.1 144.4   0.8
   9.|-- 2402:7800:100:1::55        50.0%    10  144.4 144.1 143.8 144.4   0.2
  10.|-- 2402:7800:0:1::f6          60.0%    10  298.7 298.4 298.2 298.7   0.2
  11.|-- ge-0-1-4.cor02.syd03.nsw.  10.0%    10  299.3 298.9 298.3 299.5   0.5
  12.|-- 2402:7800:0:2::18a20.0%             10  299.7 299.4 298.9 300.1   0.4
  13.|-- 2001:dcd:12::10            30.0%    10  299.8 299.5 298.8 300.0   0.5

 Is anybody else observing this as well?

 Cheers,
 Wolfgang


Re: Connectivity issue between Verizon and Amazon EC2

2014-07-21 Thread Tim Heckman
I am seeing the same issue between AWS US-WEST 2 and Hurricane Electric's
Fremont 2 location (Linode). Looks to be deep within Amazon's network
based on changes in latency in a simple traceroute.

I would provide an mtr, however my network configuration is something mtr
doesn't support.

Cheers!
-Tim

On Jul 21, 2014 8:44 PM, Roland Dobbins rdobb...@arbor.net wrote:


 On Jul 22, 2014, at 10:31 AM, Ray Van Dolson rvandol...@esri.com wrote:

  We're seeing poor performance (very slow download speeds -- 
100KB/sec) to certain EC2 instances via our Verizon hosted circuits.

 Have you tried dorking around with your MTU to see if that makes a
difference?

 --
 Roland Dobbins rdobb...@arbor.net // http://www.arbornetworks.com

Equo ne credite, Teucri.

   -- Laocoön



Re: Connectivity issue between Verizon and Amazon EC2

2014-07-21 Thread Tim Heckman
Realized I sent the reply to Roland. Apologies.

Here it is in full:



I am seeing the same issue between AWS US-WEST 2 and Hurricane Electric's
Fremont 2 location (Linode). Looks to be deep within Amazon's network
based on changes in latency in a simple traceroute.

I would provide an mtr, however my network configuration is something mtr
doesn't support.

Cheers!
-Tim
On Jul 21, 2014 8:34 PM, Ray Van Dolson rvandol...@esri.com wrote:

 I'm short some important details on this one, but hopefully can fill in
 more shortly.

 We're seeing poor performance (very slow download speeds -- 
 100KB/sec) to certain EC2 instances via our Verizon hosted circuits.
 The issue is reproducible on both our production Gigabit circuit as
 well as a consumer-grade Verizon FIOS line.

 Speeds are normal (10MB/sec plus) via non-Verizon circuits we've
 tested.

 Source IPs are in the 198.102.62.0/24 range and destination on the EC2
 side is 54.197.239.228.  I'm not sure in which availability zone the
 latter IP sits, but hope to find out shortly.

 MTR traceroute details are as follows:

 Host                 Loss%   Snt  Drop    Avg   Best   Wrst  StDev
   1. 198.102.62.253    0.0%   526     0    0.2    0.2    0.5    0.0
   2. 152.179.250.141   0.0%   526     0   14.1    7.0   19.4    3.6
   3. 140.222.225.135  37.5%   526   197    7.7    6.8   35.8    1.9
   4. 129.250.8.85      0.0%   526     0    8.1    7.4   11.7    0.3
   5. 129.250.2.229    10.3%   525    54   11.4    7.1   85.7    9.6
   6. 129.250.2.169    41.5%   525   218   63.0   45.5  130.7   10.3
   7. 129.250.2.154     0.2%   525     1   59.9   44.5   69.0    4.0
   8. ???
   9. 54.240.229.96     7.8%   525    41   76.6   71.3  119.9    8.6
      54.240.229.104
      54.240.229.106
  10. 54.240.229.2      6.9%   525    36   74.7   71.6  109.1    4.9
      54.240.229.4
      54.240.229.20
      54.240.229.8
      54.240.229.14
      54.240.228.254
      54.240.229.16
      54.240.229.10
  11. 54.240.229.174    5.5%   525    29   76.0   71.7  109.0    7.3
      54.240.229.162
      54.240.229.160
      54.240.229.170
      54.240.229.172
      54.240.229.168
      54.240.229.164
  12. 54.240.228.167   94.5%   525   495   76.4   71.7  126.0   11.6
      54.240.228.169
      54.240.228.165
      54.240.228.163
  13. 72.21.220.108     5.1%   525    27   75.2   71.3  112.6    6.8
      205.251.244.12
      72.21.220.8
      205.251.244.64
      72.21.220.96
      205.251.244.8
      72.21.220.6
      205.251.244.4
  14. 72.21.220.45      9.0%   525    47   74.0   71.6  199.5    8.5
      72.21.220.149
      72.21.220.29
      72.21.220.125
      72.21.220.37
      72.21.220.61
      72.21.220.2
      72.21.220.69
  15. 72.21.222.33     10.5%   525    55   73.4   71.5   87.1    1.5
      205.251.245.65
      72.21.222.149
      72.21.222.35
      72.21.220.29
      72.21.222.131
      72.21.222.147
      72.21.220.37
  16. 205.251.245.65   93.9%   525   492   73.1   72.2   76.2    1.2
      72.21.222.35
      72.21.222.131
  17. ???
  18. ???
  19. 216.182.224.79   13.5%   524    71   77.9   72.4  101.2    5.4
      216.182.224.81
      216.182.224.95
      216.182.224.77
  20. 216.182.224.81   94.1%   524   492   77.9   72.8   93.0    6.3
      216.182.224.95
      216.182.224.77
  21. ???
 The 140.222.225.135 shows up in the traceroutes via our Verizon
 Business FIOS line as well.
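
 As a reading aid for the report above, here is a minimal Python sketch
 of two pieces of arithmetic. It assumes Loss% is simply Drop divided by
 Snt, and it applies the common rule of thumb that loss at an
 intermediate hop only matters if it persists at the hops behind it,
 since routers often rate-limit the ICMP replies mtr relies on. The
 function names are made up for illustration, and the rule of thumb
 assumes every probe follows the same path, which the multiple addresses
 at hop 9 and later suggest is not strictly true here.

```python
def loss_percent(sent, dropped):
    """Loss% as reported per hop: dropped probes over sent probes."""
    return round(100.0 * dropped / sent, 1)


def carried_loss(hop_loss):
    """For each hop, the smallest loss seen at that hop or any later hop.

    Loss that really occurs on the forward path up to a hop should also
    show up at every hop behind it, so this minimum is an upper bound on
    the loss that can be blamed on the path up to that hop.
    """
    result = []
    running_min = float("inf")
    # Walk from the far end back toward the source, tracking the minimum.
    for loss in reversed(hop_loss):
        running_min = min(running_min, loss)
        result.append(running_min)
    result.reverse()
    return result


# Hop 3 of the report: 526 probes sent, 197 dropped.
print(loss_percent(526, 197))  # 37.5

# Loss% for hops 1 through 7 of the report.
print(carried_loss([0.0, 0.0, 37.5, 0.0, 10.3, 41.5, 0.2]))
# [0.0, 0.0, 0.0, 0.0, 0.2, 0.2, 0.2]
```

 On these numbers the 37.5% at 140.222.225.135 does not carry through:
 hop 4 answered every probe.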

 Will be opening a ticket with both Verizon and AWS to assist, but
 hoping someone out there can take a look or chime in.  Feel free to
 reply off list.

 Thanks,
 Ray



Re: Erroneous Leap Second Introduced at 2014-06-30 23:59:59 UTC

2014-07-01 Thread Tim Heckman
On Mon, Jun 30, 2014 at 7:27 PM, Majdi S. Abbas m...@latt.net wrote:
 On Mon, Jun 30, 2014 at 05:33:52PM -0700, Tim Heckman wrote:
 I was just alerted that one of the systems I manage had a time skew
 greater than 100ms from its NTP sources. Upon further investigation it
 seemed that the time was off by almost exactly 1 second.

 Looking back over our NTP monitoring, it would appear that this system
 had a large time adjustment at approximately 00:00 UTC:

 Okay.  Do you have any logging configured (peerstats, etc?) for
 ntpd?

Our systems all have loopstats and peerstats logging enabled. I have
those log files available if interested. However, when I searched over
the files I wasn't able to find anything that seemed to indicate this
was the peer who told the system to introduce a leap second. That
said, I might just not know what to look for in the logs.

 A few of our systems did alert early this morning, indicating they
 were going to be receiving a leap second today. However, I was unable
 to determine the exact cause for NTP believing a leap second should be
 added. And after some time a few of the systems were no longer
 indicating that a leap second would be introduced.

 This can happen if a server is either passing along a leap
 notification that it received, or is configured to use a leapseconds
 file that is incorrect.

Correct, I was hoping to determine which peer it was so I can reach
out to them to make sure this doesn't bleed in to the pool at the end
of the year. I was also more-or-less curious how wide-spread of an
issue this was, but I'm starting to think I may have been the only
person to catch it in the act. :)

 This specific system is hosted in AWS US-WEST-2C and uses the
 0.amazon.pool.ntp.org pool.

 0 is just one server in the pool (whichever you draw by
 rotation); is this the only server you have configured?

We use 0.amazon.pool.ntp.org, 1.amazon.pool.ntp.org, and
2.amazon.pool.ntp.org. As with the other widely-used pool hostnames,
each of these is a round-robin DNS entry with 4 hosts and a TTL of
150s.
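
One way to watch the rotation described above is to resolve a pool name
repeatedly and compare the address sets. This is a minimal Python sketch
using only socket.getaddrinfo from the standard library; the function
names are hypothetical. The stub resolver does not expose the TTL, so
the 150s figure cannot be checked this way.

```python
import socket


def unique_addresses(addrinfo):
    """Return the distinct IP addresses in a getaddrinfo() result, sorted."""
    # Each entry is (family, type, proto, canonname, sockaddr); the address
    # is the first element of sockaddr for both IPv4 and IPv6.
    return sorted({entry[4][0] for entry in addrinfo})


def pool_members(hostname, port=123):
    """Resolve one pool hostname and return the addresses handed out now."""
    info = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_UDP)
    return unique_addresses(info)
```

Calling pool_members("0.amazon.pool.ntp.org") a few minutes apart
should, if the rotation works as described, return different sets of
addresses over time.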

 --msa

Thank you for getting back to me.

Cheers!
-Tim


Re: Erroneous Leap Second Introduced at 2014-06-30 23:59:59 UTC

2014-07-01 Thread Tim Heckman
On Tue, Jul 1, 2014 at 12:35 PM, Majdi S. Abbas m...@latt.net wrote:
 On Tue, Jul 01, 2014 at 12:20:12PM -0700, Tim Heckman wrote:
 Our systems all have loopstats and peerstats logging enabled. I have
 those log files available if interested. However, when I searched over
 the files I wasn't able to find anything that seemed to indicate this
 was the peer who told the system to introduce a leap second. That
 said, I might just not know what to look for in the logs.

 Look at the status word in peerstats; if the high bit is
 set, that's your huckleberry.

 See: http://www.eecis.udel.edu/~mills/ntp/html/decode.html
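
The check suggested above could be scripted roughly like this. It is a
minimal Python sketch, assuming the usual peerstats record layout (MJD
day, seconds past midnight, peer address, status word in hex, offset,
delay, dispersion, jitter). The sample line and the function names are
invented for illustration, and what the bit means for a given ntpd
version is still best confirmed against the decode page.

```python
def parse_peerstats_line(line):
    """Split one peerstats record into named fields.

    Assumed field order: MJD day, seconds past midnight, peer address,
    status word (hex), offset, delay, dispersion, jitter.
    """
    day, second, address, status, offset, delay, dispersion, jitter = line.split()[:8]
    return {
        "day": int(day),
        "second": float(second),
        "address": address,
        "status": int(status, 16),  # status word is written in hex
        "offset": float(offset),
        "delay": float(delay),
        "dispersion": float(dispersion),
        "jitter": float(jitter),
    }


def high_bit_set(status):
    """True if the most significant bit of the 16-bit status word is set."""
    return bool(status & 0x8000)


# Invented sample record, not taken from a real log.
sample = "56838 86399.123 192.0.2.10 9614 0.000123 0.045 0.003 0.001"
record = parse_peerstats_line(sample)
print(record["address"], high_bit_set(record["status"]))
```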

I've taken a look at all of the peerstats available for this host, and
surprisingly none of them are showing code 09 (leap_armed). I'm also
fairly certain that I know when some of my systems armed the leap
second (within a 60-120s window) based on our monitoring. Around those
times everything seems normal according to peerstats. Looking at

I am running Ubuntu 10.04 on this box, which ships ntp v4.2.4p8. I'll
need to look into whether the printing of this flag was added later;
otherwise, it would seem some of my systems picked up a phantom leap
second from an unknown source with one of them actually executing it.

Thanks for the decoder ring. My Google-fu wasn't hitting the right keywords.

 Correct, I was hoping to determine which peer it was so I can reach
 out to them to make sure this doesn't bleed in to the pool at the end
 of the year. I was also more-or-less curious how wide-spread of an
 issue this was, but I'm starting to think I may have been the only
 person to catch it in the act. :)

 You might want to upgrade to current 4.2.7 development code,
 wherein a majority rule is used to qualify the leap indicator.
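
To illustrate what a majority rule means here, a minimal Python sketch
follows. It is not ntpd's code, only the idea: a leap announcement is
acted on only when more than half of the peers agree. The strict
more-than-half threshold and the function name are assumptions.

```python
def leap_qualified(peer_leap_flags):
    """Return True only if a strict majority of peers announce a leap second."""
    flags = list(peer_leap_flags)
    if not flags:
        # No peers, no evidence: never arm a leap second.
        return False
    announcing = sum(1 for flag in flags if flag)
    return announcing * 2 > len(flags)


# One peer out of four announcing a leap is not enough under this rule.
print(leap_qualified([True, False, False, False]))  # False
print(leap_qualified([True, True, False]))          # True
```

Under a rule like this, a single peer with a bad leapseconds file could
not arm a leap second on its own.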

We're going to be doing some system refreshes soon, so that may
be something we'll need to look at. I didn't realize this was
happening as part of the 4.2.7 development branch. Definitely an
interesting feature, especially after this. :p

 Cheers,

 --msa

Thanks again, Majdi.

Cheers!
-Tim


Erroneous Leap Second Introduced at 2014-06-30 23:59:59 UTC

2014-06-30 Thread Tim Heckman
Hey Everyone,

I was just alerted that one of the systems I manage had a time skew
greater than 100ms from its NTP sources. Upon further investigation it
seemed that the time was off by almost exactly 1 second.
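
A monitoring check along these lines might separate ordinary skew from
an offset suspiciously close to one whole second. This is a minimal
Python sketch: the 100ms alert threshold comes from the alert described
above, while the 50ms tolerance around one second, the labels, and the
function name are assumptions.

```python
def classify_offset(offset_s, alert_threshold_s=0.100, leap_tolerance_s=0.050):
    """Label a clock offset (in seconds) relative to the NTP sources."""
    magnitude = abs(offset_s)
    if magnitude <= alert_threshold_s:
        return "ok"
    # An offset within the tolerance of exactly one second hints at a
    # leap second applied on one side but not the other.
    if abs(magnitude - 1.0) <= leap_tolerance_s:
        return "possible leap second"
    return "skewed"


print(classify_offset(-0.998))  # possible leap second
```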

Looking back over our NTP monitoring, it would appear that this system
had a large time adjustment at approximately 00:00 UTC:

- http://puu.sh/9Rs6O/a514ad7c97.png (times are in Pacific in these
graphs, sorry about that)

A few of our systems did alert early this morning, indicating they
were going to be receiving a leap second today. However, I was unable
to determine the exact cause for NTP believing a leap second should be
added. And after some time a few of the systems were no longer
indicating that a leap second would be introduced.

This specific system is hosted in AWS US-WEST-2C and uses the
0.amazon.pool.ntp.org pool.

Has anyone else seen any erroneous leap seconds being added to their system?

Cheers!
-Tim Heckman


Packets dropped passing from Qwest to Verizon

2011-11-15 Thread Tim Heckman
Hello,

I'm looking for a POC at Qwest (AS209) or Verizon (AS701) to help 
diagnose what looks like a stale bogon filter.  The packets drop where Qwest 
(63.146.26.210) peers with Verizon (152.63.2.130).  
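
A bogon filter goes stale when it keeps dropping a prefix that has since
been allocated. Checking a source address against a copy of a filter
list could look like this minimal Python sketch. The prefixes below are
documentation ranges standing in for a real filter list, and the
function name is hypothetical.

```python
import ipaddress


def matching_filter(address, filtered_prefixes):
    """Return the first filtered prefix containing address, or None."""
    addr = ipaddress.ip_address(address)
    for prefix in filtered_prefixes:
        network = ipaddress.ip_network(prefix)
        # Only compare like with like (IPv4 vs IPv6).
        if addr.version == network.version and addr in network:
            return prefix
    return None


# Placeholder filter list built from documentation prefixes.
stale_list = ["198.51.100.0/24", "203.0.113.0/24"]
print(matching_filter("203.0.113.7", stale_list))  # 203.0.113.0/24
print(matching_filter("192.0.2.1", stale_list))    # None
```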

Thanks in advance!

Regards,
Tim H.