Re: Recent trouble with QUIC?

2015-09-28 Thread Cody Grosskopf
I care about the application layer.


ps. nice work on oxidized!

On Sun, Sep 27, 2015 at 2:16 PM, Saku Ytti  wrote:

> On 25 September 2015 at 16:20, Ca By  wrote:
>
> Hey,
>
> > I remained very disappointed in how google has gone about quic.
> >
> > They are dismissive of network operators concerns (quic protocol list and
> > ietf), cause substantial outages, and have lost a lot of good will in the
> > process
> >
> > Here's your post mortem:
> >
> > RFO: Google unilaterally deployed a non-standard protocol to our
> production
> > environment, driving up helpdesk calls x%
> >
> > After action: block udp 80/443 until production ready and standard
> ratified
> > use deployed.
>
> I find this attitude sad. Internet is about freedom. Google is using
> standard IP and standard UDP over Internet, we, the network engineers
> shouldn't care about application layer. Lot of companies run their own
> protocols on top of TCP and UDP and there is absolutely nothing wrong
> with that. Saying this shouldn't happen and if it does, those packets
> should be dropped is same as saying innovation shouldn't happen.
> Getting new IETF standard L4 protocol will take lot of time, and will
> be much easier if we first have experience on using it, rather than
> build standard and then hope it works without having actual data about
> it.
>
> QUIC, MinimaLT and other options for new PKI based L4 protocol are
> very welcome. They offer compelling benefits
> - mobility, IP address is not your identity (say hello to 'mosh' like
> behaviour for all applications)
> - encryption for all applications
> - helps with buffer bloat (BW estimation and packet pacing)
> - helps with performance/congestion (packet loss estimation and FEC
> for redundant data, so dropped packet can be reconstructed by
> receiver)
> - fixes amplification (response is smaller than request)
> - helps with DoS (proof of work) (QUIC does not have this)
> - low latency session establishment (Especially compared to TLS/HTTP)
>
> I'm sure I've omitted many others.
>
>
> --
>   ++ytti
>


Re: Recent trouble with QUIC?

2015-09-27 Thread Saku Ytti
On 27 September 2015 at 18:38, Lyle Giese  wrote:

> Part of freedom is to minimize the harm and I think that is where the
> parties replying to this thread diverge.  A broken change that causes harm
> should have/could have been tested better before releasing it to the public
> on the Internet.
>
> Or if a bad release is let loose on the Internet, how does Google minimize
> the harm?

How would this be any different from Google introducing a TCP-related
issue in their frontend servers? This is not a protocol issue, this is
a QA issue that could impact arbitrary technology. I'd like to say I've
not broken stuff by misunderstanding the impact of my changes, but
unfortunately I can't.

-- 
  ++ytti


Re: Recent trouble with QUIC?

2015-09-27 Thread Matthew Kaufman
Maybe Google should return the money you paid for access to their search engine 
and associated free applications during the time it was down.

Matthew Kaufman

(Sent from my iPhone)

> On Sep 27, 2015, at 6:38 PM, Lyle Giese  wrote:
> 
> 
> 
>> On 09/27/15 16:16, Saku Ytti wrote:
>> On 25 September 2015 at 16:20, Ca By  wrote:
>> 
>> Hey,
>> 
>>> I remained very disappointed in how google has gone about quic.
>>> 
>>> They are dismissive of network operators concerns (quic protocol list and
>>> ietf), cause substantial outages, and have lost a lot of good will in the
>>> process
>>> 
>>> Here's your post mortem:
>>> 
>>> RFO: Google unilaterally deployed a non-standard protocol to our production
>>> environment, driving up helpdesk calls x%
>>> 
>>> After action: block udp 80/443 until production ready and standard ratified
>>> use deployed.
>> 
>> I find this attitude sad. Internet is about freedom. Google is using
>> standard IP and standard UDP over Internet, we, the network engineers
>> shouldn't care about application layer. Lot of companies run their own
>> protocols on top of TCP and UDP and there is absolutely nothing wrong
>> with that. Saying this shouldn't happen and if it does, those packets
>> should be dropped is same as saying innovation shouldn't happen.
>> Getting new IETF standard L4 protocol will take lot of time, and will
>> be much easier if we first have experience on using it, rather than
>> build standard and then hope it works without having actual data about
>> it.
>> 
>> QUIC, MinimaLT and other options for new PKI based L4 protocol are
>> very welcome. They offer compelling benefits
>> - mobility, IP address is not your identity (say hello to 'mosh' like
>> behaviour for all applications)
>> - encryption for all applications
>> - helps with buffer bloat (BW estimation and packet pacing)
>> - helps with performance/congestion (packet loss estimation and FEC
>> for redundant data, so dropped packet can be reconstructed by
>> receiver)
>> - fixes amplification (response is smaller than request)
>> - helps with DoS (proof of work) (QUIC does not have this)
>> - low latency session establishment (Especially compared to TLS/HTTP)
>> 
>> I'm sure I've omitted many others.
> 
> There are advantages to QUIC or Google would not be trying to work on it and 
> implement it.
> 
> The problem is that it has been added to a popular application(Chrome) which 
> many/most end users know little to nothing about QUIC and what the 
> implications are when a version in Chrome is defective and harmful to the 
> Internet.
> 
> Part of freedom is to minimize the harm and I think that is where the parties 
> replying to this thread diverge.  A broken change that causes harm should 
> have/could have been tested better before releasing it to the public on the 
> Internet.
> 
> Or if a bad release is let loose on the Internet, how does Google minimize 
> the harm?
> 
> Lyle Giese
> LCR Computer Services, Inc.


Re: Recent trouble with QUIC?

2015-09-27 Thread Lyle Giese



On 09/27/15 16:16, Saku Ytti wrote:
> On 25 September 2015 at 16:20, Ca By wrote:
>
> Hey,
>
>> I remained very disappointed in how google has gone about quic.
>>
>> They are dismissive of network operators concerns (quic protocol list and
>> ietf), cause substantial outages, and have lost a lot of good will in the
>> process
>>
>> Here's your post mortem:
>>
>> RFO: Google unilaterally deployed a non-standard protocol to our production
>> environment, driving up helpdesk calls x%
>>
>> After action: block udp 80/443 until production ready and standard ratified
>> use deployed.
>
> I find this attitude sad. Internet is about freedom. Google is using
> standard IP and standard UDP over Internet, we, the network engineers
> shouldn't care about application layer. Lot of companies run their own
> protocols on top of TCP and UDP and there is absolutely nothing wrong
> with that. Saying this shouldn't happen and if it does, those packets
> should be dropped is same as saying innovation shouldn't happen.
> Getting new IETF standard L4 protocol will take lot of time, and will
> be much easier if we first have experience on using it, rather than
> build standard and then hope it works without having actual data about
> it.
>
> QUIC, MinimaLT and other options for new PKI based L4 protocol are
> very welcome. They offer compelling benefits
> - mobility, IP address is not your identity (say hello to 'mosh' like
> behaviour for all applications)
> - encryption for all applications
> - helps with buffer bloat (BW estimation and packet pacing)
> - helps with performance/congestion (packet loss estimation and FEC
> for redundant data, so dropped packet can be reconstructed by
> receiver)
> - fixes amplification (response is smaller than request)
> - helps with DoS (proof of work) (QUIC does not have this)
> - low latency session establishment (Especially compared to TLS/HTTP)
>
> I'm sure I've omitted many others.




There are advantages to QUIC or Google would not be trying to work on it 
and implement it.


The problem is that it has been added to a popular application (Chrome),
and many/most end users know little to nothing about QUIC or what the
implications are when a version in Chrome is defective and harmful to
the Internet.


Part of freedom is to minimize the harm and I think that is where the 
parties replying to this thread diverge.  A broken change that causes 
harm should have/could have been tested better before releasing it to 
the public on the Internet.


Or if a bad release is let loose on the Internet, how does Google 
minimize the harm?


Lyle Giese
LCR Computer Services, Inc.


Re: Recent trouble with QUIC?

2015-09-27 Thread Saku Ytti
On 25 September 2015 at 16:20, Ca By  wrote:

Hey,

> I remained very disappointed in how google has gone about quic.
>
> They are dismissive of network operators concerns (quic protocol list and
> ietf), cause substantial outages, and have lost a lot of good will in the
> process
>
> Here's your post mortem:
>
> RFO: Google unilaterally deployed a non-standard protocol to our production
> environment, driving up helpdesk calls x%
>
> After action: block udp 80/443 until production ready and standard ratified
> use deployed.

I find this attitude sad. Internet is about freedom. Google is using
standard IP and standard UDP over Internet, we, the network engineers
shouldn't care about application layer. Lot of companies run their own
protocols on top of TCP and UDP and there is absolutely nothing wrong
with that. Saying this shouldn't happen and if it does, those packets
should be dropped is same as saying innovation shouldn't happen.
Getting new IETF standard L4 protocol will take lot of time, and will
be much easier if we first have experience on using it, rather than
build standard and then hope it works without having actual data about
it.

QUIC, MinimaLT and other options for new PKI based L4 protocol are
very welcome. They offer compelling benefits
- mobility, IP address is not your identity (say hello to 'mosh' like
behaviour for all applications)
- encryption for all applications
- helps with buffer bloat (BW estimation and packet pacing)
- helps with performance/congestion (packet loss estimation and FEC
for redundant data, so dropped packet can be reconstructed by
receiver)
- fixes amplification (response is smaller than request)
- helps with DoS (proof of work) (QUIC does not have this)
- low latency session establishment (Especially compared to TLS/HTTP)

I'm sure I've omitted many others.
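
As a rough illustration of the FEC point in the list above, here is a
minimal sketch of how one redundant packet lets a receiver rebuild a
dropped one. It assumes a simple XOR parity over a group of equal-length
packets; the function names are hypothetical and this is not a
description of QUIC's actual wire format.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets: List[bytes]) -> bytes:
    """Build one parity packet covering a group of equal-length packets."""
    return reduce(xor_bytes, packets)


def recover(received: List[Optional[bytes]], parity: bytes) -> List[bytes]:
    """Rebuild at most one lost packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("one XOR parity packet can repair only one loss")
    repaired = list(received)
    if missing:
        present = [p for p in received if p is not None]
        # parity XOR all surviving packets leaves exactly the lost packet
        repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)
# The middle packet is dropped in transit and rebuilt from the parity.
print(recover([group[0], None, group[2]], parity) == group)
```

The cost is one extra packet per group, and a second loss in the same
group still needs a retransmission.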


-- 
  ++ytti


Re: Recent trouble with QUIC?

2015-09-27 Thread Alan Buxey
Yes.  Next gen firewalls stop that kind of game  ;)

alan


Re: Recent trouble with QUIC?

2015-09-26 Thread Dovid Bender
I forgot who it was but I think it was a uni network. As an ISP, everything
should be allowed; as an end network, you want to CYA.

Much like the hospital I was just at that had free wifi. Only ports 80 and 443 
over tcp were allowed. That's when having ssh on 443 so you can proxy for alt 
ports really helps.


--Original Message--
From: James Bensley
Sender: NANOG
To: NANOG Operators' Group
Subject: Re: Recent trouble with QUIC?
Sent: Sep 26, 2015 10:54

On 26 September 2015 at 08:20, Mike Hale  wrote:
> OH SNAP!

Tiny Rick!!!

Regards,

Dovid


Re: Recent trouble with QUIC?

2015-09-26 Thread James Bensley
On 26 September 2015 at 08:20, Mike Hale  wrote:
> OH SNAP!

Tiny Rick!!!


Re: Recent trouble with QUIC?

2015-09-26 Thread Mike Hale
OH SNAP!

On Fri, Sep 25, 2015 at 10:07 PM, Matthew Kaufman  wrote:
>
>
> On 9/25/15 5:43 PM, Stephen Satchell wrote:
>>
>> On 09/25/2015 04:20 PM, Ca By wrote:
>>>
>>> RFO: Google unilaterally deployed a non-standard protocol to our
>>> production
>>> environment, driving up helpdesk calls x%
>>>
>>> After action: block udp 80/443 until production ready and standard
>>> ratified
>>> use deployed.
>>
>>
>> Let me be gentle about this.  Why were you allowing 80/udp and 443/udp in
>> the first place into your production environment?
>>
>
> Which ISP do you run that blocks UDP by default? I'm curious, so I can be
> sure I don't buy mislabeled "Internet" service from you.
>
> Matthew Kaufman



-- 
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0


Re: Recent trouble with QUIC?

2015-09-25 Thread Matthew Kaufman



On 9/25/15 5:43 PM, Stephen Satchell wrote:
> On 09/25/2015 04:20 PM, Ca By wrote:
>> RFO: Google unilaterally deployed a non-standard protocol to our
>> production environment, driving up helpdesk calls x%
>>
>> After action: block udp 80/443 until production ready and standard
>> ratified use deployed.
>
> Let me be gentle about this.  Why were you allowing 80/udp and 443/udp
> in the first place into your production environment?




Which ISP do you run that blocks UDP by default? I'm curious, so I can 
be sure I don't buy mislabeled "Internet" service from you.


Matthew Kaufman


Re: Recent trouble with QUIC?

2015-09-25 Thread Sean Hunter
These are all interesting viewpoints.

Personally, I was only surprised that Google didn't:

A) identify the issue during early rollout (starting Sept 9) when Google
has specifically talked up to the community their tooling for monitoring
QUIC changes

B) catch what seems like a pretty basic bug during Chrome code reviews

C) identify the problem more quickly once they realized that *something*
was wrong (I guess another tooling issue)

D) roll back more quickly (though perhaps identification was really the
delaying factor here)

I do find the anecdote about support amusing, though. Google has always
resisted providing support of any kind; I think it's a culture that comes
from their extremely strong engineering history where needing support is
viewed as a failure of the engineering and product teams.

Recovery times could probably be improved if they had a help desk, but I'm
not sure customer satisfaction would be improved in any way that adds
significant value.

The lesson I walked away with is that if you don't want QUIC on your
network, don't allow it. At my institution, I think we view this the same
way we'd view a problem with any website; we're only responsible for making
sure your packets are flowing out to the internet and back.

Finally, thanks to all who responded. It's been an informative experience.
On Sep 25, 2015 7:45 PM, "Stephen Satchell"  wrote:

> On 09/25/2015 04:20 PM, Ca By wrote:
>
>> RFO: Google unilaterally deployed a non-standard protocol to our
>> production
>> environment, driving up helpdesk calls x%
>>
>> After action: block udp 80/443 until production ready and standard
>> ratified
>> use deployed.
>>
>
> Let me be gentle about this.  Why were you allowing 80/udp and 443/udp in
> the first place into your production environment?
>
> In my network, I run a mostly-closed firewall, only allowing those ports
> that are needed to be forwarded between the inside and outside networks.
>
> I don't have -- or need -- a DMZ here at this time, so I don't have to
> worry about that side of the routing triangle.  If I did, I would also run
> mostly closed between inside/outside and the DMZ.
>
> I'm liberal about opening ports on request, but the ports have to be
> requested before I'll allow them in, out, or forwarded.
>


Re: Recent trouble with QUIC?

2015-09-25 Thread Stephen Satchell

On 09/25/2015 04:20 PM, Ca By wrote:
> RFO: Google unilaterally deployed a non-standard protocol to our production
> environment, driving up helpdesk calls x%
>
> After action: block udp 80/443 until production ready and standard ratified
> use deployed.


Let me be gentle about this.  Why were you allowing 80/udp and 443/udp 
in the first place into your production environment?


In my network, I run a mostly-closed firewall, only allowing those ports 
that are needed to be forwarded between the inside and outside networks.


I don't have -- or need -- a DMZ here at this time, so I don't have to 
worry about that side of the routing triangle.  If I did, I would also 
run mostly closed between inside/outside and the DMZ.


I'm liberal about opening ports on request, but the ports have to be 
requested before I'll allow them in, out, or forwarded.
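
The mostly-closed policy described above can be sketched as a
default-deny allowlist. This is only a model of the idea, with a
hypothetical class and method names; it is not the configuration of any
real firewall product.

```python
from typing import Set, Tuple

Rule = Tuple[str, str, int]  # (direction, protocol, port)


class MostlyClosedFirewall:
    """Default-deny: traffic passes only if a matching rule was requested."""

    DIRECTIONS = {"in", "out", "forward"}

    def __init__(self) -> None:
        self._allowed: Set[Rule] = set()

    def open_port(self, direction: str, proto: str, port: int) -> None:
        """Record an explicitly requested opening."""
        if direction not in self.DIRECTIONS:
            raise ValueError(f"unknown direction: {direction}")
        self._allowed.add((direction, proto, port))

    def permits(self, direction: str, proto: str, port: int) -> bool:
        """Anything not explicitly opened is dropped."""
        return (direction, proto, port) in self._allowed


fw = MostlyClosedFirewall()
fw.open_port("out", "tcp", 443)        # requested: HTTPS over TCP
print(fw.permits("out", "tcp", 443))   # opened on request
print(fw.permits("out", "udp", 443))   # never requested, so dropped
```

Under such a policy, UDP to port 443 stays closed simply because nobody
asked for it.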


Re: Recent trouble with QUIC?

2015-09-25 Thread chris
This reminds me of something I ran into where I came to a similar
conclusion.

We had a customer who used google ad and docs products very heavily and all
of a sudden they started getting captchas on accessing any google property.

When we reached out to google we were told that they were "blacklisted"
based on suspicious search queries or some kind of query manipulation that
they believe was caused by malware.
We searched high and low internally and could not find anything. We asked
them to provide specifics about what they saw, and they would not. When we
then tried to monitor network traffic, we realized that Google had just
implemented SSL search as a default, so we could not easily inspect the
search traffic without putting in infrastructure that could do MITM (and
we suspected that doing this could have serious blowback).

At the end of the day the customer was extremely frustrated because they
used Google Apps for their entire business and Google insisted it was on
their end, but we could not get any factual evidence and we would have had
to do some really questionable things to try to debug it on our
own.

TLDR, customer eventually bailed on all their google products because it
scared them and reaching a human at google through regular channels was
near impossible except through mazes of filling out forms and waiting 24hrs
per email response. Even when we were able to connect with a fellow
googler on nanog who tried to be helpful even though he wasn't on the right
team, we still got nowhere.

This is really the dark side of the "cloud" (no pun intended), when a
company makes some kind of change or an event occurs with no communication
and it backfires. Even the most basic advanced notifications or just having
proper support available when a change occurs can be more important than
the technical aspects.

chris

On Fri, Sep 25, 2015 at 7:20 PM, Ca By  wrote:

> On Friday, September 25, 2015, Cody Grosskopf wrote:
>
> > a) yes, 56,000 students and any on Chrome failed. I immediately blocked
> > quic and told users to restart Chrome. Luckily the fallback to good ol'
> tcp
> > saved the day.
> >
> > b) I had this issue a few months ago and it subsided quickly
> >
> > Google reports it's an issue in this version of Chrome and the next
> version
> > will have a little smarts to automatically re initiate the connection
> with
> > TCP automatically without having to disable quic.
> >
> >
> I remained very disappointed in how google has gone about quic.
>
> They are dismissive of network operators concerns (quic protocol list and
> ietf), cause substantial outages, and have lost a lot of good will in the
> process
>
> Here's your post mortem:
>
> RFO: Google unilaterally deployed a non-standard protocol to our production
> environment, driving up helpdesk calls x%
>
> After action: block udp 80/443 until production ready and standard ratified
> use deployed.
>
> And.
>
> Get off my lawn.
>
>
>
> On Wed, Sep 23, 2015 at 5:01 PM, Sean Hunter  wrote:
> >
> > > Hi all,
> > >
> > > I work for a 2500 user university and we've seen some odd behavior
> > > recently. 2-4 weeks ago we started seeing Google searches that would
> fail
> > > for ~2 minutes, or disconnects in Gmail briefly. This week, and
> > > particularly in the last 2-3 days, we've had reports from numerous
> users
> > on
> > > campus, even those who generally do not complain unless an issue has
> been
> > > ongoing for a while. Those reports include Drive disconnecting,
> searches
> > > failing, Gmail presenting a "007" error, and calendar failing to create
> > > events.
> > >
> > > In fact, the issue became so widespread today, that the campus paper is
> > > writing about it as a last minute article before their weekly
> > > publication's deadline this evening. (Important in our little world
> where
> > > we try to look good.)
> > >
> > > We aren't really staffed or equipped to figure out exactly what's
> > happening
> > > (and issues are sporadic, so packet captures are difficult, to say the
> > > least), but we found that disabling QUIC dramatically and immediately
> > > improved the experience of a couple of users on campus. We're
> > recommending
> > > via the paper that others do so as well.
> > >
> > > What I'm curious about is:
> > >
> > > a) Has anyone here had a similar experience? Was the root cause QUIC in
> > > your case?
> > >
> > > b) Has anyone noticed anything remotely similar in the last few
> > > weeks/days/today?
> > >
> > > We're an Apps domain, so this may be specific to universities in the
> Apps
> > > universe.
> > >
> > > If anyone has any useful information or hints, or if someone from
> Google
> > > would like more information, please feel free to contact me, on or off
> > > list.
> > >
> > > Thanks for reading and have a great night everyone! Happy Wednesday!
> > >
> >
>


Recent trouble with QUIC?

2015-09-25 Thread Ca By
On Friday, September 25, 2015, Cody Grosskopf wrote:

> a) yes, 56,000 students and any on Chrome failed. I immediately blocked
> quic and told users to restart Chrome. Luckily the fallback to good ol' tcp
> saved the day.
>
> b) I had this issue a few months ago and it subsided quickly
>
> Google reports it's an issue in this version of Chrome and the next version
> will have a little smarts to automatically re initiate the connection with
> TCP automatically without having to disable quic.
>
>
I remain very disappointed in how google has gone about quic.

They are dismissive of network operators' concerns (quic protocol list and
ietf), cause substantial outages, and have lost a lot of good will in the
process

Here's your post mortem:

RFO: Google unilaterally deployed a non-standard protocol to our production
environment, driving up helpdesk calls x%

After action: block udp 80/443 until production ready and standard ratified
use deployed.

And.

Get off my lawn.



On Wed, Sep 23, 2015 at 5:01 PM, Sean Hunter  wrote:
>
> > Hi all,
> >
> > I work for a 2500 user university and we've seen some odd behavior
> > recently. 2-4 weeks ago we started seeing Google searches that would fail
> > for ~2 minutes, or disconnects in Gmail briefly. This week, and
> > particularly in the last 2-3 days, we've had reports from numerous users
> on
> > campus, even those who generally do not complain unless an issue has been
> > ongoing for a while. Those reports include Drive disconnecting, searches
> > failing, Gmail presenting a "007" error, and calendar failing to create
> > events.
> >
> > In fact, the issue became so widespread today, that the campus paper is
> > writing about it as a last minute article before their weekly
> > publication's deadline this evening. (Important in our little world where
> > we try to look good.)
> >
> > We aren't really staffed or equipped to figure out exactly what's
> happening
> > (and issues are sporadic, so packet captures are difficult, to say the
> > least), but we found that disabling QUIC dramatically and immediately
> > improved the experience of a couple of users on campus. We're
> recommending
> > via the paper that others do so as well.
> >
> > What I'm curious about is:
> >
> > a) Has anyone here had a similar experience? Was the root cause QUIC in
> > your case?
> >
> > b) Has anyone noticed anything remotely similar in the last few
> > weeks/days/today?
> >
> > We're an Apps domain, so this may be specific to universities in the Apps
> > universe.
> >
> > If anyone has any useful information or hints, or if someone from Google
> > would like more information, please feel free to contact me, on or off
> > list.
> >
> > Thanks for reading and have a great night everyone! Happy Wednesday!
> >
>


Re: Recent trouble with QUIC?

2015-09-25 Thread Cody Grosskopf
a) yes, 56,000 students and any on Chrome failed. I immediately blocked
quic and told users to restart Chrome. Luckily the fallback to good ol' tcp
saved the day.

b) I had this issue a few months ago and it subsided quickly

Google reports it's an issue in this version of Chrome and the next version
will have a little smarts to automatically re initiate the connection with
TCP automatically without having to disable quic.

On Wed, Sep 23, 2015 at 5:01 PM, Sean Hunter  wrote:

> Hi all,
>
> I work for a 2500 user university and we've seen some odd behavior
> recently. 2-4 weeks ago we started seeing Google searches that would fail
> for ~2 minutes, or disconnects in Gmail briefly. This week, and
> particularly in the last 2-3 days, we've had reports from numerous users on
> campus, even those who generally do not complain unless an issue has been
> ongoing for a while. Those reports include Drive disconnecting, searches
> failing, Gmail presenting a "007" error, and calendar failing to create
> events.
>
> In fact, the issue became so widespread today, that the campus paper is
> writing about it as a last minute article before their weekly
> publication's deadline this evening. (Important in our little world where
> we try to look good.)
>
> We aren't really staffed or equipped to figure out exactly what's happening
> (and issues are sporadic, so packet captures are difficult, to say the
> least), but we found that disabling QUIC dramatically and immediately
> improved the experience of a couple of users on campus. We're recommending
> via the paper that others do so as well.
>
> What I'm curious about is:
>
> a) Has anyone here had a similar experience? Was the root cause QUIC in
> your case?
>
> b) Has anyone noticed anything remotely similar in the last few
> weeks/days/today?
>
> We're an Apps domain, so this may be specific to universities in the Apps
> universe.
>
> If anyone has any useful information or hints, or if someone from Google
> would like more information, please feel free to contact me, on or off
> list.
>
> Thanks for reading and have a great night everyone! Happy Wednesday!
>


Re: Recent trouble with QUIC?

2015-09-24 Thread Todd Underwood
This has now been resolved. See the recent post by Ian Swett in a separate
thread about QUIC.

T
On Sep 24, 2015 1:12 AM, "Mike Meredith"  wrote:

> On Wed, 23 Sep 2015 19:01:19 -0500, Sean Hunter 
> may have written:
> > a) Has anyone here had a similar experience? Was the root cause QUIC
> > in your case?
>
> Yes. No; in our case our firewall (a PA5060 running PANOS6.1.3 at the
> time) was allowing some QUIC packets through, but not others. As it was
> newly deployed at the time, it was soon blamed :-\
>
> > b) Has anyone noticed anything remotely similar in the last few
> > weeks/days/today?
>
> Only because I enabled QUIC within Chrome on our test network to verify
> that it was still a problem.
>
> > We're an Apps domain, so this may be specific to universities in the
> > Apps universe.
>
> As are we.
>
> --
> Mike Meredith, University of Portsmouth
> Principal Systems Engineer, Hostmaster, Security, and Timelord!
>
>


Re: Recent trouble with QUIC

2015-09-24 Thread Ian Swett via NANOG
Hi,

I'm an engineer working on QUIC at Google.

Sorry for the delayed response. This was an issue with QUIC traffic (from
Chrome users) to Google for some users behind NATs. The issue started at
small volumes on 2015-09-09 when we started rolling out an optimization to
our frontends for QUIC users behind NATs, and then increased in magnitude
on 2015-09-22. We rolled back the feature by 2015-09-23 12:40 PDT, which
did clear all issues.

Here are the technical details:

As we have posted before[1], and as was pointed out earlier in this
thread, QUIC runs over UDP. For users behind a NAT, QUIC relies upon NATs
maintaining port bindings when the UDP path is in active use.  To ensure
that the NAT does not time out with outstanding requests, QUIC sends a
keepalive ping after 15 seconds.  To optimize for the case when a NAT times
out, we implemented a feature in our frontend to allow the port to migrate
on open QUIC connections -- the frontend would send a GOAWAY to the client
to indicate the connection should be re-established, but outstanding
requests could be completed.  This triggered a previously unknown bug in
Chrome where it tries to use an open QUIC connection that has already
received a GOAWAY for new requests.  Hence the failures.
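
The behaviour a client needs here can be sketched as follows: once a
connection has received GOAWAY, in-flight requests may finish on it, but
new requests must go to a fresh connection. This is a minimal sketch
with hypothetical class names, not Chrome's actual code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Connection:
    """Hypothetical stand-in for one multiplexed connection."""
    name: str
    going_away: bool = False
    in_flight: List[str] = field(default_factory=list)

    def on_goaway(self) -> None:
        # Peer asked for no new requests; in-flight ones may still finish.
        self.going_away = True


class Pool:
    """Hands out connections, skipping any that have received GOAWAY."""

    def __init__(self) -> None:
        self.connections: List[Connection] = []
        self._counter = 0

    def _open(self) -> Connection:
        self._counter += 1
        conn = Connection(name=f"conn-{self._counter}")
        self.connections.append(conn)
        return conn

    def send(self, request: str) -> Connection:
        # Reuse only connections that are still accepting new requests.
        for conn in self.connections:
            if not conn.going_away:
                conn.in_flight.append(request)
                return conn
        conn = self._open()
        conn.in_flight.append(request)
        return conn


pool = Pool()
first = pool.send("GET /mail")    # long-lived hanging GET
first.on_goaway()                 # server signals the connection is draining
second = pool.send("GET /drive")  # must not reuse the draining connection
print(first.name, second.name)
```

The failure described above corresponds to leaving out the
`going_away` check in `send`.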

As this happens only following a NAT rebinding, these failures were
isolated to networks with a lot of NAT rebindings on active connections.
This bug was particularly visible on connections to GMail and Drive because
they utilize hanging GETs which last multiple minutes.  All new requests
using timed out connections would have immediately failed as well.
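
To see why keepalives matter for a hanging GET, the sketch below checks
whether a NAT binding survives the idle gaps between packets. Only the
15-second keepalive comes from the description above; the 30-second NAT
timeout and the function name are assumptions chosen for illustration.

```python
from typing import List, Optional


def binding_survives(send_times: List[float], nat_timeout: float,
                     keepalive_after: Optional[float] = None) -> bool:
    """Return True if the NAT binding never sits idle for nat_timeout seconds.

    send_times: sorted, non-empty list of times (s) at which packets are sent.
    keepalive_after: if set, a ping is sent whenever the path has been idle
    for this many seconds.
    """
    last = send_times[0]
    for t in send_times[1:]:
        if keepalive_after is not None:
            # Fill the idle gap with pings spaced keepalive_after apart.
            while t - last > keepalive_after:
                if keepalive_after >= nat_timeout:
                    return False  # the ping itself arrives too late
                last += keepalive_after
        if t - last >= nat_timeout:
            return False  # binding expired before the next packet
        last = t
    return True


# A hanging GET that is silent for two minutes, behind a 30 s NAT timeout.
print(binding_survives([0, 120], nat_timeout=30))
print(binding_survives([0, 120], nat_timeout=30, keepalive_after=15))
```

A NAT with a timeout shorter than the keepalive interval would still
rebind, which is the case the rolled-back optimization was aimed at.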

After rolling back the feature, we have started working on fixing the
underlying issue and improving how we detect and avoid issues like this in
the future.

-- Ian Swett

[1] http://blog.chromium.org/2013/06/experimenting-with-quic.html


Re: Recent trouble with QUIC?

2015-09-24 Thread Mike Meredith
On Wed, 23 Sep 2015 19:01:19 -0500, Sean Hunter 
may have written:
> a) Has anyone here had a similar experience? Was the root cause QUIC
> in your case?

Yes. No; in our case our firewall (a PA5060 running PANOS6.1.3 at the
time) was allowing some QUIC packets through, but not others. As it was
newly deployed at the time, it was soon blamed :-\

> b) Has anyone noticed anything remotely similar in the last few
> weeks/days/today?

Only because I enabled QUIC within Chrome on our test network to verify
that it was still a problem. 

> We're an Apps domain, so this may be specific to universities in the
> Apps universe.

As are we.

-- 
Mike Meredith, University of Portsmouth
Principal Systems Engineer, Hostmaster, Security, and Timelord!
 


Re: Recent trouble with QUIC?

2015-09-23 Thread Roland Dobbins


On 24 Sep 2015, at 7:01, Sean Hunter wrote:

> If anyone has any useful information or hints


I wonder if large-scale QoS and/or ACLing being done at some ISP edges 
in response to UDP reflection/amplification attacks may be a factor?


It's not very smart of those working on QUIC to've thrown it into the 
UDP cesspit, precisely because of the possibility of this sort of thing.


I have zero evidence this is what's taking place in the OP's case, mind - 
but it's something to investigate, and ought to be at least somewhat 
inferable via packet captures and/or flow telemetry analysis.


---
Roland Dobbins 


Re: Recent trouble with QUIC?

2015-09-23 Thread Ca By
On Wednesday, September 23, 2015, Sean Hunter  wrote:

> Hi all,
>
> I work for a 2500 user university and we've seen some odd behavior
> recently. 2-4 weeks ago we started seeing Google searches that would fail
> for ~2 minutes, or disconnects in Gmail briefly. This week, and
> particularly in the last 2-3 days, we've had reports from numerous users on
> campus, even those who generally do not complain unless an issue has been
> ongoing for a while. Those reports include Drive disconnecting, searches
> failing, Gmail presenting a "007" error, and calendar failing to create
> events.
>
> In fact, the issue became so widespread today, that the campus paper is
> writing about it as a last minute article before their weekly
> publication's deadline this evening. (Important in our little world where
> we try to look good.)
>
> We aren't really staffed or equipped to figure out exactly what's happening
> (and issues are sporadic, so packet captures are difficult, to say the
> least), but we found that disabling QUIC dramatically and immediately
> improved the experience of a couple of users on campus. We're recommending
> via the paper that others do so as well.
>
> What I'm curious about is:
>
> a) Has anyone here had a similar experience? Was the root cause QUIC in
> your case?
>
> b) Has anyone noticed anything remotely similar in the last few
> weeks/days/today?
>
> We're an Apps domain, so this may be specific to universities in the Apps
> universe.
>
> If anyone has any useful information or hints, or if someone from Google
> would like more information, please feel free to contact me, on or off
> list.
>
> Thanks for reading and have a great night everyone! Happy Wednesday!
>

I believe you can safely block UDP ports 443 and 80 outbound if you
need a solution that scales better.

This will trigger QUIC to fall back to TCP.
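
A minimal sketch of what such a block might look like on a Linux-based router, assuming iptables with the multiport match is in use. The helper name `quic_block_rules`, the `FORWARD` chain, and the choice of `REJECT` rather than `DROP` are illustrative assumptions; the code only builds and prints the command and does not apply it.

```python
def quic_block_rules(chain="FORWARD", ports=(80, 443)):
    """Build iptables argument lists that reject UDP to the given ports.

    Hypothetical helper: chain name and target are assumptions; adapt
    them to the actual firewall before use.
    """
    port_list = ",".join(str(p) for p in ports)
    return [
        ["iptables", "-A", chain, "-p", "udp",
         "-m", "multiport", "--dports", port_list,
         "-j", "REJECT"],
    ]


# Print the commands for review instead of executing them.
for rule in quic_block_rules():
    print(" ".join(rule))
```

TCP 80/443 is left untouched, so clients that cannot complete a QUIC handshake can still reach the same services over TCP.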

CB


Re: Recent trouble with QUIC?

2015-09-23 Thread Benson Schliesser
Hi, Sean.

I had precisely this experience, mostly noticed just in the past day or so.
I assumed it was an effect of the firewall/NAT setup that my corporate IT
network has implemented, because it often is a culprit in these kinds of
situations... But noticing that it was only for QUIC connections to Google
(I likewise use Apps hosted email) I just turned off QUIC in Chrome and the
problem went away.

I have no real insight about the issue beyond that. Except to say that the
packet captures I performed were not very useful to me, personally,
because encrypted QUIC traffic isn't very revealing in Wireshark. :) Though
I may simply be missing some clue and/or skill in making sense of it.

I guess I'm glad to hear that others such as yourself are seeing the same
problem. Because now I don't have to harass my corporate IT dept. Instead
we can look forward to speculation about Google, QUIC adoption in the near
future, etc.

Cheers,
-Benson


On Wednesday, September 23, 2015, Sean Hunter wrote:

> Hi all,
>
> I work for a 2500 user university and we've seen some odd behavior
> recently. 2-4 weeks ago we started seeing Google searches that would fail
> for ~2 minutes, or disconnects in Gmail briefly. This week, and
> particularly in the last 2-3 days, we've had reports from numerous users on
> campus, even those who generally do not complain unless an issue has been
> ongoing for a while. Those reports include Drive disconnecting, searches
> failing, Gmail presenting a "007" error, and calendar failing to create
> events.
>
> In fact, the issue became so widespread today, that the campus paper is
> writing about it as a last-minute article before their weekly
> publication's deadline this evening. (Important in our little world where
> we try to look good.)
>
> We aren't really staffed or equipped to figure out exactly what's happening
> (and issues are sporadic, so packet captures are difficult, to say the
> least), but we found that disabling QUIC dramatically and immediately
> improved the experience of a couple of users on campus. We're recommending
> via the paper that others do so as well.
>
> What I'm curious about is:
>
> a) Has anyone here had a similar experience? Was the root cause QUIC in
> your case?
>
> b) Has anyone noticed anything remotely similar in the last few
> weeks/days/today?
>
> We're an Apps domain, so this may be specific to universities in the Apps
> universe.
>
> If anyone has any useful information or hints, or if someone from Google
> would like more information, please feel free to contact me, on or off
> list.
>
> Thanks for reading and have a great night everyone! Happy Wednesday!
>


Recent trouble with QUIC?

2015-09-23 Thread Sean Hunter
Hi all,

I work for a 2500 user university and we've seen some odd behavior
recently. 2-4 weeks ago we started seeing Google searches that would fail
for ~2 minutes, or disconnects in Gmail briefly. This week, and
particularly in the last 2-3 days, we've had reports from numerous users on
campus, even those who generally do not complain unless an issue has been
ongoing for a while. Those reports include Drive disconnecting, searches
failing, Gmail presenting a "007" error, and calendar failing to create
events.

In fact, the issue became so widespread today, that the campus paper is
writing about it as a last-minute article before their weekly
publication's deadline this evening. (Important in our little world where
we try to look good.)

We aren't really staffed or equipped to figure out exactly what's happening
(and issues are sporadic, so packet captures are difficult, to say the
least), but we found that disabling QUIC dramatically and immediately
improved the experience of a couple of users on campus. We're recommending
via the paper that others do so as well.

What I'm curious about is:

a) Has anyone here had a similar experience? Was the root cause QUIC in
your case?

b) Has anyone noticed anything remotely similar in the last few
weeks/days/today?

We're an Apps domain, so this may be specific to universities in the Apps
universe.

If anyone has any useful information or hints, or if someone from Google
would like more information, please feel free to contact me, on or off list.

Thanks for reading and have a great night everyone! Happy Wednesday!