Re: optimizing TLS time to first byte

2014-02-12 Thread Ilya Grigorik
On Wed, Feb 12, 2014 at 8:48 AM, Willy Tarreau  wrote:

> Hi Ilya,
>
> On Wed, Feb 12, 2014 at 08:36:20AM +0100, Willy Tarreau wrote:
> > > One last set of follow-up questions on configuration and defaults:
> > > - we allow the user to tune buffer sizes - that's great.
> > > - we allow the user to adjust record sizes: assuming the above logic is
> > > in place, can we change the default record size to start small?
> >
> > I'd rather not do it, at least now. The optimal small size will depend
> > on the MSS and most likely on the ciphers. I'd fear that with a default
> > small size, some users would experience a nasty behaviour with something
> > like two small packets and a third almost empty one. When you send that
> > to certain Windows hosts, you can be subject to a 200ms delayed-ACK pause
> > even if the last segment contains a PUSH flag. This could cause more
> > questions here on the list. I'd rather document it or post some articles
> > showing the difference in performance based on such settings, just like
> > you do all the time. After all, it's a global setting, so it's not hard
> > to set once and for all. Maybe if in the long run we see everybody set it
> > to a similar value, we'll finally change the default setting. What I
> > can do, however, is add a build setting to force the default value,
> > just like we do with the buffer size. That way you can update your
> > package and deploy an "optimal-by-default" version :-)
>
> OK so I've done all this. Now you can set the default SSL maxrecord to
> a smaller one at build time using DEFAULT_SSL_MAX_RECORD. Similarly,
> you can set the idle timer using "tune.idletimer" in the global section,
> it defaults to 1 second, and you can change this default at build time
> using DEFAULT_IDLE_TIMER.
>
> All of this was just pushed.
>

Woohoo! Big kudos to Emeric and yourself for all the hard work here.
Looking forward to seeing this out in the wild! :-)
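
For anyone who wants to try the new knobs, here is a rough sketch of how they
might be used; the values are illustrative only, and the exact names and units
should be checked against the documentation:

    # haproxy.cfg, global section (illustrative values)
    global
        tune.ssl.maxrecord 1400   # small initial record size (value used in the tests above)
        tune.idletimer 1000       # idle time before streamer flags reset (assumed to be in ms)

    # or bake the defaults in at build time (hypothetical make invocation;
    # DEFAULT_IDLE_TIMER units assumed to match tune.idletimer)
    make TARGET=linux2628 USE_OPENSSL=1 \
         DEFINE="-DDEFAULT_SSL_MAX_RECORD=1400 -DDEFAULT_IDLE_TIMER=1000"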

ig


Re: optimizing TLS time to first byte

2014-02-11 Thread Ilya Grigorik
Hey Willy. This is great work, thanks!

On Sat, Feb 8, 2014 at 11:49 PM, Willy Tarreau  wrote:
>
> The observations and analysis are interesting, and the choice of solutions
> is not easy. I think that relying on a pause only is dangerous because
> there are generally a number of losses when delivering to clients, causing
> pauses during the transfers, and we wouldn't want to reset the window in
> this case; TCP congestion control already handles this.
>

Right, that makes sense.

I reran the same test page against SPDY3.1 and HTTP backends:

SPDY/3.1
- http://www.webpagetest.org/result/140211_KF_2257217df0d1fe65353b8e8aa415de9b/3/details/
- http://cloudshark.org/captures/c8eab73137e5?filter=tcp.stream%3D%3D1
-- packet #271: new request after 1.2s idle timeout > record size is reset
and then ramps to 16K
-- packet #605: new request after 2.5s+ idle timeout > record size is reset

HTTP/1.1
- http://www.webpagetest.org/result/140211_DF_df85fe6c3e91c96bd38e6c8d1df77ceb/4/details/
  (uses 2 TCP connections)
-- http://cloudshark.org/captures/b17452deed52?filter=tcp.stream%3D%3D5
--- transfers three objects with 1s+ idle timeouts, and record size is
reset and ramps in each run
-- http://cloudshark.org/captures/b17452deed52?filter=tcp.stream%3D%3D9
--- transfers three objects: record size is reset on new request (no idle
timeout - see packet #287), and also after 1s+ timeout.

The only difference between SPDY and HTTP/1.1 cases here is that HAProxy
"understands" HTTP/1.1 and resets the record size on one of the connections
with back-to-back requests. As we discussed earlier in the thread, I think
this behavior makes sense in HTTP/1.1 land, so that's great! Even better,
with the new logic, the SPDY connection also works as advertised, with
record size resets when the buffer is empty and the idle timeout is reached.

In short, this looks awesome. :)

One last set of follow-up questions on configuration and defaults:
- we allow the user to tune buffer sizes - that's great.
- we allow the user to adjust record sizes: assuming the above logic is in
place, can we change the default record size to start small?
- should we expose the timeout as an additional config variable? I think
this is a reasonable knob to have. 1s could be a default value.

ig


On Sun, Feb 9, 2014 at 9:00 AM, Willy Tarreau  wrote:

> Hi Ilya,
>
> I've finished the change. It seems to do the right thing for me with
> HTTP, though I have not tested with SPDY.
>
> If a read happens after a pause of more than one second during which the
> output buffer was empty, we reset the streamer flags. Thus it covers the
> case where the client sends a new request, and does not reset the flags
> in case of occasional congestion.
>
> The delay is hard-coded here in stream_interface.c:si_conn_recv_cb() :
>
>     if ((chn->flags & (CF_STREAMER | CF_STREAMER_FAST)) &&
>         !chn->buf->o &&
>         (unsigned short)(now_ms - chn->last_read) >= 1000) {
>
> You may want to experiment with other values, though I'm not convinced
> it's worth going much lower if we don't want to hit some long RTTs on 3G
> during a slow start happening with initcwnd=2.
>
> I'm appending the 3 patches to be applied on top of current git (though
> they should apply to what you already have).
>
> Do not hesitate to suggest further improvements !
>
> Thanks,
> Willy
>
>
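
For readers without the tree handy, a self-contained sketch of the rule
described above, using simplified types and field names rather than haproxy's
actual code:

    #include <stdint.h>

    #define CF_STREAMER       0x1
    #define CF_STREAMER_FAST  0x2

    struct chan_sketch {
        uint32_t flags;         /* CF_STREAMER / CF_STREAMER_FAST */
        uint32_t out_bytes;     /* bytes still queued in the output buffer */
        uint16_t last_read_ms;  /* wrapping millisecond timestamp of last read */
    };

    /* On each read: if the output buffer was empty and nothing was read for
     * idle_ms, drop the streamer flags so the next response starts again with
     * small SSL records. Short pauses, or pauses with pending output (i.e.
     * congestion), leave the flags untouched. */
    static void maybe_reset_streamer(struct chan_sketch *c,
                                     uint16_t now_ms, uint16_t idle_ms)
    {
        if ((c->flags & (CF_STREAMER | CF_STREAMER_FAST)) &&
            c->out_bytes == 0 &&
            (uint16_t)(now_ms - c->last_read_ms) >= idle_ms)
            c->flags &= ~(CF_STREAMER | CF_STREAMER_FAST);
    }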


Re: optimizing TLS time to first byte

2014-02-08 Thread Ilya Grigorik
On Thu, Feb 6, 2014 at 11:03 PM, Willy Tarreau  wrote:

> > Gotcha, thanks. As a follow-up question, is it possible for me to control
> > the size of the read buffer?
>
> Yes, in the global section, you can set :
>
>   - tune.bufsize : size of the buffer
>   - tune.maxrewrite : reserve at the end of the buffer which is left
> untouched when receiving HTTP headers
>
> So during the headers phase, the buffer is considered full with
> (bufsize-maxrewrite) bytes. After that, it's bufsize only.
>

Perfect - thanks.
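
In config terms, the above might look like the following sketch (values are
purely illustrative, not the shipped defaults):

    global
        tune.bufsize    16384   # total per-channel buffer size
        tune.maxrewrite 1024    # reserve kept free while HTTP headers are parsed

    # effective space during the headers phase: 16384 - 1024 = 15360 bytes;
    # once the headers are processed, the full 16384 bytes are usable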


> > > So if we're in this situation, this will be enough to reset the
> > > CF_STREAMER flag (2 consecutive incomplete reads). I think it would be
> > > worth testing it.
> > > A very simple way to test it in your environment would be to chain two
> > > instances, one in TCP mode deciphering, and one in HTTP mode.
> > >
> >
> > That's clever. I think for a realistic test we'd need a SPDY backend
> > though, since that's the only way we can actually get the multiplexed
> > streams flowing in parallel.
>
> Yes it would be interesting to know how it behaves.
>

Ok, I have the following setup: client -> haproxy (npn + tcp proxy) ->
spdylay (spdy 3.1 without TLS).

- WPT run (100ms RTT / 3Mbps down):
http://www.webpagetest.org/result/140208_DM_57a2a0feaf3258b93d7e3ce3c802b278/4/details/
- tcpdump:
http://cloudshark.org/captures/666f2481eafa?filter=tcp.stream%3D%3D1
- Page loads two images > onload event > 1s later loads one image > 0.4s
later loads another image > 3.6s later loads last image.

All resources are loaded over the same SPDY connection (TLS terminated by
HAProxy, and SPDY-sans-tls by spdylay :)), and all assets are static
assets. As expected, the session starts with 1400-byte records, then gets
bumped to ~16K records and continues using 16K records until the end of the
session -- since all resources are static, there are no gaps between
HEADERS and DATA frames and we never hit the case of two incomplete
reads... This is somewhat suboptimal: ideally we'd reset the record size
when delivering the first image after the 1s idle pause following onload,
and again when delivering the last image following the 3s+ idle pause.

Now, let's imagine that HAProxy "understood" SPDY and had knowledge of the
individual streams (instead of running in tcp mode): it seems like we
*wouldn't* want the logic of new stream > reset record size to apply for
SPDY connections. Reset on new request makes sense in HTTP/1.1 mode since
everything is serialized and we're using multiple connections (although
even here this strategy can be suboptimal if we have back-to-back requests
on same connection and low RTT), but when we have many multiplexed streams
with SPDY, this behavior would lead to a lot of unnecessary resets - e.g.
multiple streams in flight, record size is at 16K, and a new stream is
initiated and resets the record size for everyone.

I'm back to wondering if the incomplete read strategy is the best approach
to take here. It seems like it wouldn't work out that well for SPDY /
HTTP/2 even if HAProxy understood those protocols... unless a lot more
smarts were added for tracking parallel in-flight streams, etc. Instead, it
seems like a connection idle-timeout strategy would be much simpler and
would work uniformly well across HTTP/1, HTTP/2 (and any other protocol for
that matter) even without understanding any of them?
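
To make the comparison concrete, here is a rough sketch of the two reset
policies being weighed; all names and types are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    enum proto { PROTO_HTTP1, PROTO_MULTIPLEXED /* SPDY, HTTP/2, ... */ };

    struct conn_sketch {
        enum proto proto;
        bool     new_request;   /* HTTP/1.x: a fresh transaction just started */
        bool     output_empty;  /* nothing left queued for the client */
        uint32_t last_read_ms;  /* time of the last read */
    };

    /* Per-request resets suit serialized HTTP/1.1, but with multiplexed
     * streams a single new stream would shrink records for every other
     * stream in flight. An idle-based reset only fires once the connection
     * has truly gone quiet, so it behaves the same for any protocol. */
    static bool should_reset_record_size(const struct conn_sketch *c,
                                         uint32_t now_ms, uint32_t idle_ms)
    {
        if (c->proto == PROTO_HTTP1 && c->new_request)
            return true;
        return c->output_empty && now_ms - c->last_read_ms >= idle_ms;
    }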

What do you guys think? Am I overlooking anything?

ig


Re: optimizing TLS time to first byte

2014-02-06 Thread Ilya Grigorik
> > (a) It's not clear to me how the threshold upgrade is determined? What
> > triggers the record size bump internally?
>
> The forwarding mechanism does two things :
>   - the read side counts the number of consecutive iterations that
> read() filled the whole receive buffer. After 3 consecutive times,
> it considers that it's a streaming transfer and sets the flag
> CF_STREAMER on the communication channel.
>
>   - after 2 incomplete reads, the flag disappears.
>
>   - the send side detects the number of times it can send the whole
> buffer at once. It sets CF_STREAMER_FAST if it can flush the
> whole buffer 3 times in a row.
>
>   - after 2 incomplete writes, the flag disappears.
>
> I preferred to only rely on CF_STREAMER and ignore the _FAST variant
> because it would only favor high bandwidth clients (it's used to
> enable splice() in fact). But I thought that CF_STREAMER alone would
> do the right job. And your WPT test seems to confirm this, when we
> look at the bandwidth usage!
>
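
A compact, self-contained sketch of that counting heuristic (simplified names;
the real logic lives in haproxy's forwarding path):

    #include <stddef.h>
    #include <stdint.h>

    #define CF_STREAMER 0x1

    struct rx_sketch {
        uint32_t flags;
        int full_reads;   /* consecutive reads that filled the whole buffer */
        int short_reads;  /* consecutive reads that did not */
    };

    /* Called after each read: 3 full-buffer reads in a row mark the channel
     * as streaming, while 2 incomplete reads in a row clear the flag again.
     * The write side / CF_STREAMER_FAST variant follows the same pattern. */
    static void account_read(struct rx_sketch *c, size_t bytes_read,
                             size_t buf_size)
    {
        if (bytes_read == buf_size) {
            c->short_reads = 0;
            if (++c->full_reads >= 3)
                c->flags |= CF_STREAMER;
        } else {
            c->full_reads = 0;
            if (++c->short_reads >= 2)
                c->flags &= ~CF_STREAMER;
        }
    }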

Gotcha, thanks. As a follow-up question, is it possible for me to control
the size of the read buffer?


> > (b) If I understood your earlier comment correctly, HAProxy will
> > automatically begin each new request with small record size... when it
> > detects that it's a new request.
>
> Indeed. In HTTP mode, it processes transactions (request+response), not
> connections, and each new transaction starts in a fresh state where these
> flags are cleared.


Awesome.


> > This works great if we're talking to a backend in "http" mode: we parse
> > the HTTP/1.x protocol and detect when a new request is being processed,
> > etc. However, what if I'm using HAProxy to terminate TLS (+alpn
> > negotiate) and then route the data to a "tcp" mode backend.. which is my
> > spdy / http/2 server talking over a non-encrypted channel.
>
> Ah good point. I *suspect* that in practice it will work because :
>
>   - the last segment of the first transfer will almost always be
>     incomplete (you don't always transfer exact multiples of the buffer
>     size) ;
>   - the first response for the next request will almost always be
>     incomplete (headers and not all data)
>

Ah, clever. To make this more interesting, say we have multiple streams in
flight: the frames may be interleaved and some streams may finish sooner
than others, but since multiple are in flight, chances are we'll be able to
fill the read buffer until the last stream completes... which is exactly
what we want: we wouldn't want to reset the window at the end of each
stream, but only when the connection goes quiet!


> So if we're in this situation, this will be enough to reset the CF_STREAMER
> flag (2 consecutive incomplete reads). I think it would be worth testing
> it.
> A very simple way to test it in your environment would be to chain two
> instances, one in TCP mode deciphering, and one in HTTP mode.
>

That's clever. I think for a realistic test we'd need a SPDY backend
though, since that's the only way we can actually get the multiplexed
streams flowing in parallel.


> > In this instance this logic wouldn't work, since HAProxy doesn't
> > have any knowledge or understanding of spdy / http/2 streams -- we'd
> > start the entire connection with small records, but then eventually
> > upgrade it to 16KB and keep it there, correct?
>
> It's not kept, it really depends on the transfer sizes all along. It
> matches more or less what you explained at the beginning of this thread,
> but based on transfer sizes at the lower layers.


Yep, this makes sense now - thanks.


> > Any clever solutions for this? And on that note, are there future plans
> > to add "http/2" smarts to HAProxy, such that we can pick apart different
> > streams within a session, etc?
>
> Yes, I absolutely want to implement HTTP/2, but it will be time-consuming
> and we won't have this for 1.5 at all. I also don't want to implement SPDY
> or too-early releases of 2.0, just because whatever we do will take a lot
> of time. Haproxy is a low-level component, and each protocol adaptation is
> expensive to do. Not as expensive as what people have to do with ASICs,
> but still harder than what some other products can do by using a small lib
> to perform the abstraction.
>

Makes sense, and great to hear!


> One of the huge difficulties we'll face will be to manage multiple streams
> over one connection. I think it will change the current paradigm of how
> requests are instantiated (which already started). From the very first
> version, we instantiated one "session" upon accept(), and this session
> contains buffers on which analyzers are plugged. The HTTP parsers are
> such analyzers. All the states and counters are stored at the session
> level. In 1.5, we started to change a few things. A connection is
> instantiated upon accept, then the session allocated after the connection
> is initialized (eg: SSL handshake complete). But splitting the sessions
> between

Re: optimizing TLS time to first byte

2014-02-05 Thread Ilya Grigorik
This is looking very promising! I created a simple page which loads a large
image (~1.5MB), then onload fires, and after a wait of about 5s, another
image is fetched. All the assets are fetched over the same TCP connection.

- Sample WPT run:
http://www.webpagetest.org/result/140206_R2_0eab5be9abebd600c17f199158782114/3/details/
- tcpdump trace:
http://cloudshark.org/captures/5092d680b992?filter=tcp.stream%3D%3D4

All requests begin with 1440-byte records (configured as
tune.ssl.maxrecord=1400), and then get bumped to 16KB - awesome.

A couple of questions:

(a) It's not clear to me how the threshold upgrade is determined? What
triggers the record size bump internally?
(b) If I understood your earlier comment correctly, HAProxy will
automatically begin each new request with small record size... when it
detects that it's a new request. This works great if we're talking to a
backend in "http" mode: we parse the HTTP/1.x protocol and detect when a
new request is being processed, etc. However, what if I'm using HAProxy to
terminate TLS (+alpn negotiate) and then route the data to a "tcp" mode
backend.. which is my spdy / http/2 server talking over a non-encrypted
channel. In this instance this logic wouldn't work, since HAProxy doesn't
have any knowledge or understanding of spdy / http/2 streams -- we'd start
the entire connection with small records, but then eventually upgrade it to
16KB and keep it there, correct?

Any clever solutions for this? And on that note, are there future plans to
add "http/2" smarts to HAProxy, such that we can pick apart different
streams within a session, etc?

ig


On Sun, Feb 2, 2014 at 12:32 AM, Willy Tarreau  wrote:

> Hi Ilya,
>
> On Sat, Feb 01, 2014 at 11:33:50AM -0800, Ilya Grigorik wrote:
> > Hi Eric.
> >
> > 0001-MINOR-ssl-handshake-optimz-for-long-certificate-chai: works great!
> > After applying this patch the full cert is sent in one RTT and without
> any
> > extra pauses. [1]
>
> Cool, I'm impressed to see that the SSL time has been divided by 3! Great
> suggestion from you on this, thank you! I'll merge this one now.
>
> > 0002-MINOR-ssl-Set-openssl-max_send_fragment-using-tune.s: I'm testing
> with
> > / against openssl 1.0.1e, and it seems to work. Looking at the tcpdump,
> the
> > packets look identical to previous runs without this patch. [2]
> >
> > Any thoughts on dynamic sizing? ;)
>
> OK I've implemented it and tested it with good success. I'm seeing several
> small packets at the beginning, then large ones.
>
> However in order to do this I am not using Emeric's 0002 patch, because we
> certainly don't want to change the fragment size from the SSL stack upon
> every ssl_write() in dynamic mode, so I'm back to the initial principle of
> just moderating the buffer size. By using tune.ssl.maxrecord 2859, I'm
> seeing a few series of two segments of 1448 and 1437 bytes respectively,
> then larger ones up to 14-15kB that are coalesced by TSO on the NIC.
>
> It seems to do what we want :-)
>
> I'm attaching the patches if you're interested in trying it. However you'll
> have to revert patch 0002.
>
> Thanks for your tests and suggestions!
> Willy
>
>
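
To make the dynamic-sizing idea above concrete, here is a rough sketch with
made-up names; as described, the real change simply moderates how much data is
handed to each SSL write rather than reconfiguring the OpenSSL fragment size:

    #include <stddef.h>

    #define CF_STREAMER     0x1
    #define TLS_RECORD_MAX  16384   /* maximum TLS record payload */

    /* Cap each SSL write at the configured small record size until the channel
     * has been flagged as streaming, then let records grow toward the 16kB
     * maximum (which TSO may further coalesce into larger frames on the NIC). */
    static size_t ssl_write_limit(unsigned int flags, size_t pending,
                                  size_t small_record)
    {
        size_t max = (flags & CF_STREAMER) ? TLS_RECORD_MAX : small_record;
        return pending < max ? pending : max;
    }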


Re: optimizing TLS time to first byte

2014-02-01 Thread Ilya Grigorik
Hi Eric.

0001-MINOR-ssl-handshake-optimz-for-long-certificate-chai: works great!
After applying this patch the full cert is sent in one RTT and without any
extra pauses. [1]
0002-MINOR-ssl-Set-openssl-max_send_fragment-using-tune.s: I'm testing with
/ against openssl 1.0.1e, and it seems to work. Looking at the tcpdump, the
packets look identical to previous runs without this patch. [2]

Any thoughts on dynamic sizing? ;)

P.S. Great stuff, thanks for looking into this!

[1]
http://www.webpagetest.org/result/140201_2X_03511ec63344f442b81c24d2bf39f59d/3/details/
[2]
http://www.webpagetest.org/result/140201_5D_67ac1ec2a4eec0bd84da3ee91a235ea5/5/details/


On Tue, Jan 28, 2014 at 7:31 AM, Emeric Brun  wrote:

> On 01/28/2014 03:58 PM, Emeric Brun wrote:
>
>>
>> Hi Ilya,
>>
>>
>>>
>>> Ah, interesting. Doing a bit more digging on this end, I see
>>> "SSL_set_max_send_fragment", albeit that's from back in 2005. Is that
>>> what you guys are looking at?
>>> https://github.com/openssl/openssl/commit/566dda07ba16f9d3b9774fd5c8d526d7cc93f179
>>>
>>>
>> Yes, that's it! It appears in openssl 1.0.0.
>>
>
> Attached is another patch to test SSL_set_max_send_fragment.
>
> Regards,
> Emeric
>
>
>
>


Re: optimizing TLS time to first byte

2014-01-17 Thread Ilya Grigorik
On Fri, Jan 17, 2014 at 9:50 AM, Willy Tarreau  wrote:

> > Yup, that sounds like an interesting strategy. The only thing to note is
> > that you should consider resetting the record size after some idle
> timeout
> > -- same logic as slow-start after idle.
>
> We wouldn't even need this because the only reason for observing an idle
> period is that there's a new request/response cycle, and buffer flags are
> reset upon each new request so the "streamer" flag will automatically
> disappear.
>

Gotcha, makes sense.


> > For changing the actual record size, I don't believe there is much more
> > to it than just changing how much you write into the record vs. number
> > of records you emit: last patch for tuning record size accomplishes this
> > by setting a max on "try" byte count.
>
> Indeed, but Emeric has found something interesting. It seems there's a
> tunable in recent OpenSSL versions ("mtu" or something like this) with
> which you can adjust how many bytes are sent over the wire in a single
> record. So it might end up being even more precise than playing with the
> max on "try". We'll see.
> Anyway your suggestion is very interesting and merits some experimentation!
> Lowering time to first byte is always something useful to get users happy!


Ah, interesting. Doing a bit more digging on this end, I see
"SSL_set_max_send_fragment", albeit that's from back in 2005. Is that what
you guys are looking at?
https://github.com/openssl/openssl/commit/566dda07ba16f9d3b9774fd5c8d526d7cc93f179

ig


Re: optimizing TLS time to first byte

2014-01-17 Thread Ilya Grigorik
Hey Willy.

On Fri, Jan 17, 2014 at 2:49 AM, Willy Tarreau  wrote:
>
>  >
> > (1) Certificates that exceed 4KB require an extra RTT even with IW10: HA
> > ships the first 4KB then pauses and waits for client ACK before proceeding
> > to send the remainder of the certificate. At a minimum, this results in an
> > extra handshake RTT. You can see this in action here:
> >
> > WPT run:
> > http://www.webpagetest.org/result/140116_VW_3bd95a5cfb7e667498ef13b59639b9bf/2/details/
> >
> > tcpdump:
> > http://www.webpagetest.org/result/140116_VW_3bd95a5cfb7e667498ef13b59639b9bf/2.cap
> >
> >
> > I believe this is the same exact issue as fixed in nginx here:
> >
> https://github.com/nginx/nginx/commit/e52bddaaa90e64b2291f6e58ef1a2cff71604f6a#diff-0584d16332cf0d6dd9adb990a3c76a0cR539
>
> Useful stuff, indeed. With the larger certs these days, maybe we should
> try to improve things.
>

The 4K+ case is a fairly common occurrence, so it would definitely be worth
the effort.

Firefox telemetry data for "plaintext bytes read before a server
certificate authenticated":
http://telemetry.mozilla.org/#aurora/28/SSL_BYTES_BEFORE_CERT_CALLBACK

The median is hovering right around 4K, which tells me that there would be
a lot of instances where we'd hit the extra RTT (ouch).

Older data (from 2010) from Qualys points to a similar distribution (see
slide 27):
http://blog.ivanristic.com/Qualys_SSL_Labs-State_of_SSL_2010-v1.6.pdf

The other gotcha here is that Windows users get a double whammy (as can be
seen in the WPT trace), due to the delayed-ACK penalty (another 200ms) on
the client. Long story short, definitely worth fixing. :)


> > -
> >
> > (2) HA allows you to tune max_record_size - yay. That said, using a
> > static value introduces an inherent tradeoff between latency and
> > throughput -- smaller records are good for latency, but hurt server
> > throughput by adding bytes and CPU overhead. It would be great if we
> > could implement a smarter strategy:
> >
> > http://mailman.nginx.org/pipermail/nginx-devel/2013-December/004703.html
> > http://mailman.nginx.org/pipermail/nginx-devel/2014-January/004748.html
> >
> > Would love to hear any thoughts on this. The advantage of the above
> > strategy is that it can give you good performance out-of-the-box and
> > without sacrificing performance for different kinds of traffic (which is
> > especially relevant for a proxy...).
>
> I think we have an option here : haproxy already performs permanent
> observation of what the traffic looks like (streaming or interactive
> traffic). It uses this to decide to enable splicing or not. But it should
> be usable as well for SSL.
> The principle is quite simple : it counts the number of consecutive times
> it sees full buffer reads and writes to determine that it's forwarding
> streaming content. Thus I think we could use that information to decide
> to enlarge the SSL buffers. I have no idea how these buffers are enlarged
> nor how much it costs to do that operation in the middle of traffic, but
> this is probably something we could experiment with.
>
> I'll discuss this with Emeric who knows SSL much better than me.
>

Yup, that sounds like an interesting strategy. The only thing to note is
that you should consider resetting the record size after some idle timeout
-- same logic as slow-start after idle. For changing the actual record
size, I don't believe there is much more to it than just changing how much
you write into the record vs. number of records you emit: last patch for
tuning record size accomplishes this by setting a max on "try" byte count.

ig


optimizing TLS time to first byte

2014-01-16 Thread Ilya Grigorik
Hey all.

I've spent some time looking into HAProxy (1.5-dev21) + TLS performance and
stumbled across a few areas where I think we could make some improvements.
In particular I'm interested in time to first byte, as that's a critical
piece for interactive traffic: time to first paint in browsers, time to
first frame for video, etc.

-

(1) Certificates that exceed 4KB require an extra RTT even with IW10: HA
ships the first 4KB then pauses and waits for client ACK before proceeding
to send the remainder of the certificate. At a minimum, this results in an
extra handshake RTT. You can see this in action here:

WPT run:
http://www.webpagetest.org/result/140116_VW_3bd95a5cfb7e667498ef13b59639b9bf/2/details/

tcpdump:
http://www.webpagetest.org/result/140116_VW_3bd95a5cfb7e667498ef13b59639b9bf/2.cap


I believe this is the same exact issue as fixed in nginx here:
https://github.com/nginx/nginx/commit/e52bddaaa90e64b2291f6e58ef1a2cff71604f6a#diff-0584d16332cf0d6dd9adb990a3c76a0cR539

-

(2) HA allows you to tune max_record_size - yay. That said, using a static
value introduces an inherent tradeoff between latency and throughput --
smaller records are good for latency, but hurt server throughput by adding
bytes and CPU overhead. It would be great if we could implement a smarter
strategy:

http://mailman.nginx.org/pipermail/nginx-devel/2013-December/004703.html
http://mailman.nginx.org/pipermail/nginx-devel/2014-January/004748.html

Would love to hear any thoughts on this. The advantage of the above strategy is
that it can give you good performance out-of-the-box and without
sacrificing performance for different kinds of traffic (which is especially
relevant for a proxy...).

Ilya