[ 
https://issues.apache.org/jira/browse/TS-3085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sudheer Vinukonda updated TS-3085:
----------------------------------
            Description: 
We ran into a production issue where large POSTs (30MB or larger) fail 
over slower connections after the ats5 rollout (the problem is easily 
reproduced using Charles proxy with throttling enabled). 

Further debugging isolated the issue to uploads over SSL connections, and 
the cause appears to be the following:

ATS calls SSL_read() followed by SSL_get_error() to check whether the read hit 
an error. This is repeated until either the complete data is read or an error 
occurs. However, the OpenSSL documentation says the current thread's error 
queue must be empty before SSL_read() + SSL_get_error(), so ERR_clear_error() 
should be called first to clear any leftover/garbage errors. It's not clear 
what might be leaving errors in the queue in a tight loop - possibly some new 
feature in ats5. In any case, calling ERR_clear_error() is a good idea, and 
adding it seems to resolve the POST failures.

Documentation from OpenSSL and some related notes on Stack Overflow:

https://www.openssl.org/docs/ssl/SSL_get_error.html

http://stackoverflow.com/questions/18179128/how-to-manage-the-error-queue-in-openssl-ssl-get-error-and-err-get-error


{code}
"SSL_get_error() returns a result code (suitable for the C ``switch''
statement) for a preceding call to SSL_connect(), SSL_accept(),
SSL_do_handshake(), SSL_read(), SSL_peek(), or SSL_write() on ssl. The value
returned by that TLS/SSL I/O function must be passed to SSL_get_error() in
parameter ret.

In addition to ssl and ret, SSL_get_error() inspects the current thread's
OpenSSL error queue. Thus, SSL_get_error() must be used in the same thread that
performed the TLS/SSL I/O operation, and no other OpenSSL function calls should
appear in between. The current thread's error queue must be empty before the
TLS/SSL I/O operation is attempted, or SSL_get_error() will not work reliably."

"SSL_get_error does not call ERR_get_error. So if you just call SSL_get_error,
the error stays in the queue.

You should be calling ERR_clear_error prior to ANY SSL-call(SSL_read, SSL_write
etc) that is followed by SSL_get_error, otherwise you may be reading an old
error that occurred previously in the current thread."
{code}


      Affects Version/s: 5.0.1
    Backport to Version: 5.1.1
          Fix Version/s: 5.2.0
               Assignee: Sudheer Vinukonda
                 Labels: yahoo  (was: )

The fix is really simple: call ERR_clear_error() before SSL_read(). I will 
investigate separately why/who is corrupting the error queue for this SSL 
connection.

> Large POSTs over (relatively) slower connections failing in ats5
> ----------------------------------------------------------------
>
>                 Key: TS-3085
>                 URL: https://issues.apache.org/jira/browse/TS-3085
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: SSL
>    Affects Versions: 5.0.1
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>              Labels: yahoo
>             Fix For: 5.2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
