Hi Sandeep, Knox Team, Need urgent help/pointers in debugging this issue as this is very critical for us. So request you to please check on the same.
Thanks & Regards, Rajat From: Rajat Goel <rajat.g...@thalesgroup.com> Reply to: "user@knox.apache.org" <user@knox.apache.org> Date: Thursday, 15 August 2019 at 12:02 AM To: "user@knox.apache.org" <user@knox.apache.org> Subject: Re: Knox jetty threads stuck and Seeing ConnectionPoolTimeoutException while proxying a custom service Hi Sandeep, I added Knox certificate as trusted certificate on my UI Client machine and "avax.net.ssl.SSLException: Received fatal alert: certificate_unknown" errors no longer appear. However, other issues such as ‘org.eclipse.jetty.websocket.api.UpgradeException’, ‘java.io.IOException: Service connectivity error’ are still coming and UI eventually slows down and hangs. On UpgradeException, I had posted in my other mail, copying snippet here, please check: <snip> On analysing the debug logs further, I see that Websocket connection is not being established properly. Websocket connection from Service Client UI to Knox is being established properly but from Knox to Service backend is failing with UpgradeException. Not sure why. One thing I noticed is that upgrade request from Knox to Service Backend for websocket connection HTTP Destination is coming in logs as ‘ws://rafint001-mgt-01.cloud.in.guavus.com:9443’ (only till hostname and port) 2019-08-11 04:43:33,208 DEBUG client.WebSocketClient (WebSocketClient.java:connect(372)) - connect websocket org.eclipse.jetty.websocket.jsr356.endpoints.EndpointInstance@3cd5ba15 to ws://rafint001-mgt-01.cloud.in.guavus.com:9443/ 2019-08-11 04:43:33,221 DEBUG component.ContainerLifeCycle (ContainerLifeCycle.java:addBean(347)) - HttpClient@aa3cf89{STARTED} added {HttpDestination[ws://rafint001-mgt-01.cloud.in.guavus.com:9443]@33202ded,queue=0,pool=null,MANAGED} 2019-08-11 04:43:33,221 DEBUG component.AbstractLifeCycle (AbstractLifeCycle.java:setStarting(185)) - starting HttpDestination[ws://rafint001-mgt-01.cloud.in.guavus.com:9443]@33202ded,queue=0,pool=null whereas from UI to Knox this comes as ‘https://rafint001-mgt-01.cloud.in.guavus.com:8443/gateway/default/pdie/_sock/443/qu5llmni/websocket’. 2019-08-11 04:43:33,060 DEBUG server.HttpChannel (HttpChannel.java:onRequest(691)) - REQUEST for //rafint001-mgt-01.cloud.in.guavus.com:8443/gateway/default/pdie/_sock/443/qu5llmni/websocket on HttpChannelOverHttp@7440214c{r=1,c=false,a=IDLE,uri=//rafint001-mgt-01.cloud.in.guavus.com:8443/gateway/default/pdie/_sock/443/qu5llmni/websocket,age=0} GET //rafint001-mgt-01.cloud.in.guavus.com:8443/gateway/default/pdie/_sock/443/qu5llmni/websocket HTTP/1.1 Host: rafint001-mgt-01.cloud.in.guavus.com:8443^M Connection: Upgrade^M Pragma: no-cache^M Cache-Control: no-cache^M User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36^M Upgrade: websocket^M Origin: https://rafint001-mgt-01.cloud.in.guavus.com:8443^M<https://rafint001-mgt-01.cloud.in.guavus.com:8443%5eM> Sec-WebSocket-Version: 13^M Accept-Encoding: gzip, deflate, br^M Accept-Language: en-GB,en-US;q=0.9,en;q=0.8^M Cookie: KNOXSESSIONID=node0d0mr30gaibvc1y10buxlzrvkz0.node0; DEFAULT_UI=NEW^M Sec-WebSocket-Key: A/FYAfIjeAwb4QqqO2pepg==^M Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits Does this look right ? If not, please provide any pointers on how to debug this further. </snip> Attaching a log file from a new instance run after fixing certificate_unknown error. Sorry about sending such big files. Please check. Regards, Rajat From: Rajat Goel <rajat.g...@thalesgroup.com> Reply to: "user@knox.apache.org" <user@knox.apache.org> Date: Wednesday, 14 August 2019 at 8:20 PM To: "user@knox.apache.org" <user@knox.apache.org> Subject: Re: Knox jetty threads stuck and Seeing ConnectionPoolTimeoutException while proxying a custom service Hi Sandeep, Thanks for the reply. I see that this exception coming for connections from UI Client to Knox Gateway: 2019-08-11 04:43:27,432 DEBUG ssl.SslConnection (SslConnection.java:handshakeFailed(764)) - handshake failed SslConnection@1d3697f3::SocketChannelEndPoint@6d000fcf{/192.168.141.33:35102<->/192.168.141.31:8443,OPEN,fill=-,flush=-,to=2/300000}{io=0/0,kio=0,kro=1}->SslConnection@1d3697f3{NEED_UNWRAP,eio=0/-1,di=-1,fill=IDLE,flush=IDLE}~>DecryptedEndPoint@20e999e9{/192.168.141.33:35102<->/192.168.141.31:8443,OPEN,fill=-,flush=-,to=99/300000}=>HttpConnection@264e99b9[p=HttpParser{s=START,0 of -1},g=HttpGenerator@717c736d{s=START}]=>HttpChannelOverHttp@570d89da{r=0,c=false,a=IDLE,uri=null,age=0} javax.net.ssl.SSLException: Received fatal alert: certificate_unknown Now to fix this, do I have to add Knox certificate in my UI Client machine i.e. my laptop’s cacert or what else ? One more query: For any SSL enabled service to be integrated with Knox, the same step i.e. adding Knox certificate in UI Client’s truststore is mandatory ? Regards, Rajat From: Sandeep Moré <moresand...@gmail.com> Reply to: "user@knox.apache.org" <user@knox.apache.org> Date: Wednesday, 14 August 2019 at 7:36 PM To: "user@knox.apache.org" <user@knox.apache.org> Subject: Re: Knox jetty threads stuck and Seeing ConnectionPoolTimeoutException while proxying a custom service Hello Rajat, I see "avax.net.ssl.SSLException: Received fatal alert: certificate_unknown" errors in the logs, this appears to be the root cause. Appears to be a certificate issue. On Mon, Aug 12, 2019 at 1:51 PM Rajat Goel <rajat.g...@thalesgroup.com<mailto:rajat.g...@thalesgroup.com>> wrote: Attaching full debug log as well. From: Rajat Goel <rajat.g...@thalesgroup.com<mailto:rajat.g...@thalesgroup.com>> Reply to: "user@knox.apache.org<mailto:user@knox.apache.org>" <user@knox.apache.org<mailto:user@knox.apache.org>> Date: Monday, 12 August 2019 at 10:50 PM To: "user@knox.apache.org<mailto:user@knox.apache.org>" <user@knox.apache.org<mailto:user@knox.apache.org>> Subject: Re: Knox jetty threads stuck and Seeing ConnectionPoolTimeoutException while proxying a custom service Hi Sandeep, Thanks for your reply. Yes, the websocket traffic works as expected initially. Connection is established and data flows correctly. Websocket backend is secured as well. To check if the issue is due to SSL or not, I disabled SSL on My service and the issue got reproduced with no secure (non SSL) setup as well. Attaching instance logs from one cycle of reproduction. This instance uses SSOCookieProvider (for SSO) and had DEBUG logs enabled only for org.apache.knox.gateway package. Full debug logs file instance is large in size so cannot send via mail but let me know if you need that as well. Also attaching thread dump of Knox for one particular instance when Knox threads were stuck. I don’t see any errors in service log files. The issue is reproducible every time so if you need any other information, please do let me know. Thanks & Regards, Rajat From: Sandeep Moré <moresand...@gmail.com<mailto:moresand...@gmail.com>> Reply to: "user@knox.apache.org<mailto:user@knox.apache.org>" <user@knox.apache.org<mailto:user@knox.apache.org>> Date: Monday, 12 August 2019 at 7:37 PM To: "user@knox.apache.org<mailto:user@knox.apache.org>" <user@knox.apache.org<mailto:user@knox.apache.org>> Subject: Re: Knox jetty threads stuck and Seeing ConnectionPoolTimeoutException while proxying a custom service Hello Rajat, Do you see Websocket traffic work as expected the first time when the UI works ? can you check using the developer console if the initial Websocket connection was established and data is flowing correctly? Looks like you are connecting to a secure backend, is the Websocket backend secure as well ? Can you also post a redacted gateway.log file with DEBUG logging, it would be helpful to see the entire log file with the stacktrace. Also, do you see any errors in your service log files ? Best, Sandeep On Mon, Aug 12, 2019 at 1:08 AM Rajat Goel <rajat.g...@thalesgroup.com<mailto:rajat.g...@thalesgroup.com>> wrote: Hello, I need some help while trying to integrate my custom service with Knox. Here’s how the service is deployed: _ Knox (Instance 1) _ _ My Service1 (Master). | | | UI Client -> Load Balancer --> | | -- For Availability --> | (HA Proxy) | | (HA Proxy) | | _ Knox (Instance 2) _ | | _ My Service2 (Standby) My custom service UI client uses both web sockets as well as REST calls to talk to My Service backend. I have written service definition xml and rewrite xml files and deployed them on a test setup as per the above architecture. My Knox version is 1.0 (HDP 3.1). I have added the service in Knox default topology and default topology uses Shiro Provider for authentication and ‘gateway.websocket.feature.enabled’ is set to true. When Knox service is started, I see the following behavior: * When I launch the UI from browser, I get Shiro authentication dialog popup, I enter my credentials and get redirected to my Service UI. Initially, for the first couple of minutes, UI works fine. All calls from UI are getting routed properly to backend. * After first few minutes, I start seeing following exception in Knox gateway logs: 2019-08-10 15:37:32,053 ERROR gateway.websockets (ProxyWebSocketAdapter.java:onWebSocketConnect(105)) - Unable to connect to websocket server: java.io.IOException: Connect failure java.io.IOException: Connect failure Caused by: org.eclipse.jetty.websocket.api.UpgradeException: 0 null at org.eclipse.jetty.websocket.client.WebSocketUpgradeRequest.onComplete(WebSocketUpgradeRequest.java:515) Caused by: java.io.EOFException: HttpConnectionOverHTTP@1e8a6128::SocketChannelEndPoint@5526bc43{rafint001-mgt-01.cloud.in.guavus.com/192.168.141.33:9443<http://rafint001-mgt-01.cloud.in.guavus.com/192.168.141.33:9443><->/192.168.141.31:49860<http://192.168.141.31:49860>,ISHUT,fill=-,flush=-,to=3/0}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@1e8a6128(l:/192.168.141.31:49860<http://192.168.141.31:49860> <-> r:rafint001-mgt-01.cloud.in.guavus.com/192.168.141.33:9443,closed=false)=<http://rafint001-mgt-01.cloud.in.guavus.com/192.168.141.33:9443,closed=false)=>>HttpChannelOverHTTP@500cd5d(exchange=HttpExchange@75bf9768 req=TERMINATED/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@6678aae8(req=QUEUED,snd=COMPLETED,failure=null)[HttpGenerator@6dcd6c0{s=START}],recv=HttpReceiverOverHTTP@5358963(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]] ... 13 more * UI starts slowing down and response times increase. * I also see DEBUG exceptions such as: DEBUG io.FillInterest (FillInterest.java:onFail(134)) - onFail FillInterest@1d803962{null} java.util.concurrent.TimeoutException: Idle timeout expired: 300000/300000 ms at org.eclipse.jetty.io.IdleTimeout.checkIdleTimeout(IdleTimeout.java:166) at org.eclipse.jetty.io.IdleTimeout$1.run(IdleTimeout.java:50) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) DEBUG io.WriteFlusher (WriteFlusher.java:onFail(471)) - ignored: WriteFlusher@3a2f4fc0{IDLE}->null java.nio.channels.ClosedChannelException at org.eclipse.jetty.io.WriteFlusher.onClose(WriteFlusher.java:502) at org.eclipse.jetty.io.AbstractEndPoint.onClose(AbstractEndPoint.java:353) at org.eclipse.jetty.io.ChannelEndPoint.onClose(ChannelEndPoint.java:216) at org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:225) at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192) at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175) at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.doClose(SslConnection.java:1132) at org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220) DEBUG ssl.SslConnection (SslConnection.java:flush(950)) - SslConnection@1cb9d1f9::SocketChannelEndPoint@6fa560e8{/192.168.141.31:44258<http://192.168.141.31:44258><->/192.168.141.31:8443<http://192.168.141.31:8443>,ISHUT,fill=-,flush=-,to=8/300000}{io=0/0,kio=0,kro=1}->SslConnection@1cb9d1f9{NEED_UNWRAP,eio=-1/-1,di=-1,fill=IDLE,flush=IDLE}~>DecryptedEndPoint@205f9957{/192.168.141.31:44258<http://192.168.141.31:44258><->/192.168.141.31:8443<http://192.168.141.31:8443>,CLOSED,fill=-,flush=-,to=9/300000}=>HttpConnection@11b5199c[p=HttpParser{s=CLOSED,0 of -1},g=HttpGenerator@40706618{s=START}]=>HttpChannelOverHttp@655d0b98{r=0,c=false,a=IDLE,uri=null,age=0} java.io.IOException: Broken pipe at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.flush(SslConnection.java:847) at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.doShutdownOutput(SslConnection.java:1076) at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.doClose(SslConnection.java:1131) at org.eclipse.jetty.io.AbstractEndPoint.doOnClose(AbstractEndPoint.java:220) at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:192) at org.eclipse.jetty.io.AbstractEndPoint.close(AbstractEndPoint.java:175) at org.eclipse.jetty.io.AbstractConnection.close(AbstractConnection.java:248) at org.eclipse.jetty.server.HttpChannelOverHttp.earlyEOF(HttpChannelOverHttp.java:234) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:1551) at org.eclipse.jetty.server.HttpConnection.parseRequestBuffer(HttpConnection.java:360) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:250) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) at org.eclipse.jetty.io.ssl.SslConnection$DecryptedEndPoint.onFillable(SslConnection.java:411) * Few minutes later, UI hangs. I start seeing following Exceptions in Knox gateway log: 2019-08-10 15:47:41,366 WARN knox.gateway (DefaultDispatch.java:executeOutboundRequest(147)) - Connection exception dispatching request: https://rafint001-mgt-01.cloud.in.guavus.com:9443/_sock/iframe.html org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:313) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:279) 2019-08-10 15:47:41,371 ERROR knox.gateway (AbstractGatewayFilter.java:doFilter(63)) - Failed to execute filter: java.io.IOException: Service connectivity error. java.io.IOException: Service connectivity error. at org.apache.knox.gateway.dispatch.DefaultDispatch.executeOutboundRequest(DefaultDispatch.java:148) at org.apache.knox.gateway.dispatch.DefaultDispatch.executeRequest(DefaultDispatch.java:116) at org.apache.knox.gateway.dispatch.DefaultDispatch.doGet(DefaultDispatch.java:278) * Few minutes later, when I refresh UI page (after clearing any cookies) I see HTTP 401 Error on UI. This time no Shiro authentication dialog popup. Nothing seems to work hereafter. I see following logs only in gateway.log: 2019-08-10 15:40:10,667 DEBUG knox.gateway (GatewayFilter.java:doFilter(119)) - Received request: GET /pdie/cdap/ 2019-08-10 15:40:10,667 DEBUG authc.BasicHttpAuthenticationFilter (BasicHttpAuthenticationFilter.java:sendChallenge(274)) - Authentication required: sending 401 Authentication challenge response. * When I restart Knox service from Ambari or run a ‘touch’ command on topology file (/etc/knox/conf/topologies/default.xml), UI starts working again and the above set of issues repeats. * I tried increasing gateway threads (gateway.threadpool.max to 500) and HTTP connections (gateway.httpclient.maxConnections to 100). With these settings, UI works fine for a little more time as compared to scenario with default values but eventually hit the same issues as above. * When UI hangs, I tried to thread dump of Knox service. I see that many of the threads in Knox are stuck in socket read calls with following trace: "qtp2099051403-139" #139 prio=5 os_prio=0 tid=0x00007f4d7c002000 nid=0x6ed8 runnable [0x00007f4dee0e0000] java.lang.Thread.State: RUNNABLE at java.net.SocketInputStream.socketRead0(Native Method) at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) at java.net.SocketInputStream.read(SocketInputStream.java:171) at java.net.SocketInputStream.read(SocketInputStream.java:141) at sun.security.ssl.InputRecord.readFully(InputRecord.java:465) at sun.security.ssl.InputRecord.read(InputRecord.java:503) at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975) - locked <0x00000006f471e218> (a java.lang.Object) at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933) at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) - locked <0x00000006f471edc0> (a sun.security.ssl.AppInputStream) at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at org.apache.knox.gateway.dispatch.DefaultDispatch.executeOutboundRequest(DefaultDispatch.java:130) at org.apache.knox.gateway.dispatch.DefaultDispatch.executeRequest(DefaultDispatch.java:116) at org.apache.knox.gateway.dispatch.DefaultDispatch.doPost(DefaultDispatch.java:305) at org.apache.knox.gateway.dispatch.GatewayDispatchFilter$PostAdapter.doMethod(GatewayDispatchFilter.java:177) at org.apache.knox.gateway.dispatch.GatewayDispatchFilter.doFilter(GatewayDispatchFilter.java:122) at org.apache.knox.gateway.filter.AbstractGatewayFilter.doFilter(AbstractGatewayFilter.java:61) at org.apache.knox.gateway.GatewayFilter$Holder.doFilter(GatewayFilter.java:372) at org.apache.knox.gateway.GatewayFilter$Chain.doFilter(GatewayFilter.java:272) Can some please help on resolving the above set of issues: 1. With the deployment architecture as described above: UI -> HAProxy -> Knox -> HAProxy -> Service backend, are there any specific configurations required in HA Proxy and Knox to make this work. Is this right way to deploy ? 2. ‘org.eclipse.jetty.websocket.api.UpgradeException’ and how can this be fixed? 3. How to debug and fix Knox Jetty threads hang ? Thanks & Regards, Rajat