[ 
https://issues.apache.org/jira/browse/SSHD-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14244009#comment-14244009
 ] 

Guillaume Nodet commented on SSHD-348:
--------------------------------------

One thing to note is that the fact that threads are stuck in this state is not 
an abnormal condition per se.
If jenkins does not read the data fast enough, the server threads will be stuck.
One way to solve that problem would be to have the gerrit commands implement 
AsyncCommand and leverage the given IoOutputStream to write events in a non 
blocking way.

Anyway, I've reproduced a use case when the consumption of messages is not fast 
enough.  The system tcp buffer will be filled, which will block the writes on 
the server side (because the default ssh window is much bigger than the tcp 
buffer).  Further writes will thus be delayed until the all previous data has 
been read by the consumer.  This is especially the case for disconnect messages 
due to an idle session timeout (in case the consumer is stuck).  I'll add a 
timeout for this very case.

Another point that could cause issue is when 
ChannelSession#doCloseImmediately() is called, if the Command#destroy() method 
blocks in some way, though I'm not sure it's the case.

Can someone try to grab a full thread dump to see if there is any other thread 
stuck somehow ?  Also another peek at the internal values on the stuck channel 
would be helpful.  I need to know if something has changed compared to the 
values we had before 0.13.0, in particular, I need to understand if the 
ChannelSession#doCloseImmediately() has been called or not.
{{code}}
remoteWindow.size
remoteWindow.waiting
remoteWindow.closed
commandExitFuture.result
gracefulState.value
gracefulFuture.result
state.value
closeFuture.result
service.state.value
service.closeFuture.result
session.state.value
session.closeFuture.result
command.done
{{code}}



> Some SSH threads get blocked in Object.wait() method forever
> ------------------------------------------------------------
>
>                 Key: SSHD-348
>                 URL: https://issues.apache.org/jira/browse/SSHD-348
>             Project: MINA SSHD
>          Issue Type: Bug
>    Affects Versions: 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0
>         Environment: Gerrit Code Review 2.9.1 (SSHD 0.12.0)
> Gerrit Code Review 2.9.2 (SSHD 0.13.0)
> Gerrit Code Review 2.9.3 (Downgraded to SSHD 0.9)
>            Reporter: David Ostrovsky
>            Assignee: Guillaume Nodet
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: 0001-Prepare-release-sshd-0.13.0-72f868e26.patch, diff
>
>
> This seems to be a regression started from 0.10.1.
> In Gerrit we have SSH commamds that returns immediately and so called 
> stream-events command which keeps connection open until clients disconnect. 
> Basically for days or weeks. This is used for example to inform CI system 
> (jenkins) about events in gerrit, like new patch set upload etc.
> In Gerrit we have configurable "SSH-Stream-Worker" thread pool which is 
> dedicated to the mentioned stream-events SSH command. The observed behaviour 
> on latest SSHD release is that after some time all threads are stuck. They 
> aren't stuck at the same time, but one after another untill at some time all 
> threads are stuck and Gerrit must be restarted. Usually after one week. The 
> stack dump of all such stuck thread are the same, see below. If we had a 
> patch we could apply it to our production Gerrit instance and try if this 
> helps.
> {code}
> "SSH-Stream-Worker-10" cpu=95400.00 [reset 95400.00] ms elapsed=146444.30 
> [reset 146444.30] s allocated=5526700000 B (5.15 GB) [reset 5526700000 B 
> (5.15 GB)] defined_classes=0
> io= file i/o: 15622752/0 B, net i/o: 0/0 B, files opened:0, socks opened:0  
> [reset file i/o: 15622752/0 B, net i/o: 0/0 B, files opened:0, socks opened:0 
> ] 
> prio=10 tid=0x00007f54514df800 nid=0x1c71 / 7281  pthread-id=139999281374976 
> in Object.wait()  [_thread_blocked (_at_safepoint), 
> stack(0x00007f541f5f6000,0x00007f541f6f7000)] [0x00007f541f6f5000]
>    java.lang.Thread.State: WAITING (on object monitor)
>       at java.lang.Object.wait(J)V(Native Method)
>       - waiting on <0x00007f553aa530d0> (a 
> org.apache.sshd.common.channel.Window)
>       at java.lang.Object.wait()V(Object.java:503)
>       at 
> org.apache.sshd.common.channel.Window.waitForSpace()I(Window.java:148)
>       - locked <0x00007f553aa530d0> (a org.apache.sshd.common.channel.Window)
>       at 
> org.apache.sshd.common.channel.ChannelOutputStream.flush()V(ChannelOutputStream.java:116)
>       - locked <0x00007f553aa55060> (a 
> org.apache.sshd.common.channel.ChannelOutputStream)
>       at 
> org.apache.sshd.common.channel.ChannelOutputStream.write([BII)V(ChannelOutputStream.java:84)
>       - locked <0x00007f553aa55060> (a 
> org.apache.sshd.common.channel.ChannelOutputStream)
>       at sun.nio.cs.StreamEncoder.writeBytes()V(StreamEncoder.java:221)
>       at sun.nio.cs.StreamEncoder.implFlushBuffer()V(StreamEncoder.java:291)
>       at sun.nio.cs.StreamEncoder.implFlush()V(StreamEncoder.java:295)
>       at sun.nio.cs.StreamEncoder.flush()V(StreamEncoder.java:141)
>       - locked <0x00007f553aa7e258> (a java.io.OutputStreamWriter)
>       at java.io.OutputStreamWriter.flush()V(OutputStreamWriter.java:229)
>       at java.io.BufferedWriter.flush()V(BufferedWriter.java:254)
>       - locked <0x00007f553aa7e258> (a java.io.OutputStreamWriter)
>       at java.io.PrintWriter.flush()V(PrintWriter.java:320)
>       - locked <0x00007f553aa7e210> (a java.io.BufferedWriter)
>       at java.io.PrintWriter.checkError()Z(PrintWriter.java:357)
>       at 
> com.google.gerrit.sshd.commands.StreamEvents.writeEvents()V(StreamEvents.java:186)
>       at 
> com.google.gerrit.sshd.commands.StreamEvents.access$100(Lcom/google/gerrit/sshd/commands/StreamEvents;)V(StreamEvents.java:43)
>       at 
> com.google.gerrit.sshd.commands.StreamEvents$3.run()V(StreamEvents.java:82)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;(Executors.java:471)
>       at java.util.concurrent.FutureTask.run()V(FutureTask.java:262)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(Ljava/util/concurrent/ScheduledThreadPoolExecutor$ScheduledFutureTask;)V(ScheduledThreadPoolExecutor.java:178)
>       at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V(ScheduledThreadPoolExecutor.java:292)
>       at 
> com.google.gerrit.server.git.WorkQueue$Task.run()V(WorkQueue.java:364)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V(ThreadPoolExecutor.java:1145)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run()V(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run()V(Thread.java:812)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to