Thanks Alex. I understand the reason for synchronizing
note<->client_connection. However, I don't understand why a LIST_NOTES
request, which does not involve any changes, makes the server send the
list of notes to all clients via broadcastNoteList(), which uses
broadcastAll.
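
To make the question concrete, here is a minimal sketch of the two reply
strategies. It is not Zeppelin's code: SketchConn, SketchServer and the method
bodies are hypothetical stand-ins, and only the names LIST_NOTES, broadcastAll
and broadcastNoteList come from this thread. A read-only request is answered on
the requesting connection only, and broadcastAll iterates a
CopyOnWriteArrayList, so no shared lock is held while sending.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for a websocket connection; it only records messages.
class SketchConn {
    final List<String> received = new ArrayList<>();

    void send(String msg) {
        received.add(msg);
    }
}

// Hypothetical sketch, not the real NotebookServer.
class SketchServer {
    // Iteration runs over a snapshot, so sends happen without holding a lock.
    private final List<SketchConn> connections = new CopyOnWriteArrayList<>();

    void register(SketchConn conn) {
        connections.add(conn);
    }

    // Notify every client, e.g. after a note was created or removed.
    void broadcastAll(String msg) {
        for (SketchConn conn : connections) {
            conn.send(msg);
        }
    }

    // Answer a read-only LIST_NOTES request on the requesting connection only.
    void broadcastNoteList(SketchConn requester, String noteList) {
        requester.send(noteList);
    }
}
```

With this shape, a slow or stuck socket delays only the sends that follow it in
the loop, not threads that merely need to register or look up a connection.
Whether that trade-off is acceptable for Zeppelin's collaboration features is
exactly the open question.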

After deploying the changes I mentioned earlier, the server ran fine for 18
hours before running into a deadlock (jstack output below). We could
download the top-level page and notes but could not run any paragraphs. A
server restart fixed the problem. Do you think this is a result of my
changes or a separate issue?

Found one Java-level deadlock:
=============================
"qtp873175411-3443":
  waiting to lock monitor 0x00000000031e6158 (object 0x00000006c3b1fba8, a java.util.HashMap),
  which is held by "DefaultQuartzScheduler_Worker-4"
"DefaultQuartzScheduler_Worker-4":
  waiting to lock monitor 0x000000000268ad58 (object 0x00000006c34a12c0, a java.util.ArrayList),
  which is held by "DefaultQuartzScheduler_Worker-2"
"DefaultQuartzScheduler_Worker-2":
  waiting to lock monitor 0x00000000031e6158 (object 0x00000006c3b1fba8, a java.util.HashMap),
  which is held by "DefaultQuartzScheduler_Worker-4"

Java stack information for the threads listed above:
===================================================
"qtp873175411-3443":
at org.apache.zeppelin.interpreter.InterpreterFactory.getNoteInterpreterSettingBinding(InterpreterFactory.java:502)
- waiting to lock <0x00000006c3b1fba8> (a java.util.HashMap)
at org.apache.zeppelin.notebook.NoteInterpreterLoader.getInterpreterSettings(NoteInterpreterLoader.java:60)
at org.apache.zeppelin.socket.NotebookServer.sendAllAngularObjects(NotebookServer.java:951)
at org.apache.zeppelin.socket.NotebookServer.sendNote(NotebookServer.java:437)
at org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:123)
at org.apache.zeppelin.socket.NotebookSocket.onMessage(NotebookSocket.java:70)
at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameHandler.onFrame(WebSocketConnectionRFC6455.java:835)
at org.eclipse.jetty.websocket.WebSocketParserRFC6455.parseNext(WebSocketParserRFC6455.java:349)
at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455.handle(WebSocketConnectionRFC6455.java:225)
at org.eclipse.jetty.io.nio.SslConnection.handle(SslConnection.java:196)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667)
at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Thread.java:745)
"DefaultQuartzScheduler_Worker-4":
at org.apache.zeppelin.notebook.Note.getParagraphs(Note.java:441)
- waiting to lock <0x00000006c34a12c0> (a java.util.ArrayList)
at org.apache.zeppelin.search.LuceneSearch.updateIndexDoc(LuceneSearch.java:172)
at org.apache.zeppelin.notebook.Note.persist(Note.java:463)
at org.apache.zeppelin.socket.NotebookServer$ParagraphJobListener.afterStatusChange(NotebookServer.java:935)
at org.apache.zeppelin.scheduler.Job.setStatus(Job.java:143)
at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:271)
at org.apache.zeppelin.scheduler.Job.abort(Job.java:232)
at org.apache.zeppelin.interpreter.InterpreterFactory.stopJobAllInterpreter(InterpreterFactory.java:593)
at org.apache.zeppelin.interpreter.InterpreterFactory.restart(InterpreterFactory.java:547)
- locked <0x00000006c3b1fba8> (a java.util.HashMap)
at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:440)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
- locked <0x00000006c3ac3dc0> (a java.lang.Object)
"DefaultQuartzScheduler_Worker-2":
at org.apache.zeppelin.interpreter.InterpreterFactory.getNoteInterpreterSettingBinding(InterpreterFactory.java:502)
- waiting to lock <0x00000006c3b1fba8> (a java.util.HashMap)
at org.apache.zeppelin.notebook.NoteInterpreterLoader.getInterpreterSettings(NoteInterpreterLoader.java:60)
at org.apache.zeppelin.notebook.NoteInterpreterLoader.get(NoteInterpreterLoader.java:77)
at org.apache.zeppelin.notebook.Note.runAll(Note.java:409)
- locked <0x00000006c34a12c0> (a java.util.ArrayList)
at org.apache.zeppelin.notebook.Notebook$CronJob.execute(Notebook.java:419)
at org.quartz.core.JobRunShell.run(JobRunShell.java:202)
at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:573)
- locked <0x00000006c3abd630> (a java.lang.Object)

Found 1 deadlock.
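
The trace above shows a lock-order inversion: Worker-4 holds the HashMap
monitor (taken in InterpreterFactory.restart) and waits for the ArrayList,
while Worker-2 holds the ArrayList monitor (taken in Note.runAll) and waits for
the HashMap. A common remedy for this pattern is to acquire both monitors in
one fixed order on every path. The sketch below is only an illustration under
that assumption: LockOrderSketch, restartPath, runAllPath and runBoth are
hypothetical names, not Zeppelin code, and whether reordering is practical in
the real call chains would need checking.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of consistent lock ordering; not Zeppelin's code.
class LockOrderSketch {
    // Stand-ins for the two monitors that appear in the jstack output.
    private final Map<String, String> settings = new HashMap<>();
    private final List<String> paragraphs = new ArrayList<>();
    private int completed = 0; // only modified while both monitors are held

    // Mirrors the restart path: settings first, then paragraphs.
    void restartPath() {
        synchronized (settings) {
            synchronized (paragraphs) {
                completed++;
            }
        }
    }

    // Mirrors the runAll path. In the trace it takes paragraphs first and then
    // settings; here it uses the same order as restartPath, so no cycle forms.
    void runAllPath() {
        synchronized (settings) {
            synchronized (paragraphs) {
                completed++;
            }
        }
    }

    // Runs both paths concurrently and returns how many calls completed.
    static int runBoth(int iterations) {
        LockOrderSketch sketch = new LockOrderSketch();
        Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) sketch.restartPath();
        });
        Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) sketch.runAllPath();
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.completed;
    }
}
```

Calling LockOrderSketch.runBoth(1000) lets both threads finish because neither
can hold one monitor while waiting for the other in the opposite order.
Swapping the nesting in runAllPath would recreate the cycle reported by jstack.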

On Thu, Apr 7, 2016 at 6:46 PM, Alexander Bezzubov <b...@apache.org> wrote:

> Hi,
>
> thank you Eric, upgrading Jetty sounds like a great idea!
>
> Prasad, I think broadcastAll and synchronization of
> note<->client_connection are used by default so that multiple people can
> collaborate on the same Note in real time: all other clients who have
> this Note open are notified about the changes that you made in your
> browser tab (as you can see with 2 different tabs).
>
> I believe it might be possible to replace the map with a concurrent
> implementation to avoid excessive synchronization though, as we did in [1]
> before. If the same behaviour persists after upgrading to Jetty 9, could
> you please create a separate issue for that? I will be happy to help and
> look more into it.
>
> Thanks!
>
>  1. https://issues.apache.org/jira/browse/ZEPPELIN-312
>
> --
> Alex
>
>
> On Fri, Apr 8, 2016 at 1:28 AM, Prasad Wagle <prasadwa...@gmail.com>
> wrote:
>
>> Thanks Eric! I created https://issues.apache.org/jira/browse/ZEPPELIN-798
>> - Migrate to Jetty version 9 that has fix for websocket deadlock bug
>> causing Zeppelin server hangs. This is pretty important for us so please
>> let me know how I can help.
>>
>> For now, I have made some changes to reduce websocket communications and
>> probability of hangs:
>>
>>    - For the LIST_NOTES operation, I use broadcastNoteList(conn) that
>>    sends note list to the current connection instead of using broadcastAll.
>>    What is the reason for using broadcastAll?
>>    - I removed synchronized (noteSocketMap) from broadcast so that one
>>    bad socket does not hang the server. Do you think this can cause serious
>>    problems?
>>
>>
>> On Thu, Apr 7, 2016 at 3:06 AM, Eric Charles <e...@apache.org> wrote:
>>
>>> On 07/04/16 07:18, Prasad Wagle wrote:
>>>
>>>> Hi,
>>>>
>>>> We experienced three Zeppelin server hangs today. I have included one of
>>>> the stack traces below. It is similar to the stack trace in a websocket
>>>> deadlock bug in Jetty 8. From the bug report
>>>> <https://bugs.eclipse.org/bugs/show_bug.cgi?id=389645>:
>>>>
>>>>     However, Jetty 9 has already refactored the low level read/write on
>>>>     a socket heavily to compensate for websocket, spdy, and http/2
>>>>     Marking this as WONTFIX for Jetty 7/8
>>>>     Use Jetty 9
>>>>
>>>>
>>>> Is there a workaround? Has anyone tried using Jetty 9 in Zeppelin? What
>>>> is the effort involved?
>>>>
>>>
>>>
>>> I have upgraded the source code to Jetty 9 which implies a few different
>>> constructs.
>>>
>>> Could you open a JIRA? I will then submit a PR.
>>>
>>>
>>>> Thanks,
>>>> Prasad
>>>>
>>>>
>>>> *Stack trace*
>>>>
>>>>
>>>> "pool-1-thread-10" #141 prio=5 os_prio=0 tid=0x0000000001513000 nid=0x6749 in Object.wait() [0x00007fdab6ff4000]
>>>>     java.lang.Thread.State: TIMED_WAITING (on object monitor)
>>>>          at java.lang.Object.wait(Native Method)
>>>>          at org.eclipse.jetty.io.nio.SelectChannelEndPoint.blockWritable(SelectChannelEndPoint.java:494)
>>>>          - locked <0x00000006c50d9b48> (a org.eclipse.jetty.io.nio.SelectChannelEndPoint)
>>>>          at org.eclipse.jetty.io.nio.SslConnection$SslEndPoint.blockWritable(SslConnection.java:723)
>>>>          at org.eclipse.jetty.websocket.WebSocketGeneratorRFC6455.flush(WebSocketGeneratorRFC6455.java:248)
>>>>          at org.eclipse.jetty.websocket.WebSocketGeneratorRFC6455.addFrame(WebSocketGeneratorRFC6455.java:114)
>>>>          at org.eclipse.jetty.websocket.WebSocketConnectionRFC6455$WSFrameConnection.sendMessage(WebSocketConnectionRFC6455.java:439)
>>>>          at org.apache.zeppelin.socket.NotebookSocket.send(NotebookSocket.java:89)
>>>>          at org.apache.zeppelin.socket.NotebookServer.broadcast(NotebookServer.java:286)
>>>>          - locked <0x00000006c3a1cd08> (a java.util.HashMap)
>>>>          at org.apache.zeppelin.socket.NotebookServer.broadcastNote(NotebookServer.java:370)
>>>>          at org.apache.zeppelin.socket.NotebookServer$ParagraphJobListener.afterStatusChange(NotebookServer.java:945)
>>>>          at org.apache.zeppelin.scheduler.Job.setStatus(Job.java:143)
>>>>          at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.afterStatusChange(RemoteScheduler.java:379)
>>>>          at org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller.getStatus(RemoteScheduler.java:261)
>>>>          - locked <0x00000006c5885178> (a org.apache.zeppelin.scheduler.RemoteScheduler$JobStatusPoller)
>>>>          at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:335)
>>>>          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>>>          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>>>          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>          at java.lang.Thread.run(Thread.java:745)
>>>>
>>>
