bruce xu created ZEPPELIN-3507:
----------------------------------
Summary: Restart interpreter setting may cause lock contention and
lead to ZeppelinServer return 503 code
Key: ZEPPELIN-3507
URL: https://issues.apache.org/jira/browse/ZEPPELIN-3507
Project: Zeppelin
Issue Type: Bug
Components: zeppelin-interpreter
Affects Versions: 0.7.2
Environment: centos 6, jdk 1.8
Reporter: bruce xu
Attachments: zeppelin-jstack-20180525.log
When user restart interpreter setting, it may cause ZeppelinServer return 503
code because of lock contention.
1) In restart interpreter setting operation, the *Client.recv_cancel* method
may hung for a long time because of not receiving anything from server.
However, this method will hold a lock of concurrentHashMap.
{code:java}
"qtp203777958-12450" prio=10 tid=0x00007f49b0001800 nid=0x30c5 runnable
[0x00007f4a313af000]
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:275)
at java.io.BufferedInputStream.read(BufferedInputStream.java:334)
- locked <0x0000000795526d30> (a java.io.BufferedInputStream)
at
org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:77)
at
org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_cancel(RemoteInterpreterService.java:291)
at
org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.cancel(RemoteInterpreterService.java:276)
at
org.apache.zeppelin.interpreter.remote.RemoteInterpreter.cancel(RemoteInterpreter.java:419)
at
org.apache.zeppelin.interpreter.LazyOpenInterpreter.cancel(LazyOpenInterpreter.java:106)
at org.apache.zeppelin.notebook.Paragraph.jobAbort(Paragraph.java:477)
at org.apache.zeppelin.scheduler.Job.abort(Job.java:240)
at
org.apache.zeppelin.interpreter.InterpreterSettingManager.stopJobByIntpGroup(InterpreterSettingManager.java:1019)
at
org.apache.zeppelin.interpreter.InterpreterSettingManager.stopJobAllInterpreter(InterpreterSettingManager.java:1009)
at
org.apache.zeppelin.interpreter.InterpreterSettingManager.restart(InterpreterSettingManager.java:963)
- locked <0x000000078031cb78> (a java.util.concurrent.ConcurrentHashMap)
at
org.apache.zeppelin.rest.InterpreterRestApi.restartSetting(InterpreterRestApi.java:191)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.cxf.service.invoker.AbstractInvoker.performInvocation(AbstractInvoker.java:180)
at
org.apache.cxf.service.invoker.AbstractInvoker.invoke(AbstractInvoker.java:96)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:205)
at org.apache.cxf.jaxrs.JAXRSInvoker.invoke(JAXRSInvoker.java:102)
at
org.apache.cxf.interceptor.ServiceInvokerInterceptor$1.run(ServiceInvokerInterceptor.java:58)
at
org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:94)
{code}
2) So what is ConcurrentHashMap?
>From the stack information and source code, I find that the lock is
>*interpreterSetting*.
{code:java}
public void restart(String settingId, String user) {
synchronized (interpreterSettings) {
InterpreterSetting intpsetting = interpreterSettings.get(settingId);
// Check if dependency in specified path is changed
// If it did, overwrite old dependency jar with new one
copyDependenciesFromLocalPath(intpsetting);
if (intpsetting != null) {
stopJobAllInterpreter(intpsetting, user);
intpsetting.closeAndRmoveAllInterpreterGroups(user);
} else {
throw new InterpreterException("Interpreter setting id " + settingId
+ " not found");
}
}
}
{code}
3) How can this lock cause lock contention?
Because *NotebookServer* has a *onWebSocketText* method which will try to get
the lock of InterpreterSetting object.
{code:java}
"qtp203777958-12456" prio=10 tid=0x00007f49bc006800 nid=0x46ab waiting for
monitor entry [0x00007f4a30a6f000]
java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.zeppelin.interpreter.InterpreterSettingManager.get(InterpreterSettingManager.java:1035)
- waiting to lock <0x000000078031cb78> (a
java.util.concurrent.ConcurrentHashMap)
at
org.apache.zeppelin.interpreter.InterpreterSettingManager.getInterpreterSettings(InterpreterSettingManager.java:450)
at
org.apache.zeppelin.socket.NotebookServer.sendAllAngularObjects(NotebookServer.java:2115)
at
org.apache.zeppelin.socket.NotebookServer.sendNote(NotebookServer.java:724)
at
org.apache.zeppelin.socket.NotebookServer.onMessage(NotebookServer.java:220)
at
org.apache.zeppelin.socket.NotebookSocket.onWebSocketText(NotebookSocket.java:59)
at
org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onTextMessage(JettyListenerEventDriver.java:128)
at
org.eclipse.jetty.websocket.common.message.SimpleTextMessage.messageComplete(SimpleTextMessage.java:69)
at
org.eclipse.jetty.websocket.common.events.AbstractEventDriver.appendMessage(AbstractEventDriver.java:65)
at
org.eclipse.jetty.websocket.common.events.JettyListenerEventDriver.onTextFrame(JettyListenerEventDriver.java:122)
at
org.eclipse.jetty.websocket.common.events.AbstractEventDriver.incomingFrame(AbstractEventDriver.java:161)
at
org.eclipse.jetty.websocket.common.WebSocketSession.incomingFrame(WebSocketSession.java:309)
at
org.eclipse.jetty.websocket.common.extensions.ExtensionStack.incomingFrame(ExtensionStack.java:214)
at
org.eclipse.jetty.websocket.common.Parser.notifyFrame(Parser.java:220)
at org.eclipse.jetty.websocket.common.Parser.parse(Parser.java:258)
at
org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.readParse(AbstractWebSocketConnection.java:632)
at
org.eclipse.jetty.websocket.common.io.AbstractWebSocketConnection.onFillable(AbstractWebSocketConnection.java:480)
at
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:544)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
{code}
When there is a lot of threads trying to get the lock, it will cause lock
contention and slow down the ZeppelinServer performance until 503 code returned.
The jstack log file is attached below.
[^zeppelin-jstack-20180525.log]
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)