[ https://issues.apache.org/jira/browse/SPARK-33669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17244456#comment-17244456 ]
Apache Spark commented on SPARK-33669:
--------------------------------------

User 'sqlwindspeaker' has created a pull request for this issue:
https://github.com/apache/spark/pull/30617

> Wrong error message from YARN application state monitor when sc.stop in yarn client mode
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-33669
>                 URL: https://issues.apache.org/jira/browse/SPARK-33669
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.4.3, 3.0.1
>            Reporter: Su Qilong
>            Priority: Minor
>
> In yarn-client mode, stopping YarnClientSchedulerBackend first interrupts the
> YARN application monitor thread. MonitorThread.run() catches
> InterruptedException so that it can respond gracefully to the stop request.
> However, if the interrupt arrives while a Hadoop RPC call is in progress, the
> call made inside client.monitorApplication raises InterruptedIOException
> instead. In that case MonitorThread does not know it was interrupted: a failed
> YARN application report is returned, and "Failed to contact YARN for
> application xxxxx" and "YARN application has exited unexpectedly with state
> xxxxx" are logged at ERROR level, which confuses users.
> We should handle InterruptedIOException here so that it behaves the same as
> InterruptedException.
> {code:java}
> private class MonitorThread extends Thread {
>   private var allowInterrupt = true
>   override def run() {
>     try {
>       val YarnAppReport(_, state, diags) =
>         client.monitorApplication(appId.get, logApplicationReport = false)
>       logError(s"YARN application has exited unexpectedly with state $state! " +
>         "Check the YARN application logs for more details.")
>       diags.foreach { err =>
>         logError(s"Diagnostics message: $err")
>       }
>       allowInterrupt = false
>       sc.stop()
>     } catch {
>       case e: InterruptedException => logInfo("Interrupting monitor thread")
>     }
>   }
>
> {code}
> {code:java}
> // wrong error message
> 2020-12-05 03:06:58,000 ERROR [YARN application state monitor]: org.apache.spark.deploy.yarn.Client(91) - Failed to contact YARN for application application_1605868815011_1154961.
> java.io.InterruptedIOException: Call interrupted
>     at org.apache.hadoop.ipc.Client.call(Client.java:1466)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1409)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
>     at com.sun.proxy.$Proxy38.getApplicationReport(Unknown Source)
>     at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationClientProtocolPBClientImpl.getApplicationReport(ApplicationClientProtocolPBClientImpl.java:187)
>     at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>     at com.sun.proxy.$Proxy39.getApplicationReport(Unknown Source)
>     at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.getApplicationReport(YarnClientImpl.java:408)
>     at org.apache.spark.deploy.yarn.Client.getApplicationReport(Client.scala:327)
>     at org.apache.spark.deploy.yarn.Client.monitorApplication(Client.scala:1039)
>     at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:116)
> 2020-12-05 03:06:58,000 ERROR [YARN application state monitor]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend(70) - YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
> 2020-12-05 03:06:58,001 ERROR [YARN application state monitor]: org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend(70) - Diagnostics message: Failed to contact YARN for application application_1605868815011_1154961.
> {code}
>
> {code:java}
> // hadoop ipc code
> public Writable call(RPC.RpcKind rpcKind, Writable rpcRequest,
>     ConnectionId remoteId, int serviceClass,
>     AtomicBoolean fallbackToSimpleAuth) throws IOException {
>   final Call call = createCall(rpcKind, rpcRequest);
>   Connection connection = getConnection(remoteId, call, serviceClass,
>     fallbackToSimpleAuth);
>   try {
>     connection.sendRpcRequest(call); // send the rpc request
>   } catch (RejectedExecutionException e) {
>     throw new IOException("connection has been closed", e);
>   } catch (InterruptedException e) {
>     Thread.currentThread().interrupt();
>     LOG.warn("interrupted waiting to send rpc request to server", e);
>     throw new IOException(e);
>   }
>   synchronized (call) {
>     while (!call.done) {
>       try {
>         call.wait(); // wait for the result
>       } catch (InterruptedException ie) {
>         Thread.currentThread().interrupt();
>         throw new InterruptedIOException("Call interrupted");
>       }
>     }
> {code}
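
The behaviour requested in the issue can be illustrated with a small standalone Java simulation. This is only a sketch under stated assumptions, not the change made in the pull request and not Spark or Hadoop code: the class MonitorInterruptSketch, the stand-in monitorApplication method, its inRpc flag, and the helpers runMonitor and stopAndCollect are all hypothetical names introduced for illustration. The stand-in blocks until interrupted and then either rethrows InterruptedException or, mimicking the Hadoop IPC snippet quoted above, restores the interrupt flag and throws InterruptedIOException. The monitor treats both as an ordinary stop request.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.CountDownLatch;

public class MonitorInterruptSketch {

    // Hypothetical stand-in for the blocking monitor call; it only ends when interrupted.
    // inRpc = true mimics the quoted Hadoop IPC code, which turns the interrupt into
    // InterruptedIOException; inRpc = false mimics an interrupt during a plain wait.
    static String monitorApplication(boolean inRpc, CountDownLatch started)
            throws IOException, InterruptedException {
        started.countDown(); // signal that the monitor thread is running
        try {
            new CountDownLatch(1).await(); // never released, so only an interrupt ends the wait
        } catch (InterruptedException ie) {
            if (!inRpc) {
                throw ie;
            }
            Thread.currentThread().interrupt(); // restore the flag, as the Hadoop code does
            throw new InterruptedIOException("Call interrupted");
        }
        return "FINISHED"; // not reached in this sketch
    }

    // Monitor body: both interrupt flavours are handled as a normal stop request.
    static String runMonitor(boolean inRpc, CountDownLatch started) {
        try {
            String state = monitorApplication(inRpc, started);
            return "ERROR: YARN application has exited unexpectedly with state " + state;
        } catch (InterruptedException | InterruptedIOException e) {
            return "INFO: Interrupting monitor thread";
        } catch (IOException e) {
            return "ERROR: Failed to contact YARN";
        }
    }

    // Starts a monitor thread, interrupts it once it is running, and returns its message.
    static String stopAndCollect(boolean inRpc) {
        CountDownLatch started = new CountDownLatch(1);
        String[] result = new String[1];
        Thread monitor = new Thread(() -> result[0] = runMonitor(inRpc, started));
        monitor.start();
        try {
            started.await();
            monitor.interrupt();
            monitor.join(); // join makes result[0] visible to the caller
        } catch (InterruptedException e) {
            throw new IllegalStateException("caller was interrupted", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        System.out.println(stopAndCollect(false)); // interrupt surfaces as InterruptedException
        System.out.println(stopAndCollect(true));  // interrupt surfaces as InterruptedIOException
    }
}
```

In this simulation both runs print "INFO: Interrupting monitor thread". If the InterruptedIOException alternative is removed from the multi-catch, the second run falls through to the generic IOException branch and reports an error, which is analogous to the misleading ERROR log lines described in the issue.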