Github user markgrover commented on the pull request:

    https://github.com/apache/spark/pull/8093#issuecomment-129712836
  
    There's one thing I'd really appreciate people's thoughts on:
    
    There seems to be a race condition in how the error message gets displayed
in the UI. When YARN kills a container, the Yarn scheduler endpoint generally
receives two events (both paths are sketched below):
    1. onDisconnected() (in CoarseGrainedSchedulerBackend.scala), which is
called simply because the executor on the other end of the RPC connection is
gone. No reason is given here, since the executor may have died for any number
of reasons, so we log a generic one - *remote Rpc client disassociated* - which
is all we know at that point.
    2. A RemoveExecutor message, which explicitly tells the listener why the
executor was killed. This is when we know the actual reason, where applicable -
something like the memory limit being exceeded.
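
    To make (1) and (2) a bit more concrete, here is a rough, self-contained
sketch of the two paths. The types and the class below are simplified stand-ins
for illustration only, not the actual Spark classes:

    ```scala
    // Simplified stand-ins for ExecutorLossReason and the driver-side endpoint in
    // CoarseGrainedSchedulerBackend.scala; names and shapes are illustrative only.
    sealed trait ExecutorLossReason { def message: String }
    case class SlaveLost(message: String) extends ExecutorLossReason
    case class ExecutorExited(message: String) extends ExecutorLossReason

    // `removeExecutor` stands in for the scheduler's internal clean-up, which
    // eventually drives TaskSetManager.executorLost() with the reason given here.
    class DriverEndpointSketch(removeExecutor: (String, ExecutorLossReason) => Unit) {

      // Path 1: the RPC layer only tells us the connection dropped, so the best
      // we can attach is a generic reason.
      def onDisconnected(executorId: String): Unit =
        removeExecutor(executorId, SlaveLost("remote Rpc client disassociated"))

      // Path 2: an explicit RemoveExecutor message carries the real reason, e.g.
      // YARN reporting that the container exceeded its memory limits.
      def handleRemoveExecutor(executorId: String, reason: String): Unit =
        removeExecutor(executorId, ExecutorExited(reason))
    }
    ```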
    
    The UI displays whatever is posted via SparkListenerTaskEnd on the listener
bus by the DAGScheduler. That event is posted on the bus by handleFailedTask(),
which is called by executorLost() in TaskSetManager, and executorLost() can be
triggered by either (1) or (2) above (see the sketch below).
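
    To be clear about the mechanics, here is a very simplified, hypothetical
sketch of that path; the names below stand in for TaskSetManager, the listener
bus and SparkListenerTaskEnd, and are not the real Spark code:

    ```scala
    import scala.collection.mutable

    // Stand-in for SparkListenerTaskEnd: carries whatever reason we had in hand.
    case class TaskEndEvent(taskId: Long, reason: String)

    // Stand-in for the listener bus the UI listens on.
    class ListenerBusSketch {
      def post(event: TaskEndEvent): Unit =
        println(s"UI will render task ${event.taskId} as failed: ${event.reason}")
    }

    // Stand-in for TaskSetManager: fails every task that was running on the lost
    // executor, attaching the reason it was handed at that moment.
    class TaskSetManagerSketch(bus: ListenerBusSketch) {
      private val runningTasksByExecutor = mutable.Map.empty[String, List[Long]]

      def taskStarted(executorId: String, taskId: Long): Unit =
        runningTasksByExecutor(executorId) =
          taskId :: runningTasksByExecutor.getOrElse(executorId, Nil)

      // Called from either path (1) or (2). Whatever `reason` this call is handed
      // is what handleFailedTask() puts on the listener bus, and hence in the UI.
      def executorLost(executorId: String, reason: String): Unit =
        runningTasksByExecutor.remove(executorId).getOrElse(Nil).foreach { taskId =>
          handleFailedTask(taskId, reason)
        }

      private def handleFailedTask(taskId: Long, reason: String): Unit =
        bus.post(TaskEndEvent(taskId, reason))
    }
    ```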
    
    Ok, so the race condition is really about whether we get event #1 before
event #2 or vice versa. If the reason-less #1 arrives before #2, the UI shows
the generic message "remote Rpc client disassociated". I have updated that
message to be a little more helpful - *Remote Rpc client disassociated. Likely
due to containers exceeding thresholds, or network issues. Check driver logs
for WARNings* - but it's still pretty generic.
    
    Conversely, if we get #2 before #1, the UI does show the legitimate reason.
The first-come-first-recorded behaviour is illustrated in the toy sketch below.
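
    Here is a toy illustration of that first-writer-wins behaviour; the class
and the reason strings below are only examples, not Spark code:

    ```scala
    // Only the first reason recorded for a lost executor makes it into the
    // task-end events; later, more detailed reasons arrive after the tasks have
    // already been marked as failed and so never reach the UI.
    class FirstReasonWins {
      private var recorded: Option[String] = None

      def executorLost(reason: String): String = synchronized {
        if (recorded.isEmpty) recorded = Some(reason) // later reasons are ignored
        recorded.get
      }
    }

    object RaceDemo extends App {
      val tracker = new FirstReasonWins
      // The network effectively decides the order of these two calls. Swap them
      // and the UI message changes from the generic one to the specific one.
      println(tracker.executorLost("remote Rpc client disassociated"))           // #1
      println(tracker.executorLost("Container killed by YARN: memory exceeded")) // #2, too late
    }
    ```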
    
    We can't control when the events are received, since that depends on
factors like network latencies and the executor's distance from the driver. So,
as things stand right now, the UI shows the correct message for some killed
containers and the generic message for the rest. Keep in mind that the logs
always contain all the messages, so one can always figure out from the logs why
and how many containers are being killed.
    
    Any thoughts on whether it's worth digging into the guts of this in this
pass, and if so, how? In my opinion, this change as it stands is already an
incremental benefit (before, we weren't getting **any** error messages in the
driver log or the UI), so it makes sense to commit it the way it is.

