Github user sarutak commented on the pull request:

    https://github.com/apache/spark/pull/1619#issuecomment-50442727
  
    @witgo @pwendell I have already noticed that there is no timeout 
configuration for ConnectionManager, but a timeout on ConnectionManager would 
not resolve this issue: the channel used to receive the ack is implemented with 
non-blocking I/O, and SO_TIMEOUT only affects blocking reads after a connection 
is established. So if a remote executor hangs, the fetching executors cannot 
establish connections to it.
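    As an aside, the SO_TIMEOUT behavior described above can be seen with a 
minimal, purely illustrative JDK NIO sketch (this is not Spark code): on a 
SocketChannel in non-blocking mode, read() returns immediately instead of 
honoring the socket's SO_TIMEOUT, so a hung peer never triggers a read timeout.

```scala
import java.net.InetSocketAddress
import java.nio.ByteBuffer
import java.nio.channels.{ServerSocketChannel, SocketChannel}

// A listening socket that never sends anything, standing in for a hung peer.
val server = ServerSocketChannel.open()
server.bind(new InetSocketAddress("127.0.0.1", 0))

val client = SocketChannel.open()
client.configureBlocking(false) // ConnectionManager's channels are also non-blocking
client.connect(new InetSocketAddress("127.0.0.1", server.socket().getLocalPort))
while (!client.finishConnect()) Thread.`yield`()

client.socket().setSoTimeout(100) // only matters for *blocking* stream reads

// The non-blocking read returns 0 immediately; SO_TIMEOUT never fires.
val n = client.read(ByteBuffer.allocate(16))
println(s"non-blocking read returned $n immediately")

client.close(); server.close()
```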
    
    Additionally, BasicBlockFetcherIterator waits on LinkedBlockingQueue#take 
(result.take), so we should put a FetchResult object whose size is -1 into the 
results queue of BasicBlockFetcherIterator.
    (A FetchResult whose size is -1 means the fetch failed.)
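    To illustrate the sentinel idea (this is a simplified stand-in, not Spark's 
actual FetchResult or iterator code): a consumer blocked on 
LinkedBlockingQueue#take is released as soon as a size = -1 result is enqueued, 
so the caller can surface the failure instead of blocking forever.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Hypothetical simplified FetchResult; the real Spark class has more fields.
case class FetchResult(blockId: String, size: Long)

val results = new LinkedBlockingQueue[FetchResult]()

// Consumer thread blocks on take(), like BasicBlockFetcherIterator does.
var received: FetchResult = null
val consumer = new Thread(() => { received = results.take() })
consumer.start()

// On a remote failure, enqueue a sentinel with size -1 so take() returns.
results.put(FetchResult("block_0_0", -1))
consumer.join()

assert(received.size == -1) // caller can now report the fetch failure
```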
    
    I think remote errors can be classified into the following 2 cases.
    
    1) The remote executor hangs
    In this case, we need a timeout for the fetch request (not a read timeout).
    I'm trying to resolve this case in https://github.com/apache/spark/pull/1632
    
    2) The remote executor does not hang, but an error occurs
    In this case, the remote executor should send a message indicating that an 
error occurred on the remote executor.
    I'm trying to resolve this case in https://github.com/apache/spark/pull/1490
    This is ongoing.
    Can anyone review this too? 
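    For case 1, one way to sketch a fetch-request timeout (purely illustrative, 
not necessarily the approach taken in the PR above) is to replace the unbounded 
take with a bounded poll and synthesize a size = -1 result when nothing arrives 
in time:

```scala
import java.util.concurrent.{LinkedBlockingQueue, TimeUnit}

// Hypothetical simplified FetchResult, as in the failure-sentinel sketch.
case class FetchResult(blockId: String, size: Long)

val results = new LinkedBlockingQueue[FetchResult]()

// No response ever arrives (the remote executor "hangs"), so poll times out
// and we synthesize a failed result instead of blocking forever on take().
val r = Option(results.poll(200, TimeUnit.MILLISECONDS))
  .getOrElse(FetchResult("block_0_0", -1))

assert(r.size == -1) // the hang is converted into a reportable fetch failure
```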
    
    


