[ 
https://issues.apache.org/jira/browse/SPARK-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14163859#comment-14163859
 ] 

Chen Song commented on SPARK-3633:
----------------------------------

Looks like we have addressed the fetch failure caused by "Too many open files". 
Does anyone have more insight into the timeout?

The timeout happened during the transfer of a BufferAckMessage between the 
sender and the receiver. To shed more light on this issue, I turned on DEBUG 
level logging, which traces the life cycle of this message.

* On the sender host, sending the message looks healthy.
{noformat}
14/09/25 19:59:48 DEBUG ConnectionManager: Before Sending [BufferAckMessage(aid 
= 582, id = 1503, size = 9601)] to [ConnectionManagerId(receiver_host,52315)] 
connectionid: sender_host_60072_260
14/09/25 19:59:48 DEBUG ConnectionManager: Sending [BufferAckMessage(aid = 582, 
id = 1503, size = 9601)] to [ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 DEBUG SendingConnection: Added [BufferAckMessage(aid = 582, 
id = 1503, size = 9601)] to outbox for sending to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 DEBUG SendingConnection: Starting to send 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 TRACE SendingConnection: Sending chunk from 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)]
14/09/25 19:59:48 DEBUG SendingConnection: Finished sending 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] to 
[ConnectionManagerId(receiver_host,52315)] in 22 ms
{noformat}

* On the receiver host, receiving the message stalled for about 7.5 minutes 
(from 14/09/25 19:59:48 to 14/09/25 20:07:14; the log reports 445535 ms), and 
the timeout exception was thrown in between.
{noformat}
14/09/25 19:59:48 DEBUG ReceivingConnection: Starting to receive 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 19:59:48 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 19:59:48 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 19:59:48 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 19:59:48 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 DEBUG ReceivingConnection: Finished receiving 
[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from 
[ConnectionManagerId(sender_host,60072)] in 445535 ms
14/09/25 20:07:14 DEBUG ConnectionManager: Received [BufferAckMessage(aid = 
582, id = 1503, size = 9601)] from [ConnectionManagerId(sender_host,60072)]
14/09/25 20:07:14 DEBUG ConnectionManager: Handling [BufferAckMessage(aid = 
582, id = 1503, size = 9601)] from [ConnectionManagerId(sender_host,60072)]
{noformat}
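
To locate a stall like this in a longer log, the gap between consecutive timestamped lines can be computed mechanically. Below is a minimal sketch, assuming the `yy/MM/dd HH:mm:ss` prefix seen in the logs above. The helper names (`parse_log_time`, `largest_gap`) and the shortened sample lines are hypothetical and only for illustration; they are not part of Spark.

```python
from datetime import datetime

# Timestamp layout assumed from the log lines above: yy/MM/dd HH:mm:ss
LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"


def parse_log_time(line):
    """Return the timestamp at the start of a log line, or None for wrapped continuation lines."""
    try:
        return datetime.strptime(line[:17], LOG_TIME_FORMAT)
    except ValueError:
        return None


def largest_gap(lines):
    """Return (gap_seconds, before, after) for the longest pause between consecutive timestamped lines."""
    times = [t for t in map(parse_log_time, lines) if t is not None]
    best = (0.0, None, None)
    for prev, cur in zip(times, times[1:]):
        gap = (cur - prev).total_seconds()
        if gap > best[0]:
            best = (gap, prev, cur)
    return best


# Shortened, hypothetical excerpt of the receiver-side log above.
sample = [
    "14/09/25 19:59:48 DEBUG ReceivingConnection: Starting to receive",
    "[BufferAckMessage(aid = 582, id = 1503, size = 9601)] from",
    "14/09/25 19:59:48 TRACE ReceivingConnection: Receiving chunk of",
    "14/09/25 20:07:14 TRACE ReceivingConnection: Receiving chunk of",
    "14/09/25 20:07:14 DEBUG ReceivingConnection: Finished receiving",
]

gap_seconds, before, after = largest_gap(sample)
print(gap_seconds, before, after)
```

On this excerpt the largest gap is 446 seconds, which agrees with the 445535 ms reported by the "Finished receiving" line once the one-second timestamp resolution is taken into account.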


> Fetches failure observed after SPARK-2711
> -----------------------------------------
>
>                 Key: SPARK-3633
>                 URL: https://issues.apache.org/jira/browse/SPARK-3633
>             Project: Spark
>          Issue Type: Bug
>          Components: Block Manager
>    Affects Versions: 1.1.0
>            Reporter: Nishkam Ravi
>            Priority: Critical
>
> Running a variant of PageRank on a 6-node cluster with a 30Gb input dataset. 
> Recently upgraded to Spark 1.1. The workload fails with the following error 
> message(s):
> {code}
> 14/09/19 12:10:38 WARN TaskSetManager: Lost task 51.0 in stage 2.1 (TID 552, 
> c1705.halxg.cloudera.com): FetchFailed(BlockManagerId(1, 
> c1706.halxg.cloudera.com, 49612, 0), shuffleId=3, mapId=75, reduceId=120)
> 14/09/19 12:10:38 INFO DAGScheduler: Resubmitting failed stages
> {code}
> In order to identify the problem, I carried out change set analysis. As I go 
> back in time, the error message changes to:
> {code}
> 14/09/21 12:56:54 WARN TaskSetManager: Lost task 35.0 in stage 3.0 (TID 519, 
> c1706.halxg.cloudera.com): java.io.FileNotFoundException: 
> /var/lib/jenkins/workspace/tmp/spark-local-20140921123257-68ee/1c/temp_3a1ade13-b48a-437a-a466-673995304034 (Too many open files)
>         java.io.FileOutputStream.open(Native Method)
>         java.io.FileOutputStream.<init>(FileOutputStream.java:221)
>         org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:117)
>         org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:185)
>         org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:197)
>         org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:145)
>         org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
>         org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:51)
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:745)
> {code}
> All the way until Aug 4th. Turns out the problem changeset is 4fde28c. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
