[jira] [Commented] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15899334#comment-15899334 ] Apache Spark commented on SPARK-19831: -- User 'hustfxj' has created a pull request for this issue: https://github.com/apache/spark/pull/17189 > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not receive the heartbeat master by *receive* method. > Because any other rpc message may block the *receive* method. Then worker > won't receive the heartbeat message timely. So it had better send the > heartbeat master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15898704#comment-15898704 ] hustfxj commented on SPARK-19831: - [~zsxwing]. I only find the code which handles *ApplicationFinished* message is slow until now. So I also think such codes should be run in a separate thread. > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not send the heartbeat master by Rpc channel. Because any > other rpc message may block the rpc channel. It had better send the heartbeat > master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15897789#comment-15897789 ] Shixiong Zhu commented on SPARK-19831: -- Cores running in the receive method should be quick. If that's not true, such codes should be run in a separate thread. Which part of codes in Worker did you find is very slow? > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not send the heartbeat master by Rpc channel. Because any > other rpc message may block the rpc channel. It had better send the heartbeat > master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org