[jira] [Updated] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-41187:
Description:
We have a long-running Thrift server in which we found a memory leak. One of the leaks looks like the image below.

!image-2022-11-18-10-57-49-230.png!

The event queue size in our prod environment is set very large to avoid dropping messages, but we still see dropped messages in the log. The event processing time becomes very long and events accumulate in the queue. In a heap dump we also found that the number of LiveExecutor instances had grown very large. After examining the heap dump, we finally found the reason.

!image-2022-11-18-11-01-57-435.png!

The reason: for the tasks of a shuffle map stage, if an executor is lost, the already finished tasks are resubmitted, and a taskEnd message with reason "Resubmitted" is sent out in TaskSetManager.scala. This causes the activeTasks counter in the AppStatusListener's liveStage to become negative.

{code:java}
override def executorLost(execId: String, host: String, reason: ExecutorLossReason): Unit = {
  // Re-enqueue any tasks that ran on the failed executor if this is a shuffle map stage,
  // and we are not using an external shuffle server which could serve the shuffle outputs.
  // The reason is the next stage wouldn't be able to fetch the data from this dead executor
  // so we would need to rerun these tasks on other executors.
  if (isShuffleMapTasks && !env.blockManager.externalShuffleServiceEnabled && !isZombie) {
    for ((tid, info) <- taskInfos if info.executorId == execId) {
      val index = info.index
      // We may have a running task whose partition has been marked as successful,
      // this partition has another task completed in another stage attempt.
      // We treat it as a running task and will call handleFailedTask later.
      if (successful(index) && !info.running && !killedByOtherAttempt.contains(tid)) {
        successful(index) = false
        copiesRunning(index) -= 1
        tasksSuccessful -= 1
        addPendingTask(index)
        // Tell the DAGScheduler that this task was resubmitted so that it doesn't think our
        // stage finishes when a total of tasks.size tasks finish.
        sched.dagScheduler.taskEnded(
          tasks(index), Resubmitted, null, Seq.empty, Array.empty, info)
      }
    }
  }{code}

!image-2022-11-18-11-09-34-760.png!

If a liveStage's activeTasks counter is negative, the stage is never removed from liveStages. As a consequence, an executor that has been moved to deadExecutors is never removed either, because a dead executor is only cleaned up once there is no live stage that was submitted before the executor's remove time.

{code:java}
/** Was the specified executor active for any currently live stages? */
private def isExecutorActiveForLiveStages(exec: LiveExecutor): Boolean = {
  liveStages.values.asScala.exists { stage =>
    stage.info.submissionTime.getOrElse(0L) < exec.removeTime.getTime
  }
}

override def onStageCompleted(event: SparkListenerStageCompleted): Unit = {
  ...
  // remove any dead executors that were not running for any currently active stages
  deadExecutors.retain((execId, exec) => isExecutorActiveForLiveStages(exec))
}{code}

The corresponding logs from our prod environment are attached. The number of resubmitted tasks equals the activeTasks count in the heap dump for that stage.

!image-20221113232214179.png!

!image-20221113232233952.png!

I hope the description is clear. I will create a pull request later; the fix simply ignores the resubmitted message in AppStatusListener.
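For reference, below is a minimal, self-contained sketch of the fix idea mentioned in the description: skip the synthetic "Resubmitted" task-end events so that a per-stage active-task counter can never be driven negative. This is not the actual Spark patch and not the real AppStatusListener code; the listener class and its counter map are illustrative only, and stage attempts are ignored for brevity.

{code:scala}
import scala.collection.mutable

import org.apache.spark.Resubmitted
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}

// Illustrative listener: tracks how many tasks are believed to be running per stage.
class ActiveTaskCountingListener extends SparkListener {
  // stageId -> number of tasks currently counted as active
  private val activeTasks = mutable.Map.empty[Int, Int].withDefaultValue(0)

  override def onTaskStart(event: SparkListenerTaskStart): Unit = {
    activeTasks(event.stageId) += 1
  }

  override def onTaskEnd(event: SparkListenerTaskEnd): Unit = {
    // A "Resubmitted" task end is bookkeeping for a task that already finished
    // successfully and whose completion was already counted; processing this second
    // end event would decrement the counter again and can drive it below zero, so skip it.
    if (event.reason == Resubmitted) {
      return
    }
    activeTasks(event.stageId) -= 1
  }
}
{code}

In AppStatusListener itself, the same kind of guard would sit wherever the listener decrements its per-stage activeTasks counter, since that counter is later used to decide when stages and dead executors can be cleaned up.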
[jira] [Comment Edited] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17638495#comment-17638495 ] wineternity edited comment on SPARK-41187 at 11/25/22 5:18 AM:
---
Add the corresponding logs in prod env as attachments. The resubmitted task number equals the activeTasks count in the heap dump for that stage.

was (Author: yimo_yym): Add the corresponding logs in prod env:

> [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
>
> Key: SPARK-41187
> URL: https://issues.apache.org/jira/browse/SPARK-41187
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.1
> Reporter: wineternity
> Priority: Major
> Attachments: image-2022-11-18-10-57-49-230.png, image-2022-11-18-11-01-57-435.png, image-2022-11-18-11-09-34-760.png, image-20221113232214179.png, image-20221113232233952.png
>

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-41187:
Attachment: image-2022-11-18-11-09-34-760.png

> [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
>
> Key: SPARK-41187
> URL: https://issues.apache.org/jira/browse/SPARK-41187
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.1
> Reporter: wineternity
> Priority: Major
> Attachments: image-2022-11-18-10-57-49-230.png, image-2022-11-18-11-01-57-435.png, image-2022-11-18-11-09-34-760.png
>
> We have a long-running Thrift server in which we found a memory leak. One of the leaks looks like the image below.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-41187:
Attachment: image-2022-11-18-11-01-57-435.png

> [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
>
> Key: SPARK-41187
> URL: https://issues.apache.org/jira/browse/SPARK-41187
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.1
> Reporter: wineternity
> Priority: Major
> Attachments: image-2022-11-18-10-57-49-230.png, image-2022-11-18-11-01-57-435.png
>
> We have a long-running Thrift server in which we found a memory leak. One of the leaks looks like the image below.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-41187:
Attachment: image-2022-11-18-10-57-49-230.png

> [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
>
> Key: SPARK-41187
> URL: https://issues.apache.org/jira/browse/SPARK-41187
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.1
> Reporter: wineternity
> Priority: Major
> Attachments: image-2022-11-18-10-57-49-230.png
>
> We have a long-running Thrift server in which we found a memory leak. One of the leaks looks like the image below.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-41187:
Description: We have a long-running Thrift server in which we found a memory leak. One of the leaks looks like the image below.
(was: We have a long-running Thrift server in which we found a memory leak.)

> [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
>
> Key: SPARK-41187
> URL: https://issues.apache.org/jira/browse/SPARK-41187
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.1
> Reporter: wineternity
> Priority: Major
>
> We have a long-running Thrift server in which we found a memory leak. One of the leaks looks like the image below.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
[ https://issues.apache.org/jira/browse/SPARK-41187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-41187:
Description: We have a long-running Thrift server in which we found a memory leak.

> [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
>
> Key: SPARK-41187
> URL: https://issues.apache.org/jira/browse/SPARK-41187
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.1.2, 3.2.0, 3.1.3, 3.3.1
> Reporter: wineternity
> Priority: Major
>
> We have a long-running Thrift server in which we found a memory leak.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41187) [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen
wineternity created SPARK-41187: --- Summary: [Core] LiveExecutor MemoryLeak in AppStatusListener when ExecutorLost happen Key: SPARK-41187 URL: https://issues.apache.org/jira/browse/SPARK-41187 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.1, 3.1.3, 3.2.0, 3.1.2 Reporter: wineternity -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33325) Spark executors pod are not shutting down when losing driver connection
[ https://issues.apache.org/jira/browse/SPARK-33325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17453536#comment-17453536 ] wineternity edited comment on SPARK-33325 at 12/5/21, 7:20 AM: --- Fixed in https://issues.apache.org/jira/browse/SPARK-36532 was (Author: yimo_yym): Fixed in https://issues.apache.org/jira/browse/SPARK-36532 > Spark executors pod are not shutting down when losing driver connection > --- > > Key: SPARK-33325 > URL: https://issues.apache.org/jira/browse/SPARK-33325 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.1 >Reporter: Hadrien Kohl >Priority: Major > > In situations where the executors lose contact with the driver, the java > process does not die. I am looking at what on the kubernetes cluster could > prevent proper clean-up. > The spark driver is started in it's own pod in client mode (pyspark shell > started by jupyter). I works fine most of the time but if the driver process > crashes (OOM or kill signal for instance) the executor complains about the > connection reset by peer and then hangs. > Here's the log from an executor pod that hangs: > {code:java} > 20/11/03 07:35:30 WARN TransportChannelHandler: Exception in connection from > /10.17.0.152:37161 > java.io.IOException: Connection reset by peer > at java.base/sun.nio.ch.FileDispatcherImpl.read0(Native Method) > at java.base/sun.nio.ch.SocketDispatcher.read(Unknown Source) > at java.base/sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source) > at java.base/sun.nio.ch.IOUtil.read(Unknown Source) > at java.base/sun.nio.ch.IOUtil.read(Unknown Source) > at java.base/sun.nio.ch.SocketChannelImpl.read(Unknown Source) > at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:253) > at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1133) > at > io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) > at > io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) > at > io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) > at > io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) > at java.base/java.lang.Thread.run(Unknown Source) > 20/11/03 07:35:30 ERROR CoarseGrainedExecutorBackend: Executor self-exiting > due to : Driver 10.17.0.152:37161 disassociated! Shutting down. > 20/11/03 07:35:31 INFO MemoryStore: MemoryStore cleared > 20/11/03 07:35:31 INFO BlockManager: BlockManager stopped > {code} > When start a shell in the pod I can see the process are still running: > {code:java} > UID PIDPPID CSZ RSS PSR STIME TTY TIME CMD > 185 125 0 0 5045 3968 2 10:07 pts/000:00:00 /bin/bash > 185 166 125 0 9019 3364 1 10:39 pts/000:00:00 \_ ps > -AF --forest > 1851 0 0 1130 768 0 07:34 ?00:00:00 > /usr/bin/tini -s -- /opt/java/openjdk/ > 185 14 1 0 1935527 493976 3 07:34 ? 
00:00:21 > /opt/java/openjdk/bin/java -Dspark.dri > {code} > Here's the full command used to start the executor: > {code:java} > /opt/java/openjdk/ > bin/java -Dspark.driver.port=37161 -Xms4g -Xmx4g -cp :/opt/spark/jars/*: > org.apache.spark.executor.CoarseG > rainedExecutorBackend --driver-url > spark://CoarseGrainedScheduler@10.17.0.152:37161 --executor-id 1 --core > s 1 --app-id spark-application-1604388891044 --hostname 10.17.2.151 > {code} > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36860) Create the external hive table for HBase failed
[ https://issues.apache.org/jira/browse/SPARK-36860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-36860:
Attachment: image-2021-09-27-14-25-28-900.png

> Create the external hive table for HBase failed
>
> Key: SPARK-36860
> URL: https://issues.apache.org/jira/browse/SPARK-36860
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: wineternity
> Priority: Major
> Attachments: image-2021-09-27-14-18-10-910.png, image-2021-09-27-14-25-28-900.png
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36860) Create the external hive table for HBase failed
[ https://issues.apache.org/jira/browse/SPARK-36860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420497#comment-17420497 ] wineternity commented on SPARK-36860:
-
Thanks, [~sarutak]. May I ask why Spark doesn't support creating a Hive table using storage handlers? It seems Spark's parser does accept the STORED BY syntax, and the data is already in CreateFileFormatContext.

!image-2021-09-27-14-18-10-910.png!

But validateRowFormatFileFormat in AstBuilder later only checks the file format provided by the STORED AS syntax, and since the file format is null here, it throws an exception. Maybe the STORED BY clause could be supported by fixing this check?

!image-2021-09-27-14-25-28-900.png!

> Create the external hive table for HBase failed
>
> Key: SPARK-36860
> URL: https://issues.apache.org/jira/browse/SPARK-36860
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: wineternity
> Priority: Major
> Attachments: image-2021-09-27-14-18-10-910.png
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36860) Create the external hive table for HBase failed
[ https://issues.apache.org/jira/browse/SPARK-36860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wineternity updated SPARK-36860:
Attachment: image-2021-09-27-14-18-10-910.png

> Create the external hive table for HBase failed
>
> Key: SPARK-36860
> URL: https://issues.apache.org/jira/browse/SPARK-36860
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.2
> Reporter: wineternity
> Priority: Major
> Attachments: image-2021-09-27-14-18-10-910.png
>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36860) Create the external hive table for HBase failed
wineternity created SPARK-36860:
---
Summary: Create the external hive table for HBase failed
Key: SPARK-36860
URL: https://issues.apache.org/jira/browse/SPARK-36860
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.1.2
Reporter: wineternity

We use the following SQL to create a Hive external table that reads from HBase:

{code:java}
CREATE EXTERNAL TABLE if not exists dev.sanyu_spotlight_headline_material(
   rowkey string COMMENT 'HBase主键',
   content string COMMENT '图文正文')
USING HIVE
ROW FORMAT SERDE
   'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY
   'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
   'hbase.columns.mapping'=':key, cf1:content'
)
TBLPROPERTIES (
   'hbase.table.name'='spotlight_headline_material'
);
{code}

But the SQL fails in Spark 3.1.2, which throws this exception:

{code:java}
21/09/27 11:44:24 INFO scheduler.DAGScheduler: Asked to cancel job group 26d7459f-7b58-4c18-9939-5f2737525ff2
21/09/27 11:44:24 ERROR thriftserver.SparkExecuteStatementOperation: Error executing query with 26d7459f-7b58-4c18-9939-5f2737525ff2, currentState RUNNING,
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: Unexpected combination of ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe' and STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'WITHSERDEPROPERTIES('hbase.columns.mapping'=':key, cf1:content')(line 5, pos 0)
{code}

This check was introduced by this change: [https://github.com/apache/spark/pull/28026]

Could anyone explain how to create an external table for HBase in Spark 3 now?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org