[ https://issues.apache.org/jira/browse/SPARK-28483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weichen Xu updated SPARK-28483: ------------------------------- Description: Reproduce code: {code:java} import time from pyspark import BarrierTaskContext n = 4 def task(x): context = BarrierTaskContext.get() this = next(x) if (this % 2 == 0): time.sleep(10000) context.barrier() return [] sc.setLogLevel("INFO") sc.parallelize(list(range(n)), n).barrier().mapPartitions(task).collect(){code} Run above code in pyspark shell and then print Ctrl + C to exit the job. Get logging like: {code} 19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0. 19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0. 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled 19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 60 seconds, current barrier epoch is 0. 19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 60 seconds, current barrier epoch is 0. 19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 120 seconds, current barrier epoch is 0. 19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 120 seconds, current barrier epoch is 0. 19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 180 seconds, current barrier epoch is 0. 19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 180 seconds, current barrier epoch is 0. 19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 240 seconds, current barrier epoch is 0. 19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 240 seconds, current barrier epoch is 0. {code} was: Reproduce code: {code:java} import time from pyspark import BarrierTaskContext n = 4 def task(x): context = BarrierTaskContext.get() this = next(x) if (this % 2 == 0): time.sleep(10000) context.barrier() return [] sc.setLogLevel("INFO") sc.parallelize(list(range(n)), n).barrier().mapPartitions(task).collect(){code} Get logging like: ``` 19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0. 19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has entered the global sync, current barrier epoch is 0. 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID 0), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID 3), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID 2), reason: Stage cancelled 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) interrupted: Attempting to kill Python Worker 19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID 1), reason: Stage cancelled 19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 60 seconds, current barrier epoch is 0. 19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 60 seconds, current barrier epoch is 0. 19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 120 seconds, current barrier epoch is 0. 19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 120 seconds, current barrier epoch is 0. 19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 180 seconds, current barrier epoch is 0. 19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 180 seconds, current barrier epoch is 0. 19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) waiting under the global sync since 1549328862443, has been waiting for 240 seconds, current barrier epoch is 0. 19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) waiting under the global sync since 1549328862522, has been waiting for 240 seconds, current barrier epoch is 0. ``` > Canceling a spark job using barrier mode but tasks still being printing > messages > -------------------------------------------------------------------------------- > > Key: SPARK-28483 > URL: https://issues.apache.org/jira/browse/SPARK-28483 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.3 > Reporter: Weichen Xu > Priority: Major > > Reproduce code: > {code:java} > import time > from pyspark import BarrierTaskContext > n = 4 > def task(x): > context = BarrierTaskContext.get() > this = next(x) > if (this % 2 == 0): > time.sleep(10000) > context.barrier() > return [] > sc.setLogLevel("INFO") > sc.parallelize(list(range(n)), > n).barrier().mapPartitions(task).collect(){code} > Run above code in pyspark shell and then print Ctrl + C to exit the job. > Get logging like: > {code} > 19/02/05 01:07:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) has > entered the global sync, current barrier epoch is 0. > 19/02/05 01:07:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) has > entered the global sync, current barrier epoch is 0. > 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 2.0 in stage > 0.0 (TID 2), reason: Stage cancelled > 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 0.0 in stage > 0.0 (TID 0), reason: Stage cancelled > 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 1.0 in stage > 0.0 (TID 1), reason: Stage cancelled > 19/02/05 01:07:47 INFO Executor: Executor is trying to kill task 3.0 in stage > 0.0 (TID 3), reason: Stage cancelled > 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 0.0 in stage 0 (TID 0) > interrupted: Attempting to kill Python Worker > 19/02/05 01:07:50 INFO Executor: Executor killed task 0.0 in stage 0.0 (TID > 0), reason: Stage cancelled > 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 3.3 in stage 0 (TID 3) > interrupted: Attempting to kill Python Worker > 19/02/05 01:07:50 INFO Executor: Executor killed task 3.0 in stage 0.0 (TID > 3), reason: Stage cancelled > 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 2.2 in stage 0 (TID 2) > interrupted: Attempting to kill Python Worker > 19/02/05 01:07:50 INFO Executor: Executor killed task 2.0 in stage 0.0 (TID > 2), reason: Stage cancelled > 19/02/05 01:07:50 WARN PythonRunner: Incomplete task 1.1 in stage 0 (TID 1) > interrupted: Attempting to kill Python Worker > 19/02/05 01:07:50 INFO Executor: Executor killed task 1.0 in stage 0.0 (TID > 1), reason: Stage cancelled > 19/02/05 01:08:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862443, has been waiting for 60 > seconds, current barrier epoch is 0. > 19/02/05 01:08:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862522, has been waiting for 60 > seconds, current barrier epoch is 0. > 19/02/05 01:09:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862443, has been waiting for 120 > seconds, current barrier epoch is 0. > 19/02/05 01:09:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862522, has been waiting for 120 > seconds, current barrier epoch is 0. > 19/02/05 01:10:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862443, has been waiting for 180 > seconds, current barrier epoch is 0. > 19/02/05 01:10:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862522, has been waiting for 180 > seconds, current barrier epoch is 0. > 19/02/05 01:11:42 INFO BarrierTaskContext: Task 3 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862443, has been waiting for 240 > seconds, current barrier epoch is 0. > 19/02/05 01:11:42 INFO BarrierTaskContext: Task 1 from Stage 0(Attempt 0) > waiting under the global sync since 1549328862522, has been waiting for 240 > seconds, current barrier epoch is 0. > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org