[ https://issues.apache.org/jira/browse/SPARK-32937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198867#comment-17198867 ]
Holden Karau commented on SPARK-32937: -------------------------------------- So there are additional proposed changes to this chunk of code being reviewed. I'll go ahead and put a pending -1 on them because I don't want us in the situation where it's no longer a simple revert to fix this test suite. If it's not resolved by Monday I'll go ahead and make the revert myself. > DecomissionSuite in k8s integration tests is failing. > ----------------------------------------------------- > > Key: SPARK-32937 > URL: https://issues.apache.org/jira/browse/SPARK-32937 > Project: Spark > Issue Type: Bug > Components: Kubernetes > Affects Versions: 3.1.0 > Reporter: Prashant Sharma > Priority: Major > > Logs from the failing test, copied from jenkins. As of now, it is always > failing. > {code} > - Test basic decommissioning *** FAILED *** > The code passed to eventually never returned normally. Attempted 182 times > over 3.00377927275 minutes. Last failure message: "++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' 3 == 2 ']' > + '[' 3 == 3 ']' > ++ python3 -V > + pyv3='Python 3.7.3' > + export PYTHON_VERSION=3.7.3 > + PYTHON_VERSION=3.7.3 > + export PYSPARK_PYTHON=python3 > + PYSPARK_PYTHON=python3 > + export PYSPARK_DRIVER_PYTHON=python3 > + PYSPARK_DRIVER_PYTHON=python3 > + '[' -n '' ']' > + '[' -z ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file > /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner > local:///opt/spark/tests/decommissioning.py > 20/09/17 11:06:56 WARN NativeCodeLoader: Unable to load native-hadoop > library for your platform... using builtin-java classes where applicable > Starting decom test > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 20/09/17 11:06:56 INFO SparkContext: Running Spark version 3.1.0-SNAPSHOT > 20/09/17 11:06:57 INFO ResourceUtils: > ============================================================== > 20/09/17 11:06:57 INFO ResourceUtils: No custom resources configured for > spark.driver. > 20/09/17 11:06:57 INFO ResourceUtils: > ============================================================== > 20/09/17 11:06:57 INFO SparkContext: Submitted application: PyMemoryTest > 20/09/17 11:06:57 INFO ResourceProfile: Default ResourceProfile created, > executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , > memory -> name: memory, amount: 1024, script: , vendor: ), task resources: > Map(cpus -> name: cpus, amount: 1.0) > 20/09/17 11:06:57 INFO ResourceProfile: Limiting resource is cpus at 1 > tasks per executor > 20/09/17 11:06:57 INFO ResourceProfileManager: Added ResourceProfile id: 0 > 20/09/17 11:06:57 INFO SecurityManager: Changing view acls to: 185,jenkins > 20/09/17 11:06:57 INFO SecurityManager: Changing modify acls to: 185,jenkins > 20/09/17 11:06:57 INFO SecurityManager: Changing view acls groups to: > 20/09/17 11:06:57 INFO SecurityManager: Changing modify acls groups to: > 20/09/17 11:06:57 INFO SecurityManager: SecurityManager: authentication > enabled; ui acls disabled; users with view permissions: Set(185, jenkins); > groups with view permissions: Set(); users with modify permissions: Set(185, > jenkins); groups with modify permissions: Set() > 20/09/17 11:06:57 INFO Utils: Successfully started service 'sparkDriver' on > port 7078. > 20/09/17 11:06:57 INFO SparkEnv: Registering MapOutputTracker > 20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMaster > 20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: Using > org.apache.spark.storage.DefaultTopologyMapper for getting topology > information > 20/09/17 11:06:57 INFO BlockManagerMasterEndpoint: > BlockManagerMasterEndpoint up > 20/09/17 11:06:57 INFO SparkEnv: Registering BlockManagerMasterHeartbeat > 20/09/17 11:06:57 INFO DiskBlockManager: Created local directory at > /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/blockmgr-3bd403d0-6689-46be-997e-5bc699ecefd3 > 20/09/17 11:06:57 INFO MemoryStore: MemoryStore started with capacity 593.9 > MiB > 20/09/17 11:06:57 INFO SparkEnv: Registering OutputCommitCoordinator > 20/09/17 11:06:58 INFO Utils: Successfully started service 'SparkUI' on > port 4040. > 20/09/17 11:06:58 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at > http://spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:4040 > 20/09/17 11:06:58 INFO SparkKubernetesClientFactory: Auto-configuring K8S > client using current context from users K8S config file > 20/09/17 11:06:59 INFO ExecutorPodsAllocator: Going to request 3 executors > from Kubernetes. > 20/09/17 11:06:59 INFO KubernetesClientUtils: Spark configuration files > loaded from Some(/opt/spark/conf) : > 20/09/17 11:07:00 INFO BasicExecutorFeatureStep: Adding decommission script > to lifecycle > 20/09/17 11:07:00 INFO Utils: Successfully started service > 'org.apache.spark.network.netty.NettyBlockTransferService' on port 7079. > 20/09/17 11:07:00 INFO NettyBlockTransferService: Server created on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > 20/09/17 11:07:00 INFO BlockManager: Using > org.apache.spark.storage.RandomBlockReplicationPolicy for block replication > policy > 20/09/17 11:07:00 INFO BlockManagerMaster: Registering BlockManager > BlockManagerId(driver, > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc, > 7079, None) > 20/09/17 11:07:00 INFO BlockManagerMasterEndpoint: Registering block > manager > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > with 593.9 MiB RAM, BlockManagerId(driver, > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc, > 7079, None) > 20/09/17 11:07:00 INFO BlockManagerMaster: Registered BlockManager > BlockManagerId(driver, > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc, > 7079, None) > 20/09/17 11:07:00 INFO KubernetesClientUtils: Spark configuration files > loaded from Some(/opt/spark/conf) : > 20/09/17 11:07:00 INFO BlockManager: Initialized BlockManager: > BlockManagerId(driver, > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc, > 7079, None) > 20/09/17 11:07:00 INFO BasicExecutorFeatureStep: Adding decommission script > to lifecycle > 20/09/17 11:07:00 INFO KubernetesClientUtils: Spark configuration files > loaded from Some(/opt/spark/conf) : > 20/09/17 11:07:00 INFO BasicExecutorFeatureStep: Adding decommission script > to lifecycle > 20/09/17 11:07:00 INFO KubernetesClientUtils: Spark configuration files > loaded from Some(/opt/spark/conf) : > 20/09/17 11:07:05 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered > executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.6:50176) with > ID 2, ResourceProfileId 0 > 20/09/17 11:07:05 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered > executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.5:35624) with > ID 1, ResourceProfileId 0 > 20/09/17 11:07:05 INFO BlockManagerMasterEndpoint: Registering block > manager 172.17.0.6:33547 with 593.9 MiB RAM, BlockManagerId(2, 172.17.0.6, > 33547, None) > 20/09/17 11:07:05 INFO BlockManagerMasterEndpoint: Registering block > manager 172.17.0.5:46327 with 593.9 MiB RAM, BlockManagerId(1, 172.17.0.5, > 46327, None) > 20/09/17 11:07:29 INFO KubernetesClusterSchedulerBackend: SchedulerBackend > is ready for scheduling beginning after waiting > maxRegisteredResourcesWaitingTime: 30000000000(ns) > 20/09/17 11:07:30 INFO SharedState: Setting hive.metastore.warehouse.dir > ('null') to the value of spark.sql.warehouse.dir > ('file:/opt/spark/work-dir/spark-warehouse'). > 20/09/17 11:07:30 INFO SharedState: Warehouse path is > 'file:/opt/spark/work-dir/spark-warehouse'. > 20/09/17 11:07:32 INFO SparkContext: Starting job: collect at > /opt/spark/tests/decommissioning.py:44 > 20/09/17 11:07:32 INFO DAGScheduler: Registering RDD 2 (groupByKey at > /opt/spark/tests/decommissioning.py:43) as input to shuffle 0 > 20/09/17 11:07:32 INFO DAGScheduler: Got job 0 (collect at > /opt/spark/tests/decommissioning.py:44) with 5 output partitions > 20/09/17 11:07:32 INFO DAGScheduler: Final stage: ResultStage 1 (collect at > /opt/spark/tests/decommissioning.py:44) > 20/09/17 11:07:32 INFO DAGScheduler: Parents of final stage: > List(ShuffleMapStage 0) > 20/09/17 11:07:32 INFO DAGScheduler: Missing parents: List(ShuffleMapStage > 0) > 20/09/17 11:07:32 INFO DAGScheduler: Submitting ShuffleMapStage 0 > (PairwiseRDD[2] at groupByKey at /opt/spark/tests/decommissioning.py:43), > which has no missing parents > 20/09/17 11:07:32 INFO MemoryStore: Block broadcast_0 stored as values in > memory (estimated size 10.6 KiB, free 593.9 MiB) > 20/09/17 11:07:32 INFO MemoryStore: Block broadcast_0_piece0 stored as > bytes in memory (estimated size 6.5 KiB, free 593.9 MiB) > 20/09/17 11:07:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > (size: 6.5 KiB, free: 593.9 MiB) > 20/09/17 11:07:32 INFO SparkContext: Created broadcast 0 from broadcast at > DAGScheduler.scala:1348 > 20/09/17 11:07:32 INFO DAGScheduler: Submitting 5 missing tasks from > ShuffleMapStage 0 (PairwiseRDD[2] at groupByKey at > /opt/spark/tests/decommissioning.py:43) (first 15 tasks are for partitions > Vector(0, 1, 2, 3, 4)) > 20/09/17 11:07:32 INFO TaskSchedulerImpl: Adding task set 0.0 with 5 tasks > resource profile 0 > 20/09/17 11:07:32 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID > 0) (172.17.0.6, executor 2, partition 0, PROCESS_LOCAL, 7341 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:32 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID > 1) (172.17.0.5, executor 1, partition 1, PROCESS_LOCAL, 7341 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 172.17.0.5:46327 (size: 6.5 KiB, free: 593.9 MiB) > 20/09/17 11:07:32 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory > on 172.17.0.6:33547 (size: 6.5 KiB, free: 593.9 MiB) > 20/09/17 11:07:34 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID > 2) (172.17.0.5, executor 1, partition 2, PROCESS_LOCAL, 7341 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:34 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID > 1) in 1825 ms on 172.17.0.5 (executor 1) (1/5) > 20/09/17 11:07:34 INFO PythonAccumulatorV2: Connected to AccumulatorServer > at host: 127.0.0.1 port: 47109 > 20/09/17 11:07:34 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID > 3) (172.17.0.6, executor 2, partition 3, PROCESS_LOCAL, 7341 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:34 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID > 0) in 1960 ms on 172.17.0.6 (executor 2) (2/5) > 20/09/17 11:07:34 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID > 4) (172.17.0.5, executor 1, partition 4, PROCESS_LOCAL, 7341 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:34 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID > 2) in 136 ms on 172.17.0.5 (executor 1) (3/5) > 20/09/17 11:07:34 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID > 3) in 119 ms on 172.17.0.6 (executor 2) (4/5) > 20/09/17 11:07:34 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID > 4) in 117 ms on 172.17.0.5 (executor 1) (5/5) > 20/09/17 11:07:34 INFO DAGScheduler: ShuffleMapStage 0 (groupByKey at > /opt/spark/tests/decommissioning.py:43) finished in 2.352 s > 20/09/17 11:07:34 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks > have all completed, from pool > 20/09/17 11:07:34 INFO DAGScheduler: looking for newly runnable stages > 20/09/17 11:07:34 INFO DAGScheduler: running: Set() > 20/09/17 11:07:34 INFO DAGScheduler: waiting: Set(ResultStage 1) > 20/09/17 11:07:34 INFO DAGScheduler: failed: Set() > 20/09/17 11:07:34 INFO DAGScheduler: Submitting ResultStage 1 (PythonRDD[5] > at collect at /opt/spark/tests/decommissioning.py:44), which has no missing > parents > 20/09/17 11:07:34 INFO MemoryStore: Block broadcast_1 stored as values in > memory (estimated size 9.3 KiB, free 593.9 MiB) > 20/09/17 11:07:34 INFO MemoryStore: Block broadcast_1_piece0 stored as > bytes in memory (estimated size 5.4 KiB, free 593.9 MiB) > 20/09/17 11:07:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory > on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:07:34 INFO SparkContext: Created broadcast 1 from broadcast at > DAGScheduler.scala:1348 > 20/09/17 11:07:34 INFO DAGScheduler: Submitting 5 missing tasks from > ResultStage 1 (PythonRDD[5] at collect at > /opt/spark/tests/decommissioning.py:44) (first 15 tasks are for partitions > Vector(0, 1, 2, 3, 4)) > 20/09/17 11:07:34 INFO TaskSchedulerImpl: Adding task set 1.0 with 5 tasks > resource profile 0 > 20/09/17 11:07:34 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID > 5) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:34 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID > 6) (172.17.0.5, executor 1, partition 1, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory > on 172.17.0.6:33547 (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:07:34 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory > on 172.17.0.5:46327 (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:07:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map > output locations for shuffle 0 to 172.17.0.5:35624 > 20/09/17 11:07:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map > output locations for shuffle 0 to 172.17.0.6:50176 > 20/09/17 11:07:35 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID > 7) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:35 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID > 5) in 729 ms on 172.17.0.6 (executor 2) (1/5) > 20/09/17 11:07:35 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID > 8) (172.17.0.5, executor 1, partition 3, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:35 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID > 6) in 755 ms on 172.17.0.5 (executor 1) (2/5) > 20/09/17 11:07:35 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID > 9) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:07:35 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID > 7) in 113 ms on 172.17.0.6 (executor 2) (3/5) > 20/09/17 11:07:35 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID > 8) in 104 ms on 172.17.0.5 (executor 1) (4/5) > 20/09/17 11:07:35 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID > 9) in 82 ms on 172.17.0.6 (executor 2) (5/5) > 20/09/17 11:07:35 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks > have all completed, from pool > 20/09/17 11:07:35 INFO DAGScheduler: ResultStage 1 (collect at > /opt/spark/tests/decommissioning.py:44) finished in 0.943 s > 20/09/17 11:07:35 INFO DAGScheduler: Job 0 is finished. Cancelling > potential speculative or zombie tasks for this job > 20/09/17 11:07:35 INFO TaskSchedulerImpl: Killing all running tasks in > stage 1: Stage finished > 20/09/17 11:07:35 INFO DAGScheduler: Job 0 finished: collect at > /opt/spark/tests/decommissioning.py:44, took 3.420388 s > 1st accumulator value is: 100 > Waiting to give nodes time to finish migration, decom exec 1. > ... > 20/09/17 11:07:36 WARN > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Received executor > 1 decommissioned message > 20/09/17 11:07:36 INFO ShuffleStatus: Updating map output for 4 to > BlockManagerId(2, 172.17.0.6, 33547, None) > 20/09/17 11:07:36 INFO ShuffleStatus: Updating map output for 1 to > BlockManagerId(2, 172.17.0.6, 33547, None) > 20/09/17 11:07:36 INFO ShuffleStatus: Updating map output for 2 to > BlockManagerId(2, 172.17.0.6, 33547, None) > 20/09/17 11:07:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > in memory (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:07:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on > 172.17.0.5:46327 in memory (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:07:36 INFO BlockManagerInfo: Removed broadcast_1_piece0 on > 172.17.0.6:33547 in memory (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:07:37 ERROR TaskSchedulerImpl: Lost executor 1 on 172.17.0.5: > Executor decommission. > 20/09/17 11:07:37 INFO DAGScheduler: Executor lost: 1 (epoch 1) > 20/09/17 11:07:37 INFO BlockManagerMasterEndpoint: Trying to remove > executor 1 from BlockManagerMaster. > 20/09/17 11:07:37 INFO BlockManagerMasterEndpoint: Removing block manager > BlockManagerId(1, 172.17.0.5, 46327, None) > 20/09/17 11:07:37 INFO BlockManagerMaster: Removed 1 successfully in > removeExecutor > 20/09/17 11:07:37 INFO DAGScheduler: Shuffle files lost for executor: 1 > (epoch 1) > 20/09/17 11:07:41 INFO ExecutorPodsAllocator: Going to request 1 executors > from Kubernetes. > 20/09/17 11:07:41 INFO BasicExecutorFeatureStep: Adding decommission script > to lifecycle > 20/09/17 11:07:41 INFO KubernetesClientUtils: Spark configuration files > loaded from Some(/opt/spark/conf) : > 20/09/17 11:07:43 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Registered > executor NettyRpcEndpointRef(spark-client://Executor) (172.17.0.5:35848) with > ID 3, ResourceProfileId 0 > 20/09/17 11:07:43 INFO BlockManagerMasterEndpoint: Registering block > manager 172.17.0.5:34299 with 593.9 MiB RAM, BlockManagerId(3, 172.17.0.5, > 34299, None) > 20/09/17 11:08:05 INFO SparkContext: Starting job: count at > /opt/spark/tests/decommissioning.py:49 > 20/09/17 11:08:05 INFO DAGScheduler: Got job 1 (count at > /opt/spark/tests/decommissioning.py:49) with 5 output partitions > 20/09/17 11:08:05 INFO DAGScheduler: Final stage: ResultStage 3 (count at > /opt/spark/tests/decommissioning.py:49) > 20/09/17 11:08:05 INFO DAGScheduler: Parents of final stage: > List(ShuffleMapStage 2) > 20/09/17 11:08:05 INFO DAGScheduler: Missing parents: List() > 20/09/17 11:08:05 INFO DAGScheduler: Submitting ResultStage 3 (PythonRDD[6] > at count at /opt/spark/tests/decommissioning.py:49), which has no missing > parents > 20/09/17 11:08:05 INFO MemoryStore: Block broadcast_2 stored as values in > memory (estimated size 10.6 KiB, free 593.9 MiB) > 20/09/17 11:08:05 INFO MemoryStore: Block broadcast_2_piece0 stored as > bytes in memory (estimated size 5.9 KiB, free 593.9 MiB) > 20/09/17 11:08:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory > on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > (size: 5.9 KiB, free: 593.9 MiB) > 20/09/17 11:08:05 INFO SparkContext: Created broadcast 2 from broadcast at > DAGScheduler.scala:1348 > 20/09/17 11:08:05 INFO DAGScheduler: Submitting 5 missing tasks from > ResultStage 3 (PythonRDD[6] at count at > /opt/spark/tests/decommissioning.py:49) (first 15 tasks are for partitions > Vector(0, 1, 2, 3, 4)) > 20/09/17 11:08:05 INFO TaskSchedulerImpl: Adding task set 3.0 with 5 tasks > resource profile 0 > 20/09/17 11:08:05 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID > 10) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:05 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory > on 172.17.0.6:33547 (size: 5.9 KiB, free: 593.9 MiB) > 20/09/17 11:08:05 INFO MapOutputTrackerMasterEndpoint: Asked to send map > output locations for shuffle 0 to 172.17.0.6:50176 > 20/09/17 11:08:05 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID > 10) in 133 ms on 172.17.0.6 (executor 2) (1/5) > 20/09/17 11:08:05 INFO TaskSetManager: Starting task 1.0 in stage 3.0 (TID > 11) (172.17.0.6, executor 2, partition 1, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:05 INFO TaskSetManager: Starting task 2.0 in stage 3.0 (TID > 12) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:05 INFO TaskSetManager: Finished task 1.0 in stage 3.0 (TID > 11) in 81 ms on 172.17.0.6 (executor 2) (2/5) > 20/09/17 11:08:05 INFO TaskSetManager: Starting task 3.0 in stage 3.0 (TID > 13) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:05 INFO TaskSetManager: Finished task 2.0 in stage 3.0 (TID > 12) in 85 ms on 172.17.0.6 (executor 2) (3/5) > 20/09/17 11:08:05 INFO TaskSetManager: Starting task 4.0 in stage 3.0 (TID > 14) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:05 INFO TaskSetManager: Finished task 3.0 in stage 3.0 (TID > 13) in 73 ms on 172.17.0.6 (executor 2) (4/5) > 20/09/17 11:08:06 INFO TaskSetManager: Finished task 4.0 in stage 3.0 (TID > 14) in 91 ms on 172.17.0.6 (executor 2) (5/5) > 20/09/17 11:08:06 INFO TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks > have all completed, from pool > 20/09/17 11:08:06 INFO DAGScheduler: ResultStage 3 (count at > /opt/spark/tests/decommissioning.py:49) finished in 0.478 s > 20/09/17 11:08:06 INFO DAGScheduler: Job 1 is finished. Cancelling > potential speculative or zombie tasks for this job > 20/09/17 11:08:06 INFO TaskSchedulerImpl: Killing all running tasks in > stage 3: Stage finished > 20/09/17 11:08:06 INFO DAGScheduler: Job 1 finished: count at > /opt/spark/tests/decommissioning.py:49, took 0.489355 s > 20/09/17 11:08:06 INFO SparkContext: Starting job: collect at > /opt/spark/tests/decommissioning.py:50 > 20/09/17 11:08:06 INFO DAGScheduler: Got job 2 (collect at > /opt/spark/tests/decommissioning.py:50) with 5 output partitions > 20/09/17 11:08:06 INFO DAGScheduler: Final stage: ResultStage 5 (collect at > /opt/spark/tests/decommissioning.py:50) > 20/09/17 11:08:06 INFO DAGScheduler: Parents of final stage: > List(ShuffleMapStage 4) > 20/09/17 11:08:06 INFO DAGScheduler: Missing parents: List() > 20/09/17 11:08:06 INFO BlockManagerInfo: Removed broadcast_2_piece0 on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > in memory (size: 5.9 KiB, free: 593.9 MiB) > 20/09/17 11:08:06 INFO DAGScheduler: Submitting ResultStage 5 (PythonRDD[5] > at collect at /opt/spark/tests/decommissioning.py:44), which has no missing > parents > 20/09/17 11:08:06 INFO BlockManagerInfo: Removed broadcast_2_piece0 on > 172.17.0.6:33547 in memory (size: 5.9 KiB, free: 593.9 MiB) > 20/09/17 11:08:06 INFO MemoryStore: Block broadcast_3 stored as values in > memory (estimated size 9.3 KiB, free 593.9 MiB) > 20/09/17 11:08:06 INFO MemoryStore: Block broadcast_3_piece0 stored as > bytes in memory (estimated size 5.4 KiB, free 593.9 MiB) > 20/09/17 11:08:06 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory > on > spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:7079 > (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:08:06 INFO SparkContext: Created broadcast 3 from broadcast at > DAGScheduler.scala:1348 > 20/09/17 11:08:06 INFO DAGScheduler: Submitting 5 missing tasks from > ResultStage 5 (PythonRDD[5] at collect at > /opt/spark/tests/decommissioning.py:44) (first 15 tasks are for partitions > Vector(0, 1, 2, 3, 4)) > 20/09/17 11:08:06 INFO TaskSchedulerImpl: Adding task set 5.0 with 5 tasks > resource profile 0 > 20/09/17 11:08:06 INFO TaskSetManager: Starting task 0.0 in stage 5.0 (TID > 15) (172.17.0.6, executor 2, partition 0, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:06 INFO BlockManagerInfo: Added broadcast_3_piece0 in memory > on 172.17.0.6:33547 (size: 5.4 KiB, free: 593.9 MiB) > 20/09/17 11:08:06 INFO TaskSetManager: Starting task 1.0 in stage 5.0 (TID > 16) (172.17.0.6, executor 2, partition 1, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:06 INFO TaskSetManager: Finished task 0.0 in stage 5.0 (TID > 15) in 105 ms on 172.17.0.6 (executor 2) (1/5) > 20/09/17 11:08:06 INFO TaskSetManager: Finished task 1.0 in stage 5.0 (TID > 16) in 84 ms on 172.17.0.6 (executor 2) (2/5) > 20/09/17 11:08:06 INFO TaskSetManager: Starting task 2.0 in stage 5.0 (TID > 17) (172.17.0.6, executor 2, partition 2, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:06 INFO TaskSetManager: Starting task 3.0 in stage 5.0 (TID > 18) (172.17.0.6, executor 2, partition 3, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:06 INFO TaskSetManager: Finished task 2.0 in stage 5.0 (TID > 17) in 76 ms on 172.17.0.6 (executor 2) (3/5) > 20/09/17 11:08:06 INFO TaskSetManager: Starting task 4.0 in stage 5.0 (TID > 19) (172.17.0.6, executor 2, partition 4, NODE_LOCAL, 7162 bytes) > taskResourceAssignments Map() > 20/09/17 11:08:06 INFO TaskSetManager: Finished task 3.0 in stage 5.0 (TID > 18) in 72 ms on 172.17.0.6 (executor 2) (4/5) > 20/09/17 11:08:06 INFO TaskSetManager: Finished task 4.0 in stage 5.0 (TID > 19) in 90 ms on 172.17.0.6 (executor 2) (5/5) > 20/09/17 11:08:06 INFO TaskSchedulerImpl: Removed TaskSet 5.0, whose tasks > have all completed, from pool > 20/09/17 11:08:06 INFO DAGScheduler: ResultStage 5 (collect at > /opt/spark/tests/decommissioning.py:50) finished in 0.448 s > 20/09/17 11:08:06 INFO DAGScheduler: Job 2 is finished. Cancelling > potential speculative or zombie tasks for this job > 20/09/17 11:08:06 INFO TaskSchedulerImpl: Killing all running tasks in > stage 5: Stage finished > 20/09/17 11:08:06 INFO DAGScheduler: Job 2 finished: collect at > /opt/spark/tests/decommissioning.py:50, took 0.460430 s > Final accumulator value is: 100 > Finished waiting, stopping Spark. > 20/09/17 11:08:06 INFO SparkUI: Stopped Spark web UI at > http://spark-test-app-08853d749bbee080-driver-svc.a0af92633bef4a91b5f7e262e919afd9.svc:4040 > 20/09/17 11:08:06 INFO KubernetesClusterSchedulerBackend: Shutting down all > executors > 20/09/17 11:08:06 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 20/09/17 11:08:06 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client > has been closed (this is expected if the application is shutting down.) > 20/09/17 11:08:06 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 20/09/17 11:08:06 INFO MemoryStore: MemoryStore cleared > 20/09/17 11:08:06 INFO BlockManager: BlockManager stopped > 20/09/17 11:08:06 INFO BlockManagerMaster: BlockManagerMaster stopped > 20/09/17 11:08:06 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 20/09/17 11:08:06 INFO SparkContext: Successfully stopped SparkContext > Done, exiting Python > 20/09/17 11:08:07 INFO ShutdownHookManager: Shutdown hook called > 20/09/17 11:08:07 INFO ShutdownHookManager: Deleting directory > /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/spark-d5ac2f3e-fe8b-4122-8026-807d265f3a69/pyspark-62a6caeb-b2e5-4b8f-8eb3-e7b2c5fb155c > 20/09/17 11:08:07 INFO ShutdownHookManager: Deleting directory > /var/data/spark-7985c075-3b02-42ec-9111-cefba535adf0/spark-d5ac2f3e-fe8b-4122-8026-807d265f3a69 > 20/09/17 11:08:07 INFO ShutdownHookManager: Deleting directory > /tmp/spark-b74e6224-3fa7-40d2-abc4-6622bd524e65 > " did not contain "Received decommission executor message" The application > did not complete, did not find str Received decommission executor message. > (KubernetesSuite.scala:387) > Run completed in 12 minutes, 29 seconds. > Total number of tests run: 18 > Suites: completed 2, aborted 0 > Tests: succeeded 17, failed 1, canceled 0, ignored 0, pending 0 > *** 1 TEST FAILED *** > [INFO] > ------------------------------------------------------------------------ > [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT: > [INFO] > [INFO] Spark Project Parent POM ........................... SUCCESS [ 4.094 > s] > [INFO] Spark Project Tags ................................. SUCCESS [ 8.630 > s] > [INFO] Spark Project Local DB ............................. SUCCESS [ 4.062 > s] > [INFO] Spark Project Networking ........................... SUCCESS [ 5.891 > s] > [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 3.059 > s] > [INFO] Spark Project Unsafe ............................... SUCCESS [ 10.869 > s] > [INFO] Spark Project Launcher ............................. SUCCESS [ 3.432 > s] > [INFO] Spark Project Core ................................. SUCCESS [02:26 > min] > [INFO] Spark Project Kubernetes Integration Tests ......... FAILURE [15:13 > min] > [INFO] > ------------------------------------------------------------------------ > [INFO] BUILD FAILURE > [INFO] > ------------------------------------------------------------------------ > [INFO] Total time: 18:21 min > [INFO] Finished at: 2020-09-17T04:10:08-07:00 > [INFO] > ------------------------------------------------------------------------ > [ERROR] Failed to execute goal > org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project > spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1] > [ERROR] > [ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch. > [ERROR] Re-run Maven using the -X switch to enable full debug logging. > [ERROR] > [ERROR] For more information about the errors and possible solutions, please > read the following articles: > [ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException > [ERROR] > [ERROR] After correcting the problems, you can resume the build with the > command > [ERROR] mvn <args> -rf :spark-kubernetes-integration-tests_2.12 > + retcode3=1 > + kill -9 82255 > + minikube stop > : Stopping "minikube" in kvm2 ... > - "minikube" stopped. > /tmp/hudson6767824981271828433.sh: line 66: 82255 Killed > minikube mount ${PVC_TESTS_HOST_PATH}:${PVC_TESTS_VM_PATH} > --9p-version=9p2000.L --gid=0 --uid=185 > + [[ 1 = 0 ]] > + test_status=failure > + /home/jenkins/bin/post_github_pr_comment.py > Attempting to post to Github... > > Post successful. > + rm -rf /tmp/tmp.epTpFHp0Dl > + exit 1 > Build step 'Execute shell' marked build as failure > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org