[ 
https://issues.apache.org/jira/browse/SPARK-37976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-37976:
---------------------------------
    Component/s: YARN

> All tasks finish but spark job does not conclude. Forever waits for [DONE]
> --------------------------------------------------------------------------
>
>                 Key: SPARK-37976
>                 URL: https://issues.apache.org/jira/browse/SPARK-37976
>             Project: Spark
>          Issue Type: Question
>          Components: PySpark, YARN
>    Affects Versions: 3.1.0
>            Reporter: Nitin Siwach
>            Priority: Major
>
> I am using the following command to submit a Spark job:
> ```
> spark-submit --master yarn \
>   --conf "spark.jars.packages=com.microsoft.azure:synapseml_2.12:0.9.4" \
>   --conf "spark.jars.repositories=https://mmlspark.azureedge.net/maven" \
>   --conf "spark.hadoop.fs.gs.implicit.dir.repair.enable=false" \
>   --py-files=gs://monsoon-credittech.appspot.com/monsoon_spark/src.zip \
>   gs://monsoon-credittech.appspot.com/monsoon_spark/custom_estimator.py \
>   train_evaluate_cv \
>   --data-path gs://monsoon-credittech.appspot.com/mar19/training_data.csv \
>   --index hash_CR_ACCOUNT_NBR --label flag__6_months --save-results \
>   --nrows 100000 --evaluate --run-id run_ss_02
> ```
> Everything in the code finishes. The very last line of my code is 
> {{print('savedddd'); print(scores)}}, and it executes as well. Activity on 
> all nodes drops to 0, yet the job does not conclude. The last thing my shell 
> prints is
> ```22/01/13 19:29:15 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@a69cfdd{HTTP/1.1, (http/1.1)}{0.0.0.0:0}```
> and that's it. The job keeps showing as running, and I have to cancel it 
> manually.
> Providing the output of ```jstack``` in the hope that it helps:
>  
>  
> {{Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed mode):
>
> "DestroyJavaVM" #657 prio=5 os_prio=0 tid=0x00007f9bd8013800 nid=0x3a83 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "pool-42-thread-1" #360 prio=5 os_prio=0 tid=0x00007f9b90582000 nid=0x3eb2 waiting on condition [0x00007f9b6b24b000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000094d52ac0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> "yarn-scheduler-endpoint" #261 daemon prio=5 os_prio=0 tid=0x00007f9bd9d61000 nid=0x3dab waiting on condition [0x00007f9b6f185000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0000000088bf5dd0> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>
> "client DomainSocketWatcher" #54 daemon prio=5 os_prio=0 tid=0x00007f9b919d1800 nid=0x3ae3 runnable [0x00007f9b84ee1000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.doPoll0(Native Method)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher.access$900(DomainSocketWatcher.java:52)
>         at org.apache.hadoop.net.unix.DomainSocketWatcher$2.run(DomainSocketWatcher.java:503)
>         at java.lang.Thread.run(Thread.java:748)
>
> "Service Thread" #7 daemon prio=9 os_prio=0 tid=0x00007f9bd80cb800 nid=0x3a8c runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "C1 CompilerThread1" #6 daemon prio=9 os_prio=0 tid=0x00007f9bd80c7000 nid=0x3a8b waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "C2 CompilerThread0" #5 daemon prio=9 os_prio=0 tid=0x00007f9bd80c4000 nid=0x3a8a waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" #4 daemon prio=9 os_prio=0 tid=0x00007f9bd80c1000 nid=0x3a89 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f9bd808b800 nid=0x3a88 in Object.wait() [0x00007f9bc5e87000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:144)
>         - locked <0x00000000881de630> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:165)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:216)
>
> "Reference Handler" #2 daemon prio=10 os_prio=0 tid=0x00007f9bd8087000 nid=0x3a87 in Object.wait() [0x00007f9bc5f88000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         at java.lang.Object.wait(Object.java:502)
>         at java.lang.ref.Reference.tryHandlePending(Reference.java:191)
>         - locked <0x00000000881de800> (a java.lang.ref.Reference$Lock)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:153)
>
> "VM Thread" os_prio=0 tid=0x00007f9bd807d800 nid=0x3a86 runnable
>
> "GC task thread#0 (ParallelGC)" os_prio=0 tid=0x00007f9bd8029000 nid=0x3a84 runnable
>
> "GC task thread#1 (ParallelGC)" os_prio=0 tid=0x00007f9bd802a800 nid=0x3a85 runnable
>
> "VM Periodic Task Thread" os_prio=0 tid=0x00007f9bd80ce800 nid=0x3a8d waiting on condition
>
> JNI global references: 7697
>
> Heap
>  PSYoungGen      total 542720K, used 445168K [0x00000000d8000000, 0x00000000ff300000, 0x0000000100000000)
>   eden space 439296K, 93% used [0x00000000d8000000,0x00000000f0f809d0,0x00000000f2d00000)
>   from space 103424K, 34% used [0x00000000f8e00000,0x00000000fb13b870,0x00000000ff300000)
>   to   space 99328K, 0% used [0x00000000f2d00000,0x00000000f2d00000,0x00000000f8e00000)
>  ParOldGen       total 542208K, used 248894K [0x0000000088000000, 0x00000000a9180000, 0x00000000d8000000)
>   object space 542208K, 45% used [0x0000000088000000,0x000000009730f8c8,0x00000000a9180000)
>  Metaspace       used 145492K, capacity 160872K, committed 161152K, reserved 1189888K
>   class space    used 19087K, capacity 20303K, committed 20352K, reserved 1048576K
>
> 2022-01-20 19:48:51
> Full thread dump OpenJDK 64-Bit Server VM (25.292-b10 mixed mode):}}
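
One way to read a dump like the one quoted above: a JVM exits only once every non-daemon thread has finished, so it can help to list the threads whose header lacks the `daemon` flag. The following is a minimal Python sketch under that assumption, not a diagnosis confirmed by the report. The helper name `non_daemon_threads` and the `SAMPLE` excerpt (thread headers copied from the dump, stack frames omitted) are illustrative only.

```python
import re

# Matches a jstack thread header such as:
#   "pool-42-thread-1" #360 prio=5 os_prio=0 tid=... nid=...
# JVM-internal threads ("VM Thread", GC task threads) carry no "#<id>"
# after the quoted name, so this pattern skips them.
HEADER = re.compile(
    r'"(?P<name>[^"]+)"\s+#(?P<id>\d+)\s+(?P<daemon>daemon\s+)?prio=\d+'
)


def non_daemon_threads(dump: str) -> list[str]:
    """Return the names of Java threads in `dump` not marked as daemon."""
    return [
        m.group("name")
        for m in HEADER.finditer(dump)
        if not m.group("daemon")
    ]


# Hypothetical input: header lines taken from the dump in the report.
SAMPLE = '''\
"DestroyJavaVM" #657 prio=5 os_prio=0 tid=0x00007f9bd8013800 nid=0x3a83 waiting on condition [0x0000000000000000]
"pool-42-thread-1" #360 prio=5 os_prio=0 tid=0x00007f9b90582000 nid=0x3eb2 waiting on condition [0x00007f9b6b24b000]
"yarn-scheduler-endpoint" #261 daemon prio=5 os_prio=0 tid=0x00007f9bd9d61000 nid=0x3dab waiting on condition [0x00007f9b6f185000]
"Finalizer" #3 daemon prio=8 os_prio=0 tid=0x00007f9bd808b800 nid=0x3a88 in Object.wait() [0x00007f9bc5e87000]
"VM Thread" os_prio=0 tid=0x00007f9bd807d800 nid=0x3a86 runnable
'''

if __name__ == "__main__":
    print(non_daemon_threads(SAMPLE))
```

On this excerpt the sketch reports `DestroyJavaVM` and `pool-42-thread-1`. `DestroyJavaVM` is the thread that waits for the remaining non-daemon threads, so the idle `pool-42-thread-1` executor worker is a candidate for what keeps the process alive. The dump does not show which library created that pool.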


