Please provide your jstack info.



------------------ Original Message ------------------
From: "dhruve ashar" <dhruveas...@gmail.com>
Date: Wednesday, July 13, 2016, 3:53 PM
To: "Anton Sviridov" <keyn...@gmail.com>
Cc: "user" <user@spark.apache.org>
Subject: Re: Spark hangs at "Removed broadcast_*"



Looking at the jstack, it seems that it doesn't contain all of the threads; I 
cannot find the main thread in it.

I am not an expert at analyzing jstacks, but are you creating any threads in 
your code? If so, are you shutting them down correctly?


This one is a non-daemon thread and doesn't seem to be coming from Spark:
"Scheduler-2144644334" #110 prio=5 os_prio=0 tid=0x00007f8104001800 nid=0x715 
waiting on condition [0x00007f812cf95000]
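A stray non-daemon thread like this is enough to keep the driver JVM alive after main() returns. As an illustration only (the thread name and schedule here are hypothetical, not taken from the dump), this is the usual pattern for making a scheduler thread daemon, or shutting it down explicitly:

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

// A plain newSingleThreadScheduledExecutor creates a NON-daemon thread,
// which keeps the JVM alive after main() returns unless it is shut down.
val daemonFactory = new ThreadFactory {
  override def newThread(r: Runnable): Thread = {
    val t = new Thread(r, "Scheduler") // hypothetical name, echoing the dump
    t.setDaemon(true) // the JVM may now exit even while this thread is parked
    t
  }
}

val scheduler = Executors.newSingleThreadScheduledExecutor(daemonFactory)
scheduler.scheduleAtFixedRate(() => println("tick"), 0, 1, TimeUnit.SECONDS)

// ... application work ...

scheduler.shutdown() // explicit shutdown is the safest option either way
```

Either the daemon flag or the explicit shutdown() on its own would unblock JVM exit; doing both is belt-and-braces.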



Also, does the shutdown hook get called? 
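One way to check is to log from a hook registered via the standard library; if the message never appears, the JVM never began shutting down, i.e. something is still keeping it alive (a sketch, not taken from the original code):

```scala
// Register a JVM shutdown hook; it runs when the JVM begins terminating.
val hook = sys.addShutdownHook {
  // Absence of this line in the driver log means shutdown never started.
  println("shutdown hook invoked")
}
```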




On Tue, Jul 12, 2016 at 2:35 AM, Anton Sviridov <keyn...@gmail.com> wrote:
Hi.

Here's the last few lines before it starts removing broadcasts:


16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_201607111123_0009_m_003209_20886' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003209
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_201607111123_0009_m_003209_20886: Committed
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3211.0 in stage 9.0 (TID 
20888) in 95 ms on localhost (3209/3214)
16/07/11 14:02:11 INFO Executor: Finished task 3209.0 in stage 9.0 (TID 20886). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3209.0 in stage 9.0 (TID 
20886) in 103 ms on localhost (3210/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_201607111123_0009_m_003208_20885' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003208
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_201607111123_0009_m_003208_20885: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3208.0 in stage 9.0 (TID 20885). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3208.0 in stage 9.0 (TID 
20885) in 109 ms on localhost (3211/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_201607111123_0009_m_003212_20889' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003212
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_201607111123_0009_m_003212_20889: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3212.0 in stage 9.0 (TID 20889). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3212.0 in stage 9.0 (TID 
20889) in 84 ms on localhost (3212/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_201607111123_0009_m_003210_20887' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003210
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_201607111123_0009_m_003210_20887: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3210.0 in stage 9.0 (TID 20887). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3210.0 in stage 9.0 (TID 
20887) in 100 ms on localhost (3213/3214)
16/07/11 14:02:11 INFO FileOutputCommitter: File Output Committer Algorithm 
version is 1
16/07/11 14:02:11 INFO FileOutputCommitter: Saved output of task 
'attempt_201607111123_0009_m_003213_20890' to 
file:/mnt/rendang/cache-main/RunWikistatsSFCounts727fc9d635f25d0922984e59a0d18fdd/stats/sf_counts/_temporary/0/task_201607111123_0009_m_003213
16/07/11 14:02:11 INFO SparkHadoopMapRedUtil: 
attempt_201607111123_0009_m_003213_20890: Committed
16/07/11 14:02:11 INFO Executor: Finished task 3213.0 in stage 9.0 (TID 20890). 
1721 bytes result sent to driver
16/07/11 14:02:11 INFO TaskSetManager: Finished task 3213.0 in stage 9.0 (TID 
20890) in 82 ms on localhost (3214/3214)
16/07/11 14:02:11 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have 
all completed, from pool
16/07/11 14:02:11 INFO DAGScheduler: ResultStage 9 (saveAsTextFile at 
SfCountsDumper.scala:13) finished in 42.294 s
16/07/11 14:02:11 INFO DAGScheduler: Job 1 finished: saveAsTextFile at 
SfCountsDumper.scala:13, took 9517.124624 s
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_0_piece0 on 
10.101.230.154:35192 in memory (size: 15.8 KB, free: 37.1 GB)
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 7
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 6
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 5
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 4
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 3
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 2
16/07/11 14:28:46 INFO ContextCleaner: Cleaned shuffle 1
16/07/11 14:28:46 INFO BlockManager: Removing RDD 14
16/07/11 14:28:46 INFO ContextCleaner: Cleaned RDD 14
16/07/11 14:28:46 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 
10.101.230.154:35192 in memory (size: 25.5 KB, free: 37.1 GB)

...


In fact, the job is still running: the Spark UI shows an uptime of 20.6 hours, 
with the last job having finished at least 18 hours ago.
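For what it's worth, one common mitigation for a spark-submit process that never exits after the last job is to stop the context explicitly at the end of the driver's main(). This is a sketch assuming an active SparkContext `sc`; `rdd` and `outputPath` are hypothetical:

```scala
// End of a driver main(): make sure the context is stopped even on failure.
try {
  rdd.saveAsTextFile(outputPath) // hypothetical final action
} finally {
  sc.stop() // releases Spark's own threads and resources
}
// If a stray non-daemon thread still keeps the JVM alive after sc.stop(),
// System.exit(0) forces termination (registered shutdown hooks still run).
```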


On Mon, 11 Jul 2016 at 23:23 dhruve ashar <dhruveas...@gmail.com> wrote:

Hi, 

Can you check from the logs the time when the job actually finished? The logs 
provided are too short and do not reveal meaningful information.


  


On Mon, Jul 11, 2016 at 9:50 AM, velvetbaldmime <keyn...@gmail.com> wrote:
Spark 2.0.0-preview
 
 We've got an app that uses a fairly big broadcast variable. We run this on a
 big EC2 instance, so deployment is in client mode. The broadcast variable is a
 massive Map[String, Array[String]].
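For reference, the usual lifecycle of such a broadcast variable looks like the sketch below, assuming an active SparkContext `sc`; `loadLookupTable` is a hypothetical loader, not from the original app:

```scala
// Broadcast a large lookup table once, use it in tasks, then release it.
val bigMap: Map[String, Array[String]] = loadLookupTable() // hypothetical
val bc = sc.broadcast(bigMap)

// ... reference bc.value inside transformations ...

bc.unpersist(blocking = true) // drop cached copies on the executors
bc.destroy()                  // release it entirely once no longer needed
```

Releasing the variable explicitly rather than waiting for the ContextCleaner can make the teardown phase more deterministic.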
 
 At the end of saveAsTextFile, the output in the folder seems to be complete
 and correct (apart from the .crc files still being there), BUT the spark-submit
 process is stuck on, seemingly, removing the broadcast variable. The stuck
 logs look like this: http://pastebin.com/wpTqvArY
 
 My last run lasted for 12 hours after doing saveAsTextFile, just sitting
 there. I did a jstack on the driver process; most threads are parked:
 http://pastebin.com/E29JKVT7
 
 Full story: we used this code with Spark 1.5.0 and it worked, but then the
 data changed and something stopped fitting into Kryo's serialisation buffer.
 Increasing it didn't help, so I had to disable the KryoSerializer. Tested it
 again: it hung. Switched to 2.0.0-preview, and it seems to be the same issue.
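For anyone hitting the same buffer limit, the relevant settings can be raised instead of disabling Kryo entirely. The property names below are real Spark configuration keys; the sizes are purely illustrative and no test is attached since this is a config fragment:

```scala
import org.apache.spark.SparkConf

// Keep Kryo but give it room for large objects such as a big broadcast Map.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "1m")       // initial per-worker buffer
  .set("spark.kryoserializer.buffer.max", "512m") // hard cap; must fit the largest serialized object
```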
 
 I'm not quite sure what's even going on, given that there's almost no CPU
 activity and no output in the logs, yet the output is not finalised like it
 used to be before.
 
 Would appreciate any help, thanks
 
 
 
 --
 View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-hangs-at-Removed-broadcast-tp27320.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.
 
 ---------------------------------------------------------------------
 To unsubscribe e-mail: user-unsubscr...@spark.apache.org
 
 





-- 
-Dhruve Ashar



 

 





