[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267257#comment-13267257 ]

Roman K commented on GIRAPH-169:

I successfully reproduced the problem even in the simpler case of a single worker in a pseudo-distributed environment:

    hadoop jar giraph-0.2-SNAPSHOT-jar-with-dependencies.jar org.apache.giraph.benchmark.PageRankBenchmark -e 1 -s 10 -v -V 1000 -w 1

I took a full thread dump of the hung child process using jstack (this is the meaningful part, without the GC threads) but haven't managed to figure out the problem yet:

pool-1-thread-1 prio=10 tid=0x7f0398539000 nid=0x2218 waiting on condition [0x7f0356d87000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for 0xfe1613a8 (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)
        at java.lang.Thread.run(Thread.java:662)

pool-2-thread-1 prio=10 tid=0x7f03984ed000 nid=0x2213 runnable [0x7f035728c000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:210)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
        - locked 0xfe1880f0 (a sun.nio.ch.Util$2)
        - locked 0xfe188100 (a java.util.Collections$UnmodifiableSet)
        - locked 0xfe1880a8 (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:84)
        at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:333)
        - locked 0xfe188110 (a org.apache.hadoop.ipc.Server$Listener$Reader)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)

LeaseChecker daemon prio=10 tid=0x7f039847a800 nid=0x21fa waiting on condition [0x7f035758f000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.hdfs.DFSClient$LeaseChecker.run(DFSClient.java:1376)
        at java.lang.Thread.run(Thread.java:662)

Thread for syncLogs daemon prio=10 tid=0x7f0398479000 nid=0x21eb waiting on condition [0x7f0357b9a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.mapred.Child$3.run(Child.java:139)

Low Memory Detector daemon prio=10 tid=0x7f039809c000 nid=0x21e2 runnable [0x]
   java.lang.Thread.State: RUNNABLE

C2 CompilerThread1 daemon prio=10 tid=0x7f0398099800 nid=0x21e1 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

C2 CompilerThread0 daemon prio=10 tid=0x7f0398096800 nid=0x21e0 waiting on condition [0x]
   java.lang.Thread.State: RUNNABLE

Signal Dispatcher daemon prio=10 tid=0x7f0398094800 nid=0x21df runnable [0x]
   java.lang.Thread.State: RUNNABLE

Finalizer daemon prio=10 tid=0x7f0398078000 nid=0x21de in Object.wait() [0x7f0394af9000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0xfe158540 (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked 0xfe158540 (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

Reference Handler daemon prio=10 tid=0x7f0398076000 nid=0x21dd in Object.wait() [0x7f0394bfa000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on 0xfe160070 (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked 0xfe160070 (a java.lang.ref.Reference$Lock)
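
In the dump above, pool-1-thread-1 and pool-2-thread-1 are the only threads not marked daemon, and pool-1-thread-1 is parked in ThreadPoolExecutor.getTask waiting for work. As general JVM behaviour (the thread does not confirm this as the cause here), an idle non-daemon executor thread keeps the process alive until the pool is shut down. The sketch below illustrates that with a standalone pool; the class name, method names and thread name are hypothetical and not taken from Giraph or Hadoop, and it uses lambdas, so it needs a newer JDK than the 1.6 in the report.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

final class ExecutorShutdownSketch {

    // Hypothetical thread name, chosen so the worker is easy to spot in a dump.
    static final String WORKER_NAME = "sketch-pool-worker";

    /** Builds a single-threaded pool whose worker is a non-daemon thread. */
    static ExecutorService newPool() {
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, WORKER_NAME);
            t.setDaemon(false); // non-daemon: the JVM waits for this thread
            return t;
        };
        return Executors.newSingleThreadExecutor(factory);
    }

    /** Counts live non-daemon threads carrying the sketch's worker name. */
    static long liveWorkers() {
        return Thread.getAllStackTraces().keySet().stream()
                .filter(t -> t.isAlive() && !t.isDaemon())
                .filter(t -> WORKER_NAME.equals(t.getName()))
                .count();
    }

    /** Shuts the pool down and reports whether it terminated in time. */
    static boolean shutdownAndWait(ExecutorService pool, long timeoutMs)
            throws InterruptedException {
        pool.shutdown(); // idle workers stop waiting for tasks and exit
        return pool.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = newPool();
        pool.submit(() -> { }).get(); // run one task so the worker thread exists
        // Without the shutdown below, the idle worker would keep the JVM running.
        System.out.println("live workers before shutdown: " + liveWorkers());
        System.out.println("terminated: " + shutdownAndWait(pool, 5000));
    }
}
```

Commenting out the shutdownAndWait call and taking a jstack dump of the resulting process should show the worker parked in the same getTask/LinkedBlockingQueue.take frames as pool-1-thread-1.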
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13267419#comment-13267419 ]

Roman K commented on GIRAPH-169:

I have recompiled it with mvn -Phadoop 1.0 but the issue described above is still present. Additionally, I can see that the hung child VM is re-used by Hadoop on the next evaluation.

How to close all child when a job finished?
---
Key: GIRAPH-169
URL: https://issues.apache.org/jira/browse/GIRAPH-169
Project: Giraph
Issue Type: Improvement
Components: mapreduce
Affects Versions: 0.2.0
Environment: sles 11 x64, jdk 1.6, hadoop 0.20.205.0, 1 Master and 8 slaves
Reporter: Jianfeng Qian
Priority: Minor

I ran pagerank at hadoop 0.20.205.0. When the job finished, the child in slaves didn't quit immediately and sometimes they never quit and I have to kill them.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13245488#comment-13245488 ]

Avery Ching commented on GIRAPH-169:

This is a simple case. I'll try and see if I can replicate it sometime this week. Feel free to bug me if I forget. =)
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240234#comment-13240234 ]

Avery Ching commented on GIRAPH-169:

Looks like the worker log got cut off? Also, what version of Hadoop is this? Does it work with different numbers of workers?
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240285#comment-13240285 ]

Jianfeng Qian commented on GIRAPH-169:

Hadoop 0.20.205.0. Most of the time, the worker can't quit. Sorry, here is the full worker log:

2012-03-28 10:18:00,122 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-03-28 10:18:00,387 WARN org.apache.giraph.bsp.BspOutputFormat: getOutputCommitter: Returning ImmutableOutputCommiter (does nothing).
2012-03-28 10:18:00,397 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0
2012-03-28 10:18:00,405 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@18330bf
2012-03-28 10:18:00,489 INFO org.apache.giraph.graph.GraphMapper: Distributed cache is empty. Assuming fatjar.
2012-03-28 10:18:00,489 INFO org.apache.giraph.graph.GraphMapper: setup: classpath @ /usr/local/test-0302/hadoop-data/h-0.20.205/mapred/local/taskTracker/root/jobcache/job_201203281017_0001/jars/job.jar
2012-03-28 10:18:00,498 INFO org.apache.giraph.zk.ZooKeeperManager: createCandidateStamp: Made the directory _bsp/_defaultZkManagerDir/job_201203281017_0001
2012-03-28 10:18:00,500 INFO org.apache.giraph.zk.ZooKeeperManager: createCandidateStamp: Creating my filestamp _bsp/_defaultZkManagerDir/job_201203281017_0001/_task/tmm-e6 1
2012-03-28 10:18:00,521 INFO org.apache.giraph.zk.ZooKeeperManager: getZooKeeperServerList: For task 1, got file 'zkServerList_tmm-e10 0 ' (polling period is 3000)
2012-03-28 10:18:00,521 INFO org.apache.giraph.zk.ZooKeeperManager: getZooKeeperServerList: Found [tmm-e10, 0] 2 hosts in filename 'zkServerList_tmm-e10 0 '
2012-03-28 10:18:00,524 INFO org.apache.giraph.zk.ZooKeeperManager: onlineZooKeeperServers: Got [tmm-e10] 1 hosts from 1 ready servers when 1 required (polling period is 3000) on attempt 0
2012-03-28 10:18:00,524 INFO org.apache.giraph.graph.GraphMapper: setup: Starting up BspServiceWorker...
2012-03-28 10:18:00,534 INFO org.apache.giraph.graph.BspService: BspService: Connecting to ZooKeeper with job job_201203281017_0001, 1 on tmm-e10:22181
2012-03-28 10:18:00,540 INFO org.apache.zookeeper.ZooKeeper: Client environment:zookeeper.version=3.3.3-1073969, built on 02/23/2011 22:27 GMT
2012-03-28 10:18:00,540 INFO org.apache.zookeeper.ZooKeeper: Client environment:host.name=tmm-e6
2012-03-28 10:18:00,540 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.version=1.6.0_22
2012-03-28 10:18:00,540 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.vendor=Sun Microsystems Inc.
2012-03-28 10:18:00,540 INFO org.apache.zookeeper.ZooKeeper: Client environment:java.home=/usr/local/java/jdk1.6.0_22/jre
2012-03-28 10:18:00,540 INFO org.apache.zookeeper.ZooKeeper: Client
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240106#comment-13240106 ]

Jianfeng Qian commented on GIRAPH-169:

I ran pagerank many times to evaluate the performance of Giraph and Hama, for example:

    /usr/local/test-0302/hadoop-0.20.205.0/bin/hadoop jar /usr/local/test-0302/hadoop-0.20.205.0/lib/giraph-0.2-all.jar org.apache.giraph.benchmark.PageRankBenchmark -e 4 -s 5 -v -V 1024 -w 64 -c 1

When the job finishes, I run jps and always find that the Child processes on the slaves have not quit:

tmm-e1
22212 SecondaryNameNode
22009 NameNode
23336 Jps
22338 JobTracker

tmm-e2
5840 Child
4863 DataNode
5724 Child
7029 Jps
5001 TaskTracker
5492 Child
5259 Child
5376 Child
5608 Child
5143 Child

tmm-e3
..
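
The check described above (scanning jps output for leftover Child processes) can be sketched as a small parser. This is only an illustration: the class and method names are hypothetical, and it assumes the default jps output of one "<pid> <name>" pair per line, as in the listing for tmm-e2.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class JpsChildSketch {

    /** Returns the PIDs of all jps entries whose process name equals the given name. */
    static List<Integer> pidsNamed(List<String> jpsLines, String name) {
        List<Integer> pids = new ArrayList<>();
        for (String line : jpsLines) {
            String[] parts = line.trim().split("\\s+");
            // Skip host headers such as "tmm-e2" and anything not shaped "<pid> <name>".
            if (parts.length != 2 || !parts[1].equals(name)) {
                continue;
            }
            try {
                pids.add(Integer.parseInt(parts[0]));
            } catch (NumberFormatException e) {
                // First column was not a PID; ignore the line.
            }
        }
        return pids;
    }

    public static void main(String[] args) {
        // The tmm-e2 listing from the comment above.
        List<String> lines = Arrays.asList(
                "tmm-e2", "5840 Child", "4863 DataNode", "5724 Child", "7029 Jps",
                "5001 TaskTracker", "5492 Child", "5259 Child", "5376 Child",
                "5608 Child", "5143 Child");
        // Prints [5840, 5724, 5492, 5259, 5376, 5608, 5143]
        System.out.println(pidsNamed(lines, "Child"));
    }
}
```

The resulting PIDs are the ones that would be passed to jstack to see why a child is still running.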
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240112#comment-13240112 ]

Avery Ching commented on GIRAPH-169:

How many task trackers do you have? Are you seeing any errors? Is the job completing successfully? I'm guessing that the job isn't completing successfully, since everything should be cleaned up.
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240116#comment-13240116 ]

Jianfeng Qian commented on GIRAPH-169:

I have a master and 9 slaves, and each slave has 9 mappers. The console output is as follows:

/usr/local/test-0302/hadoop-0.20.205.0/bin/hadoop jar /usr/local/test-0302/hadoop-0.20.205.0/lib/giraph-0.2-all.jar org.apache.giraph.benchmark.PageRankBenchmark -e 16 -s 5 -v -V 100 -w 64 -c 1
12/03/28 10:20:43 INFO benchmark.PageRankBenchmark: Using class org.apache.giraph.benchmark.PageRankBenchmark
12/03/28 10:20:43 WARN bsp.BspOutputFormat: checkOutputSpecs: ImmutableOutputCommiter will not check anything
12/03/28 10:20:44 INFO mapred.JobClient: Running job: job_201203281017_0001
12/03/28 10:20:45 INFO mapred.JobClient:  map 0% reduce 0%
12/03/28 10:21:03 INFO mapred.JobClient:  map 1% reduce 0%
12/03/28 10:21:36 INFO mapred.JobClient:  map 100% reduce 0%
12/03/28 10:22:11 INFO mapred.JobClient: Job complete: job_201203281017_0001
12/03/28 10:22:11 INFO mapred.JobClient: Counters: 37
12/03/28 10:22:11 INFO mapred.JobClient:   Job Counters
12/03/28 10:22:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=3600296
12/03/28 10:22:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/28 10:22:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
12/03/28 10:22:11 INFO mapred.JobClient:     Launched map tasks=65
12/03/28 10:22:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=0
12/03/28 10:22:11 INFO mapred.JobClient:   Giraph Timers
12/03/28 10:22:11 INFO mapred.JobClient:     Total (milliseconds)=64352
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep 3 (milliseconds)=5068
12/03/28 10:22:11 INFO mapred.JobClient:     Setup (milliseconds)=20768
12/03/28 10:22:11 INFO mapred.JobClient:     Shutdown (milliseconds)=4826
12/03/28 10:22:11 INFO mapred.JobClient:     Vertex input superstep (milliseconds)=10669
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep 0 (milliseconds)=6173
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep 4 (milliseconds)=5056
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep 5 (milliseconds)=806
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep 2 (milliseconds)=5265
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep 1 (milliseconds)=5717
12/03/28 10:22:11 INFO mapred.JobClient:   Giraph Stats
12/03/28 10:22:11 INFO mapred.JobClient:     Aggregate edges=1600
12/03/28 10:22:11 INFO mapred.JobClient:     Superstep=6
12/03/28 10:22:11 INFO mapred.JobClient:     Last checkpointed superstep=0
12/03/28 10:22:11 INFO mapred.JobClient:     Current workers=64
12/03/28 10:22:11 INFO mapred.JobClient:     Current master task partition=0
12/03/28 10:22:11 INFO mapred.JobClient:     Sent messages=0
12/03/28 10:22:11 INFO mapred.JobClient:     Aggregate finished vertices=100
12/03/28 10:22:11 INFO mapred.JobClient:     Aggregate vertices=100
12/03/28 10:22:11 INFO mapred.JobClient:   File Output Format Counters
12/03/28 10:22:11 INFO mapred.JobClient:     Bytes Written=0
12/03/28 10:22:11 INFO mapred.JobClient:   FileSystemCounters
12/03/28 10:22:11 INFO mapred.JobClient:     FILE_BYTES_READ=7552
12/03/28 10:22:11 INFO mapred.JobClient:     HDFS_BYTES_READ=2860
12/03/28 10:22:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=1425440
12/03/28 10:22:11 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=282051638
12/03/28 10:22:11 INFO mapred.JobClient:   File Input Format Counters
12/03/28 10:22:11 INFO mapred.JobClient:     Bytes Read=0
12/03/28 10:22:11 INFO mapred.JobClient:   Map-Reduce Framework
12/03/28 10:22:11 INFO mapred.JobClient:     Map input records=65
12/03/28 10:22:11 INFO mapred.JobClient:     Physical memory (bytes) snapshot=40195567616
12/03/28 10:22:11 INFO mapred.JobClient:     Spilled Records=0
12/03/28 10:22:11 INFO mapred.JobClient:     CPU time spent (ms)=1544910
12/03/28 10:22:11 INFO mapred.JobClient:     Total committed heap usage (bytes)=43066982400
12/03/28 10:22:11 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=170265358336
12/03/28 10:22:11 INFO mapred.JobClient:     Map output records=0
12/03/28 10:22:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=2860

The job information is as follows:

Completed Jobs
Jobid: job_201203281017_0001
Priority: NORMAL
User: root
Name: org.apache.giraph.benchmark.PageRankBenchmark
Map % Complete: 100.00%
Map Total: 65
Maps Completed: 65
Reduce % Complete: 100.00%
Reduce Total: 0
Reduces Completed: 0
Job Scheduling Information: NA
Diagnostic Info: NA

But none of the children quit.
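
The Giraph Timers in the console output above can be cross-checked mechanically. The sketch below parses "name=value" counters out of JobClient lines and sums the per-phase timers; the class and method names are hypothetical, and the assumption that every counter line ends in "=<digits>" is taken only from the output shown here, not from any Hadoop specification.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class JobClientCounterSketch {

    // Matches the tail of a line such as
    // "12/03/28 10:22:11 INFO mapred.JobClient: Setup (milliseconds)=20768".
    private static final Pattern COUNTER =
            Pattern.compile("mapred\\.JobClient:\\s*(.+?)=(\\d+)\\s*$");

    private static final String TOTAL = "Total (milliseconds)";

    /** Extracts counter name/value pairs; lines without "=<digits>" are skipped. */
    static Map<String, Long> parse(List<String> lines) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    /** Sums every "(milliseconds)" timer except the reported total. */
    static long sumPhaseTimers(Map<String, Long> counters) {
        long sum = 0;
        for (Map.Entry<String, Long> e : counters.entrySet()) {
            if (e.getKey().endsWith("(milliseconds)") && !e.getKey().equals(TOTAL)) {
                sum += e.getValue();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        String p = "12/03/28 10:22:11 INFO mapred.JobClient: ";
        // Timer values copied from the console output above.
        Map<String, Long> c = parse(Arrays.asList(
                p + "Total (milliseconds)=64352",
                p + "Superstep 3 (milliseconds)=5068",
                p + "Setup (milliseconds)=20768",
                p + "Shutdown (milliseconds)=4826",
                p + "Vertex input superstep (milliseconds)=10669",
                p + "Superstep 0 (milliseconds)=6173",
                p + "Superstep 4 (milliseconds)=5056",
                p + "Superstep 5 (milliseconds)=806",
                p + "Superstep 2 (milliseconds)=5265",
                p + "Superstep 1 (milliseconds)=5717"));
        // Prints: phases=64348 reported total=64352
        System.out.println("phases=" + sumPhaseTimers(c) + " reported total=" + c.get(TOTAL));
    }
}
```

The phases, including the 4826 ms Shutdown timer, account for all but 4 ms of the reported total, which is consistent with the job itself having run to completion.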
[jira] [Commented] (GIRAPH-169) How to close all child when a job finished?
[ https://issues.apache.org/jira/browse/GIRAPH-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13240161#comment-13240161 ]

Avery Ching commented on GIRAPH-169:

Do you have the logs of the workers? I'd like to see why they can't exit.