Lin Yiqun created HDFS-9863: ------------------------------- Summary: DataNode doesn't log any shutdown info when the process of DataNode exiting Key: HDFS-9863 URL: https://issues.apache.org/jira/browse/HDFS-9863 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 2.7.1 Reporter: Lin Yiqun
One of my datanodes exited without any shutdown info. {code} 2016-02-25 14:46:00,183 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /XX.XX.2.82:10491, dest: /XX.XX.6.32:50010, bytes: 166, op: HDFS_WRITE, cliID: DFSClient_attempt_1451454983710_2070249_m_000004_0_-382493720_1, offset: 0, srvID: 57893f91-9a19-46f2-b68f-9dc599cd3988, blockid: BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224522_658031116, duration: 1216857 2016-02-25 14:46:00,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224522_658031116, type=LAST_IN_PIPELINE, downstreams=0:[] terminating 2016-02-25 14:46:00,279 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224536_658031130 src: /XX.XX.6.32:57648 dest: /XX.XX.6.32:50010 2016-02-25 14:46:00,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /XX.XX.6.32:57648, dest: /XX.XX.6.32:50010, bytes: 6584, op: HDFS_WRITE, cliID: DFSClient_NONMAPREDUCE_1835325650_67282592, offset: 0, srvID: 57893f91-9a19-46f2-b68f-9dc599cd3988, blockid: BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224536_658031130, duration: 1771477 2016-02-25 14:46:00,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder: BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224536_658031130, type=HAS_DOWNSTREAM_IN_PIPELINE terminating 2016-02-25 15:03:55,639 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: /************************************************************ STARTUP_MSG: Starting DataNode STARTUP_MSG: host = XX.XX6032/XX.XX.6.32 STARTUP_MSG: args = [] STARTUP_MSG: version = 2.7.1 {code} I think maybe full gc causes this problem, so I looked the datanode gc log. There is a cms gc but the time of this gc is after than restart datanode time. {code} 2016-02-25T15:03:57.930+0800: 2.756: [GC2016-02-25T15:03:57.930+0800: 2.756: [ParNew: 1677824K->24417K(1887488K), 0.0249280 secs] 1677824K->24417K(8178944K), 0.0251010 secs] [Times: user=0.24 sys=0.07, real=0.02 secs] 2016-02-25T15:12:46.498+0800: 531.324: [GC [1 CMS-initial-mark: 0K(6291456K)] 780481K(8178944K), 0.0554170 secs] [Times: user=0.06 sys=0.00, real=0.07 secs] 2016-02-25T15:12:46.567+0800: 531.393: [CMS-concurrent-mark-start] 2016-02-25T15:12:46.574+0800: 531.400: [CMS-concurrent-mark: 0.006/0.007 secs] [Times: user=0.07 sys=0.02, real=0.01 secs] 2016-02-25T15:12:46.574+0800: 531.400: [CMS-concurrent-preclean-start] 2016-02-25T15:12:46.589+0800: 531.415: [CMS-concurrent-preclean: 0.015/0.015 secs] [Times: user=0.16 sys=0.06, real=0.01 secs] 2016-02-25T15:12:46.589+0800: 531.415: [CMS-concurrent-abortable-preclean-start] 2016-02-25T15:12:47.364+0800: 532.190: [GC2016-02-25T15:12:47.364+0800: 532.190: [ParNew: 1702241K->40866K(1887488K), 0.0254400 secs] 1702241K->40866K(8178944K), 0.0256130 secs] [Times: user=0.34 sys=0.04, real=0.02 secs] 2016-02-25T15:12:48.546+0800: 533.372: [GC2016-02-25T15:12:48.546+0800: 533.372: [ParNew: 1718690K->72736K(1887488K), 0.0326820 secs] 1718690K->72736K(8178944K), 0.0328620 secs] [Times: user=0.62 sys=0.01, real=0.03 secs] 2016-02-25T15:12:49.084+0800: 533.910: [CMS-concurrent-abortable-preclean: 0.128/2.495 secs] [Times: user=9.53 sys=17.87, real=2.50 secs] 2016-02-25T15:12:49.084+0800: 533.910: [GC[YG occupancy: 856902 K (1887488 K)]2016-02-25T15:12:49.085+0800: 533.910: [Rescan (parallel) , 0.3019260 secs]2016-02-25T15:12:49.387+0800: 534.212: [weak refs processing, 0.0000420 secs]2016-02-25T15:12:49.387+0800: 534.212: [class unloading, 0.0015120 secs]2016-02-25T15:12:49.388+0800: 534.214: [scrub symbol table, 0.0021410 secs]2016-02-25T15:12:49.390+0800: 534.216: [scrub string table, 0.0003290 secs] [1 CMS-remark: 0K(6291456K)] 856902K(8178944K), 0.3064870 secs] [Times: user=6.49 sys=0.00, real=0.31 secs] 2016-02-25T15:12:49.391+0800: 534.217: [CMS-concurrent-sweep-start] 2016-02-25T15:12:49.396+0800: 534.222: [CMS-concurrent-sweep: 0.005/0.005 secs] [Times: user=0.03 sys=0.02, real=0.00 secs] 2016-02-25T15:12:49.396+0800: 534.222: [CMS-concurrent-reset-start] 2016-02-25T15:12:49.420+0800: 534.246: [CMS-concurrent-reset: 0.024/0.024 secs] [Times: user=0.10 sys=0.18, real=0.02 secs] 2016-02-25T15:12:50.013+0800: 534.839: [GC2016-02-25T15:12:50.013+0800: 534.839: [ParNew: 1750560K->98424K(1887488K), 0.0523470 secs] 1750560K->98424K(8178944K) icms_dc=5 , 0.0525090 secs] [Times: user=1.12 sys=0.02, real=0.05 secs] 2016-02-25T15:12:50.609+0800: 535.435: [GC [1 CMS-initial-mark: 0K(6291456K)] 897492K(8178944K), 0.3025000 secs] [Times: user=0.30 sys=0.00, real=0.31 secs] 2016-02-25T15:12:50.912+0800: 535.738: [CMS-concurrent-mark-start] 2016-02-25T15:12:50.916+0800: 535.742: [CMS-concurrent-mark: 0.004/0.004 secs] [Times: user=0.03 sys=0.03, real=0.00 secs] 2016-02-25T15:12:50.916+0800: 535.742: [CMS-concurrent-preclean-start] 2016-02-25T15:12:50.926+0800: 535.752: [CMS-concurrent-preclean: 0.011/0.011 secs] [Times: user=0.05 sys=0.08, real=0.01 secs] 2016-02-25T15:12:50.926+0800: 535.752: [CMS-concurrent-abortable-preclean-start] 2016-02-25T15:12:51.541+0800: 536.367: [GC2016-02-25T15:12:51.541+0800: 536.367: [ParNew: 1776248K->153122K(1887488K), 0.0840210 secs] 1776248K->153122K(8178944K) icms_dc=5 , 0.0841990 secs] [Times: user=1.82 sys=0.01, real=0.08 secs] {code} It seems this is not the main reason. Gc of time before datanode exiting seems normal. {code} 2016-02-25T14:38:10.101+0800: 5430962.154: [GC2016-02-25T14:38:10.101+0800: 5430962.154: [ParNew: 1680842K->2532K(1887488K), 0.0047330 secs] 2902596K->1224299K(8178944K) icms_dc=0 , 0.0048760 secs] [Times: user=0.09 sys=0.00, real=0.00 secs] 2016-02-25T14:39:43.301+0800: 5431055.354: [GC2016-02-25T14:39:43.301+0800: 5431055.354: [ParNew: 1680356K->9382K(1887488K), 0.0196580 secs] 2902123K->1231150K(8178944K) icms_dc=0 , 0.0198120 secs] [Times: user=0.19 sys=0.01, real=0.02 secs] 2016-02-25T14:39:54.301+0800: 5431066.353: [GC2016-02-25T14:39:54.301+0800: 5431066.353: [ParNew: 1687206K->8257K(1887488K), 0.0174490 secs] 2908974K->1230036K(8178944K) icms_dc=0 , 0.0176300 secs] [Times: user=0.18 sys=0.00, real=0.02 secs] 2016-02-25T14:40:43.231+0800: 5431115.284: [GC2016-02-25T14:40:43.231+0800: 5431115.284: [ParNew: 1686081K->8975K(1887488K), 0.0288890 secs] 2907860K->1230755K(8178944K) icms_dc=0 , 0.0291150 secs] [Times: user=0.29 sys=0.01, real=0.03 secs] 2016-02-25T14:45:39.743+0800: 5431411.796: [GC2016-02-25T14:45:39.743+0800: 5431411.796: [ParNew: 1686799K->22696K(1887488K), 0.0385700 secs] 2908579K->1244476K(8178944K) icms_dc=0 , 0.0388280 secs] [Times: user=0.23 sys=0.01, real=0.04 secs] {code} So it looks confusion. Attach the gc logs and datanode log. -- This message was sent by Atlassian JIRA (v6.3.4#6332)