Lin Yiqun created HDFS-9863:
-------------------------------

             Summary: DataNode doesn't log any shutdown info when the process 
of DataNode exiting
                 Key: HDFS-9863
                 URL: https://issues.apache.org/jira/browse/HDFS-9863
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Lin Yiqun


One of my datanodes exited without any shutdown info. 
{code}
2016-02-25 14:46:00,183 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/XX.XX.2.82:10491, dest: /XX.XX.6.32:50010, bytes: 166, op: HDFS_WRITE, cliID: 
DFSClient_attempt_1451454983710_2070249_m_000004_0_-382493720_1, offset: 0, 
srvID: 57893f91-9a19-46f2-b68f-9dc599cd3988, blockid: 
BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224522_658031116, duration: 
1216857
2016-02-25 14:46:00,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: 
BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224522_658031116, 
type=LAST_IN_PIPELINE, downstreams=0:[] terminating
2016-02-25 14:46:00,279 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Receiving BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224536_658031130 src: 
/XX.XX.6.32:57648 dest: /XX.XX.6.32:50010
2016-02-25 14:46:00,283 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/XX.XX.6.32:57648, dest: /XX.XX.6.32:50010, bytes: 6584, op: HDFS_WRITE, cliID: 
DFSClient_NONMAPREDUCE_1835325650_67282592, offset: 0, srvID: 
57893f91-9a19-46f2-b68f-9dc599cd3988, blockid: 
BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224536_658031130, duration: 
1771477
2016-02-25 14:46:00,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
PacketResponder: 
BP-1942012336-XX.XX.2.191-1406726500544:blk_1730224536_658031130, 
type=HAS_DOWNSTREAM_IN_PIPELINE terminating
2016-02-25 15:03:55,639 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = XX.XX6032/XX.XX.6.32
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 2.7.1
{code}
I think maybe full gc causes this problem, so I looked the datanode gc log. 
There is a cms gc but the time of this gc is after than restart datanode time. 
{code}
2016-02-25T15:03:57.930+0800: 2.756: [GC2016-02-25T15:03:57.930+0800: 2.756: 
[ParNew: 1677824K->24417K(1887488K), 0.0249280 secs] 
1677824K->24417K(8178944K), 0.0251010 secs] [Times: user=0.24 sys=0.07, 
real=0.02 secs]
2016-02-25T15:12:46.498+0800: 531.324: [GC [1 CMS-initial-mark: 0K(6291456K)] 
780481K(8178944K), 0.0554170 secs] [Times: user=0.06 sys=0.00, real=0.07 secs]
2016-02-25T15:12:46.567+0800: 531.393: [CMS-concurrent-mark-start]
2016-02-25T15:12:46.574+0800: 531.400: [CMS-concurrent-mark: 0.006/0.007 secs] 
[Times: user=0.07 sys=0.02, real=0.01 secs]
2016-02-25T15:12:46.574+0800: 531.400: [CMS-concurrent-preclean-start]
2016-02-25T15:12:46.589+0800: 531.415: [CMS-concurrent-preclean: 0.015/0.015 
secs] [Times: user=0.16 sys=0.06, real=0.01 secs]
2016-02-25T15:12:46.589+0800: 531.415: [CMS-concurrent-abortable-preclean-start]
2016-02-25T15:12:47.364+0800: 532.190: [GC2016-02-25T15:12:47.364+0800: 
532.190: [ParNew: 1702241K->40866K(1887488K), 0.0254400 secs] 
1702241K->40866K(8178944K), 0.0256130 secs] [Times: user=0.34 sys=0.04, 
real=0.02 secs]
2016-02-25T15:12:48.546+0800: 533.372: [GC2016-02-25T15:12:48.546+0800: 
533.372: [ParNew: 1718690K->72736K(1887488K), 0.0326820 secs] 
1718690K->72736K(8178944K), 0.0328620 secs] [Times: user=0.62 sys=0.01, 
real=0.03 secs]
2016-02-25T15:12:49.084+0800: 533.910: [CMS-concurrent-abortable-preclean: 
0.128/2.495 secs] [Times: user=9.53 sys=17.87, real=2.50 secs]
2016-02-25T15:12:49.084+0800: 533.910: [GC[YG occupancy: 856902 K (1887488 
K)]2016-02-25T15:12:49.085+0800: 533.910: [Rescan (parallel) , 0.3019260 
secs]2016-02-25T15:12:49.387+0800: 534.212: [weak refs processing, 0.0000420 
secs]2016-02-25T15:12:49.387+0800: 534.212: [class unloading, 0.0015120 
secs]2016-02-25T15:12:49.388+0800: 534.214: [scrub symbol table, 0.0021410 
secs]2016-02-25T15:12:49.390+0800: 534.216: [scrub string table, 0.0003290 
secs] [1 CMS-remark: 0K(6291456K)] 856902K(8178944K), 0.3064870 secs] [Times: 
user=6.49 sys=0.00, real=0.31 secs] 
2016-02-25T15:12:49.391+0800: 534.217: [CMS-concurrent-sweep-start]
2016-02-25T15:12:49.396+0800: 534.222: [CMS-concurrent-sweep: 0.005/0.005 secs] 
[Times: user=0.03 sys=0.02, real=0.00 secs]
2016-02-25T15:12:49.396+0800: 534.222: [CMS-concurrent-reset-start]
2016-02-25T15:12:49.420+0800: 534.246: [CMS-concurrent-reset: 0.024/0.024 secs] 
[Times: user=0.10 sys=0.18, real=0.02 secs]
2016-02-25T15:12:50.013+0800: 534.839: [GC2016-02-25T15:12:50.013+0800: 
534.839: [ParNew: 1750560K->98424K(1887488K), 0.0523470 secs] 
1750560K->98424K(8178944K) icms_dc=5 , 0.0525090 secs] [Times: user=1.12 
sys=0.02, real=0.05 secs]
2016-02-25T15:12:50.609+0800: 535.435: [GC [1 CMS-initial-mark: 0K(6291456K)] 
897492K(8178944K), 0.3025000 secs] [Times: user=0.30 sys=0.00, real=0.31 secs]
2016-02-25T15:12:50.912+0800: 535.738: [CMS-concurrent-mark-start]
2016-02-25T15:12:50.916+0800: 535.742: [CMS-concurrent-mark: 0.004/0.004 secs] 
[Times: user=0.03 sys=0.03, real=0.00 secs]
2016-02-25T15:12:50.916+0800: 535.742: [CMS-concurrent-preclean-start]
2016-02-25T15:12:50.926+0800: 535.752: [CMS-concurrent-preclean: 0.011/0.011 
secs] [Times: user=0.05 sys=0.08, real=0.01 secs]
2016-02-25T15:12:50.926+0800: 535.752: [CMS-concurrent-abortable-preclean-start]
2016-02-25T15:12:51.541+0800: 536.367: [GC2016-02-25T15:12:51.541+0800: 
536.367: [ParNew: 1776248K->153122K(1887488K), 0.0840210 secs] 
1776248K->153122K(8178944K) icms_dc=5 , 0.0841990 secs] [Times: user=1.82 
sys=0.01, real=0.08 secs]
{code}
It seems this is not the main reason. Gc of time before datanode exiting seems 
normal.
{code}
2016-02-25T14:38:10.101+0800: 5430962.154: [GC2016-02-25T14:38:10.101+0800: 
5430962.154: [ParNew: 1680842K->2532K(1887488K), 0.0047330 secs] 
2902596K->1224299K(8178944K) icms_dc=0 , 0.0048760 secs] [Times: user=0.09 
sys=0.00, real=0.00 secs]
2016-02-25T14:39:43.301+0800: 5431055.354: [GC2016-02-25T14:39:43.301+0800: 
5431055.354: [ParNew: 1680356K->9382K(1887488K), 0.0196580 secs] 
2902123K->1231150K(8178944K) icms_dc=0 , 0.0198120 secs] [Times: user=0.19 
sys=0.01, real=0.02 secs]
2016-02-25T14:39:54.301+0800: 5431066.353: [GC2016-02-25T14:39:54.301+0800: 
5431066.353: [ParNew: 1687206K->8257K(1887488K), 0.0174490 secs] 
2908974K->1230036K(8178944K) icms_dc=0 , 0.0176300 secs] [Times: user=0.18 
sys=0.00, real=0.02 secs]
2016-02-25T14:40:43.231+0800: 5431115.284: [GC2016-02-25T14:40:43.231+0800: 
5431115.284: [ParNew: 1686081K->8975K(1887488K), 0.0288890 secs] 
2907860K->1230755K(8178944K) icms_dc=0 , 0.0291150 secs] [Times: user=0.29 
sys=0.01, real=0.03 secs]
2016-02-25T14:45:39.743+0800: 5431411.796: [GC2016-02-25T14:45:39.743+0800: 
5431411.796: [ParNew: 1686799K->22696K(1887488K), 0.0385700 secs] 
2908579K->1244476K(8178944K) icms_dc=0 , 0.0388280 secs] [Times: user=0.23 
sys=0.01, real=0.04 secs]
{code}
So it looks confusion. Attach the gc logs and datanode log.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to