[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817363#comment-13817363 ]

Jean-Marc Spaggiari commented on HBASE-6461:

Has no one looked at this issue?

Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.

Key: HBASE-6461
URL: https://issues.apache.org/jira/browse/HBASE-6461
Project: HBase
Issue Type: Bug
Environment: hadoop-0.20.2-cdh3u3, HBase 0.94.1 RC1
Reporter: Elliott Clark
Priority: Critical

Spun up a new DFS on hadoop-0.20.2-cdh3u3.
Started HBase.
Started running the loadtest tool.
Killed the RS and DN holding ROOT with killall -9 java on server sv4r27s44 at about 2012-07-25 22:40:00.
After things stabilized, ROOT is in a bad state.

Ran hbck and got:

Exception in thread "main" org.apache.hadoop.hbase.client.NoServerForRegionException: No server address listed in -ROOT- for region .META.,,1.1028785192 containing row
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:1016)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:841)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:810)
    at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:232)
    at org.apache.hadoop.hbase.client.HTable.init(HTable.java:172)
    at org.apache.hadoop.hbase.util.HBaseFsck.connect(HBaseFsck.java:241)
    at org.apache.hadoop.hbase.util.HBaseFsck.main(HBaseFsck.java:3236)

hbase(main):001:0> scan '-ROOT-'
ROW COLUMN+CELL
12/07/25 22:43:18 INFO security.UserGroupInformation: JAAS Configuration already set up for Hadoop, not re-installing.
.META.,,1 column=info:regioninfo, timestamp=1343255838525, value={NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}
.META.,,1 column=info:v, timestamp=1343255838525, value=\x00\x00
1 row(s) in 0.5930 seconds

Here's the master log: https://gist.github.com/3179194

I tried the same thing with 0.92.1 and was able to get into a similar situation, so I don't think this is anything new.

--
This message was sent by Atlassian JIRA (v6.1#6144)
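For readers following the scan output in this report: the -ROOT- row for .META.,,1 carries info:regioninfo and info:v but no info:server cell, which is consistent with the "No server address listed" message from hbck. A minimal Java sketch of that check, assuming a catalog row is modelled as a plain map from column name to value. The class name, method names, and the example host are hypothetical, and this is not HBase client code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CatalogRowCheck {
    // Column assumed to hold the hosting server's address in a catalog row.
    static final String SERVER_COLUMN = "info:server";

    /** Returns true when the row names a non-empty server address. */
    static boolean hasServerAddress(Map<String, String> row) {
        String server = row.get(SERVER_COLUMN);
        return server != null && !server.isEmpty();
    }

    /** Builds a row with the same two columns as the scan output in this report. */
    static Map<String, String> rowFromReport() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("info:regioninfo",
                "{NAME => '.META.,,1', STARTKEY => '', ENDKEY => '', ENCODED => 1028785192,}");
        row.put("info:v", "\\x00\\x00");
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> row = rowFromReport();
        System.out.println("has server address: " + hasServerAddress(row));
        // A healthy row would also carry a server cell (this host:port is made up).
        row.put(SERVER_COLUMN, "example-host:60020");
        System.out.println("has server address: " + hasServerAddress(row));
    }
}
```

In this sketch the reported row fails the check simply because the server column is absent, which matches what the exception message describes.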
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423695#comment-13423695 ]

Lars Hofhansl commented on HBASE-6461:

I thought HBase neither uses nor needs append (the reason for initially using the append branch was that it also had the sync patches).
Where do we ever rely on a reader and a writer concurrently reading from and writing to the same file (the HLog in this case)?
Is this only a special case for the WAL when we start to replay the edits and the file is not yet closed because the writer died?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423700#comment-13423700 ]

Elliott Clark commented on HBASE-6461:

@nkeywal I'm not seeing that log exception. However, everything else is exactly the same.
This only shows up when I kill the DN and RS at the same time. It also seems to manifest more often when the log splitting starts very soon after the java processes are killed.
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423712#comment-13423712 ]

ramkrishna.s.vasudevan commented on HBASE-6461:

Sorry for pitching in late here. We faced some issues related to 0-length HLog files.
As Uma has pointed out in HDFS-3701 or HDFS-3222, we had some patches on the HDFS side, and a workaround was also done on the HBase side to fix this issue internally.
I shall discuss this further with Uma. Thanks all.
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423883#comment-13423883 ]

nkeywal commented on HBASE-6461:

@Ram Thanks a lot!

@Lars
bq. I thought HBase neither uses nor needs append (the reason for initial using the append branch was that it also had the sync patches).
Yes, the log line is actually confusing. You just need to have someone writing a file and someone else reading it.

bq. Is this only a special case for the WAL when we start to replay the edits, when the file is not closed, yet, because the writer died?
I think so. Still, I wonder whether we could hit something similar with large store HFiles + crashes, even if on paper it should be OK.
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13424012#comment-13424012 ]

Lars Hofhansl commented on HBASE-6461:

I think HFiles are fine, since new HFiles are not (or should not be) visible until the write has succeeded. A compaction writes a new file; only when that succeeds are the old files deleted.
Flushes are more interesting. If a flush fails halfway because the RS/DN dies, the in-memory state will be lost and the HLog is the source of truth. But the HFile has been half written. Hmmm.
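The visibility argument in this comment (a new file only becomes visible, and the old ones are only removed, once the write has succeeded) resembles the familiar write-to-temporary-then-rename pattern. A minimal local-filesystem sketch in Java, assuming java.nio.file rather than HDFS. The class name, the .tmp suffix, and the file names are hypothetical, and this is not presented as how HBase itself is implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PublishAfterWrite {
    /**
     * Writes data to a temporary sibling file and renames it into place only
     * after the write has completed, so a reader never sees a half-written
     * file under the final name.
     */
    static Path publish(Path dir, String name, byte[] data) {
        try {
            Path tmp = dir.resolve(name + ".tmp");
            Path target = dir.resolve(name);
            Files.write(tmp, data);
            // If the process dies before this line, only the .tmp file exists.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs the pattern in a fresh temp directory and reports what a reader would see. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("publish-demo");
            Path target = publish(dir, "storefile", "cells".getBytes(StandardCharsets.UTF_8));
            boolean finalVisible = Files.exists(target);
            boolean tmpGone = !Files.exists(dir.resolve("storefile.tmp"));
            return finalVisible && tmpGone;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("published cleanly: " + demo());
    }
}
```

In this sketch, the flush case raised in the comment corresponds to a crash between the write and the rename: only the temporary file is left behind, and recovery has to come from the log.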
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423474#comment-13423474 ]

Lars Hofhansl commented on HBASE-6461:

Marked Critical against 0.94.2.
Could also see this as a blocker for 0.94.1 (although 0.94.1 needs to go out at some point, because of HBASE-6311).
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423522#comment-13423522 ]

Lars Hofhansl commented on HBASE-6461:

Any chance to repeat this with distributed log splitting disabled?
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423535#comment-13423535 ]

Elliott Clark commented on HBASE-6461:

bq. Any chance to repeat this with distributed log splitting disabled?
Sure, gimme a little bit. I'm reading the logs on my latest example. Once JD and I are done (assuming that we don't solve it), I'll try with distributed log splitting off.

On the latest run JD noticed this in the logs of the server that was doing the splitting (there was only one HLog; nothing had rolled yet). 10.4.3.44 is the server that had all java processes killed.

{noformat}
2012-07-26 22:13:18,079 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: File hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 might be still open, length is 0
2012-07-26 22:13:18,081 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Recovering file hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
2012-07-26 22:13:19,083 INFO org.apache.hadoop.hbase.util.FSHDFSUtils: Finished lease recover attempt for hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132
2012-07-26 22:13:20,091 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 0 time(s).
2012-07-26 22:13:21,092 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 1 time(s).
2012-07-26 22:13:22,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 2 time(s).
2012-07-26 22:13:23,093 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 3 time(s).
2012-07-26 22:13:24,094 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 4 time(s).
2012-07-26 22:13:25,095 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 5 time(s).
2012-07-26 22:13:26,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 6 time(s).
2012-07-26 22:13:27,096 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 7 time(s).
2012-07-26 22:13:27,936 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=50.28 MB, free=5.94 GB, max=5.99 GB, blocks=0, accesses=0, hits=0, hitRatio=0cachingAccesses=0, cachingHits=0, cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN
2012-07-26 22:13:28,097 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 8 time(s).
2012-07-26 22:13:29,098 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /10.4.3.44:9910. Already tried 9 time(s).
2012-07-26 22:13:29,100 WARN org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Could not open hdfs://sv4r11s38:9901/hbase/.logs/sv4r3s44,9913,1343340207863-splitting/sv4r3s44%2C9913%2C1343340207863.1343340210132 for reading. File is empty
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at java.io.DataInputStream.readFully(DataInputStream.java:152)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1521)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1493)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1480)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1475)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader$WALReader.init(SequenceFileLogReader.java:55)
    at org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogReader.init(SequenceFileLogReader.java:175)
    at org.apache.hadoop.hbase.regionserver.wal.HLog.getReader(HLog.java:716)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:821)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.getReader(HLogSplitter.java:734)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:381)
    at org.apache.hadoop.hbase.regionserver.wal.HLogSplitter.splitLogFile(HLogSplitter.java:348)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker$1.exec(SplitLogWorker.java:111)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.grabTask(SplitLogWorker.java:264)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.taskLoop(SplitLogWorker.java:195)
    at org.apache.hadoop.hbase.regionserver.SplitLogWorker.run(SplitLogWorker.java:163)
    at java.lang.Thread.run(Thread.java:662)
{noformat}

So the log splitter is trying the dead server over and over. Then failing
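The java.io.EOFException at the top of the trace in this comment is what DataInputStream.readFully throws when a stream ends before the requested bytes arrive, which is what a zero-length file produces. A small Java sketch that reproduces the same exception on a local empty file. The 4-byte header size and the class and method names are assumptions for illustration only, and no Hadoop classes are involved.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EmptyLogRead {
    /** Tries to read a fixed-size header; returns false if the file ends first. */
    static boolean canReadHeader(Path file, int headerBytes) {
        try (InputStream raw = Files.newInputStream(file);
             DataInputStream in = new DataInputStream(raw)) {
            // readFully throws EOFException if fewer than headerBytes are available.
            in.readFully(new byte[headerBytes]);
            return true;
        } catch (EOFException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Creates a temp file with the given content and checks for a 4-byte header. */
    static boolean demo(byte[] content) {
        try {
            Path file = Files.createTempFile("hlog-demo", ".log");
            Files.write(file, content);
            return canReadHeader(file, 4);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("empty file readable: " + demo(new byte[0]));
        System.out.println("4-byte file readable: " + demo(new byte[] {1, 2, 3, 4}));
    }
}
```

The sketch only shows why an empty file fails at the first header read; it says nothing about why the file length was reported as 0 in the first place, which is the open question in this thread.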
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423601#comment-13423601 ]

Elliott Clark commented on HBASE-6461:

Small update: I tried it again (well, 30 times actually) with more logs enabled, and I noticed this in the NameNode log:

{noformat}
eclark@sv4r11s38:/export1/eclark$ grep "recovery started" /export1/eclark/logs/hadoop-eclark-namenode-sv4r11s38.log
2012-07-27 00:39:45,094 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* blk_3380368109770176913_1021 recovery started, primary=10.4.6.38:9908
{noformat}

The primary listed is the server that was just killed, and the block ID is the ID of the RegionServer's HLog.
According to the comments around the log message, the primary is supposed to be a live DataNode.

I'm wondering if this is an HDFS bug. Thoughts?
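The expectation described in this comment, that the primary chosen for block recovery should be a live DataNode, can be pictured as a filter over the block's replica locations. A hypothetical Java sketch of that invariant follows. It is not NameNode code: the class, the method, and the second address are invented for illustration, and only 10.4.6.38:9908 comes from the log line in this thread.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class RecoveryPrimaryChoice {
    /** Picks the first replica location that is currently alive, if any. */
    static Optional<String> choosePrimary(List<String> replicas, Set<String> liveNodes) {
        return replicas.stream().filter(liveNodes::contains).findFirst();
    }

    public static void main(String[] args) {
        // 10.4.6.38:9908 is the primary from the NameNode log; the other address is made up.
        List<String> replicas = Arrays.asList("10.4.6.38:9908", "10.0.0.2:9908");
        // The killed node is deliberately missing from the live set.
        Set<String> live = new HashSet<>(Collections.singletonList("10.0.0.2:9908"));
        System.out.println(choosePrimary(replicas, live).orElse("none alive"));
    }
}
```

Under this reading, a just-killed node could still be selected if the live set is stale, for example because the node has not yet been declared dead, which is the direction the follow-up question about timing points in.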
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13423605#comment-13423605 ]

Zhihong Ted Yu commented on HBASE-6461:

So by 2012-07-27 00:39:45,094, the java processes on 10.4.6.38 had been dead? For how long?
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13423608#comment-13423608 ]
Lars Hofhansl commented on HBASE-6461:
Related to HBASE-6401?
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13423654#comment-13423654 ]
Elliott Clark commented on HBASE-6461:
@Ted ~30 seconds.
@Lars Yep, it seems like exactly the same issue. Root is getting created from a 0-length hlog.
[jira] [Commented] (HBASE-6461) Killing the HRegionServer and DataNode hosting ROOT can result in a malformed root table.
[ https://issues.apache.org/jira/browse/HBASE-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13423689#comment-13423689 ]
nkeywal commented on HBASE-6461:
For HBASE-6401, the HDFS JIRA is HDFS-3701; Uma said he will have a patch next week. You can validate that it's the root cause by changing the log level to debug for org.apache.hadoop.hdfs.DFSClient. You should get this log line:
LOG.debug("DFSClient file " + src + " is being concurrently append to" + " but datanode " + primaryNode.getHostName() + " probably does not have block " + last.getBlock());
There is a patch in HBASE-6435. It greatly lowers the probability of hitting this bug, so you may want to try it (it's not totally finished, however, but some reviews will help, so... :-) ). Also, it's for trunk but should apply easily to 0.94.
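Once debug logging is enabled for org.apache.hadoop.hdfs.DFSClient (with log4j 1.x properties this would presumably be a line such as log4j.logger.org.apache.hadoop.hdfs.DFSClient=DEBUG), the log can be searched for the message quoted above. The following is a rough sketch of such a search, not part of HBase or HDFS: the class name is made up for illustration, and it assumes the two fixed fragments of the quoted LOG.debug message appear verbatim on a single log line.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: finds log lines matching the DFSClient debug message.
class DfsClientLogScan {
    // Fixed fragments taken from the LOG.debug call quoted in the comment.
    static final String APPEND_MARKER = "is being concurrently append to";
    static final String BLOCK_MARKER = "probably does not have block";

    // Returns the lines that contain both fragments of the debug message.
    static List<String> matchingLines(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (line.contains(APPEND_MARKER) && line.contains(BLOCK_MARKER)) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java DfsClientLogScan <log-file>");
            return;
        }
        // Print every matching line from the given log file.
        for (String hit : matchingLines(Files.readAllLines(Paths.get(args[0])))) {
            System.out.println(hit);
        }
    }
}
```

If the scan finds a matching line around the time the region server and datanode were killed, that would support HDFS-3701 as the root cause; no match does not rule it out, since the log level may not have been raised in the right process.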