[ https://issues.apache.org/jira/browse/HBASE-8912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861925#comment-13861925 ]
Jean-Marc Spaggiari edited comment on HBASE-8912 at 1/3/14 10:54 PM: --------------------------------------------------------------------- After the first restart, 36 regions are stuck in transition :( But not any server crashed. What I did: - Restored default balancer to make sure as much regions as possible will move. - Stop/start HBase - Run balancer from shell. Every thing is back up after a 2nd restart. I get many errors like this one: {code} 2014-01-03 16:03:03,958 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received FAILED_OPEN for region b75cb9067c3c4456d6198c9237c143b3 from server node4.domain.com,60020,1388782921790 but region was in the state page,rf.idua.www\x1Fhttp\x1F-1\x1F/fr/brand/fr/audi_fleet_solutions/contact/contact_transport_personnes.html\x1Fnull,1379103792232.b75cb9067c3c4456d6198c9237c143b3. state=CLOSED, ts=1388782983373, server=node4.domain.com,60020,1388782921790 and not in OFFLINE, PENDING_OPEN or OPENING {code} After investigations, I figured that snappy was missing on a server. I fixed that, restart: All seems to be fine. So I restored my customized balancer, restart, balanced. Still some warning in the logs: {code} 2014-01-03 16:21:52,864 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region db8e67acde26bf340da481d3c1b934cd from server node4.domain.com,60020,1388784051197 but region was in the state page,moc.tenretnigruoboc.www\x1Fhttp\x1F-1\x1F/cobourg-and-the-web\x1Fnull,1379103844627.db8e67acde26bf340da481d3c1b934cd. state=OPEN, ts=1388784100392, server=node4.domain.com,60020,1388784051197 and not in expected OFFLINE, PENDING_OPEN or OPENING states {code} But this time all the regions are assigned correctly. I did that one more time (change balancer, stop, start, balance. Change balancer, stop, start, balance). I turned loglevel to warn. {code} 2014-01-03 16:28:51,142 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 17bee313797fc1ce982c0e31fdb6620c from server node8.domain.com,60020,1388784498327 but region was in the state page,rf.ofniecnarf.www\x1Fhttp\x1F-1\x1F/vote/comment/27996/1/vote/zero_vote/c99b0992e5a9cd6bf3a4cfc91769ceeb\x1Fnull,1379104524006.17bee313797fc1ce982c0e31fdb6620c. state=OPEN, ts=1388784531048, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:52,135 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 6dc6290df1855b319f60bf89faa3da41 from server node8.domain.com,60020,1388784498327 but region was in the state page_crc,\x00\x00\x00\x00\xD7\xD9\x97\x8Bvideo.k-wreview.ca,1378042601904.6dc6290df1855b319f60bf89faa3da41. state=OPEN, ts=1388784531793, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:52,712 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region ec4f96b6cedd935aeba279b15d5337af from server node8.domain.com,60020,1388784498327 but region was in the state work_proposed,\x98\xBF\xAF\x90\x00\x00\x00\x00http://feedproxy.google.com/~r/WheatWeeds/~3/Of24fZKcpco/the-eighth-day-of-christmas.html,1378975430143.ec4f96b6cedd935aeba279b15d5337af. state=OPEN, ts=1388784532540, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:52,747 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 4f823b5de664556a89cbd86aa41cd0b0 from server node8.domain.com,60020,1388784498327 but region was in the state work_proposed,\x8D4K\xEA\x00\x00\x00\x00http://twitter.com/home?status=CartoonStock%3A++http%3A%2F%2Fwww%2Ecartoonstock%2Ecom%2Fdirectory%2Fc%2Fcream%5Ftea%5Fgifts%2Easp,1378681682935.4f823b5de664556a89cbd86aa41cd0b0. state=OPEN, ts=1388784532552, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:53,244 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region da0bd0a6b7187f731fb34d4ac14ca279 from server node8.domain.com,60020,1388784498327 but region was in the state work_proposed,\xB2\xE6\xB6\xBB\x00\x00\x00\x00http://www.canpages.ca/page/QC/notre-dame-des-prairies/concept-beton-design/4550984.html,1378737981443.da0bd0a6b7187f731fb34d4ac14ca279. state=OPEN, ts=1388784533203, server=node8.domain.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states {code} But everything finally got assigned without any restart required, any pretty quickly. Logs from the last run: {code} 2014-01-03 16:32:20,252 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":10969,"call":"balance(), rpc version=1, client version=29, methodsFingerPrint=1886733559","client":"192.168.23.7:54614","starttimems":1388784729247,"queuetimems":0,"class":"HMaster","responsesize":0,"method":"balance"} 2014-01-03 16:32:21,278 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 043d45cada6185d86e743754957e579a from server node1.domain.com,60020,1388784692832 but region was in the state page,moc.yubreffotseb.www\x1Fhttp\x1F-1\x1F/camera-bags-cases-straps-camera-bags-cases-c-282_888_580.html\x1Fzenid=ji3nr2ps8rnbaa7joc0lv4qln2,1388782516646.043d45cada6185d86e743754957e579a. state=OPEN, ts=1388784735731, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:32:21,713 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region e097f829b6eafb70b30c254fd4af662c from server node1.domain.com,60020,1388784692832 but region was in the state page,ac.usneeuq.ssenisub\x1Fhttp\x1F-1\x1F/grad_studies/PHD/about_us/queens_leaders_forum/about_us/about_us/grad_studies/PHD/student_career_services/queens_leaders_forum/recruiting/news/recruiting/about_us/accreditations.php\x1Fnull,1383168138496.e097f829b6eafb70b30c254fd4af662c. state=OPEN, ts=1388784736528, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:32:26,862 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region c94705f4a7c23a0a05a01bfe9d7755bc from server node1.domain.com,60020,1388784692832 but region was in the state entry,christian_labelle,1377000858428.c94705f4a7c23a0a05a01bfe9d7755bc. state=OPEN, ts=1388784740724, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:32:34,516 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region a8d60db86bd03cfbfba0bae0bb3cb564 from server node1.domain.com,60020,1388784692832 but region was in the state work_proposed,W{8\xB2\x00\x00\x00\x00http://www.prairiesouth.ca/williamgrayson/calendar-mainmenu-26/day.listevents/2013/10/29/23.html,1383415634227.a8d60db86bd03cfbfba0bae0bb3cb564. state=OPEN, ts=1388784754264, server=node1.domain.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states {code} So overall, it's WAY more stable! I have not been able to get anything stuck or crashed with the 2 patchs applied. I will keep them ;) Big +1 from me. Thanks for fixing that. I think it might be easy for fix the last few remaining warnings... was (Author: jmspaggi): After the first restart, 36 regions are stuck in transition :( But not any server crashed. What I did: - Restored default balancer to make sure as much regions as possible will move. - Stop/start HBase - Run balancer from shell. Every thing is back up after a 2nd restart. I get many errors like this one: {code} 2014-01-03 16:03:03,958 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received FAILED_OPEN for region b75cb9067c3c4456d6198c9237c143b3 from server node4.domain.com,60020,1388782921790 but region was in the state page,rf.idua.www\x1Fhttp\x1F-1\x1F/fr/brand/fr/audi_fleet_solutions/contact/contact_transport_personnes.html\x1Fnull,1379103792232.b75cb9067c3c4456d6198c9237c143b3. state=CLOSED, ts=1388782983373, server=node4.domain.com,60020,1388782921790 and not in OFFLINE, PENDING_OPEN or OPENING {code} After investigations, I figured that snappy was missing on a server. I fixed that, restart: All seems to be fine. So I restored my customized balancer, restart, balanced. Still some warning in the logs: {code} 2014-01-03 16:21:52,864 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region db8e67acde26bf340da481d3c1b934cd from server node4.domain.com,60020,1388784051197 but region was in the state page,moc.tenretnigruoboc.www\x1Fhttp\x1F-1\x1F/cobourg-and-the-web\x1Fnull,1379103844627.db8e67acde26bf340da481d3c1b934cd. state=OPEN, ts=1388784100392, server=node4.distparser.com,60020,1388784051197 and not in expected OFFLINE, PENDING_OPEN or OPENING states {code} But this time all the regions are assigned correctly. I did that one more time (change balancer, stop, start, balance. Change balancer, stop, start, balance). I turned loglevel to warn. {code} 2014-01-03 16:28:51,142 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 17bee313797fc1ce982c0e31fdb6620c from server node8.domain.com,60020,1388784498327 but region was in the state page,rf.ofniecnarf.www\x1Fhttp\x1F-1\x1F/vote/comment/27996/1/vote/zero_vote/c99b0992e5a9cd6bf3a4cfc91769ceeb\x1Fnull,1379104524006.17bee313797fc1ce982c0e31fdb6620c. state=OPEN, ts=1388784531048, server=node8.distparser.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:52,135 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 6dc6290df1855b319f60bf89faa3da41 from server node8.domain.com,60020,1388784498327 but region was in the state page_crc,\x00\x00\x00\x00\xD7\xD9\x97\x8Bvideo.k-wreview.ca,1378042601904.6dc6290df1855b319f60bf89faa3da41. state=OPEN, ts=1388784531793, server=node8.distparser.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:52,712 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region ec4f96b6cedd935aeba279b15d5337af from server node8.domain.com,60020,1388784498327 but region was in the state work_proposed,\x98\xBF\xAF\x90\x00\x00\x00\x00http://feedproxy.google.com/~r/WheatWeeds/~3/Of24fZKcpco/the-eighth-day-of-christmas.html,1378975430143.ec4f96b6cedd935aeba279b15d5337af. state=OPEN, ts=1388784532540, server=node8.distparser.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:52,747 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 4f823b5de664556a89cbd86aa41cd0b0 from server node8.distparser.com,60020,1388784498327 but region was in the state work_proposed,\x8D4K\xEA\x00\x00\x00\x00http://twitter.com/home?status=CartoonStock%3A++http%3A%2F%2Fwww%2Ecartoonstock%2Ecom%2Fdirectory%2Fc%2Fcream%5Ftea%5Fgifts%2Easp,1378681682935.4f823b5de664556a89cbd86aa41cd0b0. state=OPEN, ts=1388784532552, server=node8.distparser.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:28:53,244 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region da0bd0a6b7187f731fb34d4ac14ca279 from server node8.domain.com,60020,1388784498327 but region was in the state work_proposed,\xB2\xE6\xB6\xBB\x00\x00\x00\x00http://www.canpages.ca/page/QC/notre-dame-des-prairies/concept-beton-design/4550984.html,1378737981443.da0bd0a6b7187f731fb34d4ac14ca279. state=OPEN, ts=1388784533203, server=node8.distparser.com,60020,1388784498327 and not in expected OFFLINE, PENDING_OPEN or OPENING states {code} But everything finally got assigned without any restart required, any pretty quickly. Logs from the last run: {code} 2014-01-03 16:32:20,252 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":10969,"call":"balance(), rpc version=1, client version=29, methodsFingerPrint=1886733559","client":"192.168.23.7:54614","starttimems":1388784729247,"queuetimems":0,"class":"HMaster","responsesize":0,"method":"balance"} 2014-01-03 16:32:21,278 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region 043d45cada6185d86e743754957e579a from server node1.distparser.com,60020,1388784692832 but region was in the state page,moc.yubreffotseb.www\x1Fhttp\x1F-1\x1F/camera-bags-cases-straps-camera-bags-cases-c-282_888_580.html\x1Fzenid=ji3nr2ps8rnbaa7joc0lv4qln2,1388782516646.043d45cada6185d86e743754957e579a. state=OPEN, ts=1388784735731, server=node1.distparser.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:32:21,713 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region e097f829b6eafb70b30c254fd4af662c from server node1.distparser.com,60020,1388784692832 but region was in the state page,ac.usneeuq.ssenisub\x1Fhttp\x1F-1\x1F/grad_studies/PHD/about_us/queens_leaders_forum/about_us/about_us/grad_studies/PHD/student_career_services/queens_leaders_forum/recruiting/news/recruiting/about_us/accreditations.php\x1Fnull,1383168138496.e097f829b6eafb70b30c254fd4af662c. state=OPEN, ts=1388784736528, server=node1.distparser.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:32:26,862 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region c94705f4a7c23a0a05a01bfe9d7755bc from server node1.distparser.com,60020,1388784692832 but region was in the state entry,christian_labelle,1377000858428.c94705f4a7c23a0a05a01bfe9d7755bc. state=OPEN, ts=1388784740724, server=node1.distparser.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states 2014-01-03 16:32:34,516 WARN org.apache.hadoop.hbase.master.AssignmentManager: Received OPENED for region a8d60db86bd03cfbfba0bae0bb3cb564 from server node1.distparser.com,60020,1388784692832 but region was in the state work_proposed,W{8\xB2\x00\x00\x00\x00http://www.prairiesouth.ca/williamgrayson/calendar-mainmenu-26/day.listevents/2013/10/29/23.html,1383415634227.a8d60db86bd03cfbfba0bae0bb3cb564. state=OPEN, ts=1388784754264, server=node1.distparser.com,60020,1388784692832 and not in expected OFFLINE, PENDING_OPEN or OPENING states {code} So overall, it's WAY more stable! I have not been able to get anything stuck or crashed with the 2 patchs applied. I will keep them ;) Big +1 from me. Thanks for fixing that. I think it might be easy for fix the last few remaining warnings... > [0.94] AssignmentManager throws IllegalStateException from PENDING_OPEN to > OFFLINE > ---------------------------------------------------------------------------------- > > Key: HBASE-8912 > URL: https://issues.apache.org/jira/browse/HBASE-8912 > Project: HBase > Issue Type: Bug > Reporter: Enis Soztutar > Assignee: Lars Hofhansl > Priority: Critical > Fix For: 0.94.16 > > Attachments: 8912-0.94-alt2.txt, 8912-0.94.txt, 8912-fix-race.txt, > HBASE-8912.patch, HBase-0.94 #1036 test - testRetrying [Jenkins].html, > log.txt, org.apache.hadoop.hbase.catalog.TestMetaReaderEditor-output.txt > > > AM throws this exception which subsequently causes the master to abort: > {code} > java.lang.IllegalStateException: Unexpected state : > testRetrying,jjj,1372891751115.9b828792311001062a5ff4b1038fe33b. > state=PENDING_OPEN, ts=1372891751912, > server=hemera.apache.org,39064,1372891746132 .. Cannot transit it to OFFLINE. > at > org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1879) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1688) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1424) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1399) > at > org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1394) > at > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:105) > at > org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:175) > at > java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918) > at java.lang.Thread.run(Thread.java:662) > {code} > This exception trace is from the failing test TestMetaReaderEditor which is > failing pretty frequently, but looking at the test code, I think this is not > a test-only issue, but affects the main code path. > https://builds.apache.org/job/HBase-0.94/1036/testReport/junit/org.apache.hadoop.hbase.catalog/TestMetaReaderEditor/testRetrying/ -- This message was sent by Atlassian JIRA (v6.1.5#6160)