+Dev. I think, number one, we fix whatever is leaving regions in this state. I think we could put logic into hbck for this.
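To make the idea concrete, the kind of check hbck could do might look roughly like this: walk every region's column-family directories, find split-reference ("pointer") files, and flag any whose referenced parent storefile is gone. This is only a sketch, not hbck code — it mimics the table layout on a local filesystem, and the `<hfile>.<parent-region>` naming of reference files is an assumption for illustration:

```python
import os

def find_dangling_references(table_dir):
    """Scan every <region>/<family> directory under table_dir and report
    split-reference ("pointer") files whose referenced parent storefile is
    missing.  A reference file is assumed to be named
    <parent-hfile>.<parent-region-encoded-name> and to point at
    <table_dir>/<parent-region>/<family>/<parent-hfile>."""
    dangling = []
    for region in os.listdir(table_dir):
        region_dir = os.path.join(table_dir, region)
        if not os.path.isdir(region_dir):
            continue  # skip .tableinfo and other non-region entries
        for family in os.listdir(region_dir):
            family_dir = os.path.join(region_dir, family)
            if not os.path.isdir(family_dir):
                continue  # skip .regioninfo etc.
            for name in os.listdir(family_dir):
                if '.' not in name:
                    continue  # plain storefile, not a reference
                hfile, parent_region = name.rsplit('.', 1)
                target = os.path.join(table_dir, parent_region, family, hfile)
                if not os.path.exists(target):
                    dangling.append(os.path.join(family_dir, name))
    return dangling
```

A report-only pass like this would be the safe first step; actually deleting the dangling pointers (as described below) is riskier and should probably stay behind an explicit -fix style flag.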
On Sat, Feb 23, 2013 at 7:36 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Hi Kevin,
>
> I stopped HBase to merge some regions, so I already had to deal with the
> downtime. But with online merge coming, it's very good to know the
> online way to do it.
>
> Now, is there an automated way to do it? In HBCK? Maybe we can check each
> region for links, check that those links exist, and if not, remove them?
> Or would it be too risky?
>
> JM
>
> 2013/2/23 Kevin O'dell <kevin.od...@cloudera.com>
>
> > JM,
> >
> > Here is what I am seeing:
> >
> > 2013-02-23 15:46:14,630 ERROR
> > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
> > region=entry,ac.adanac-oidar.www\x1Fhttp\x1F-1\x1F/sports/patinage/2012/04/04/001-artistique-trophee-mondial.shtml\x1Fnull,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > starting to roll back the global memstore size.
> >
> > If you checked 6dd77bc9ff91e0e6d413f74e670ab435 you should have seen some
> > pointer files to 2ebfef593a3d715b59b85670909182c9. Typically, you would
> > see the storefiles in 6dd77bc9ff91e0e6d413f74e670ab435, and
> > 2ebfef593a3d715b59b85670909182c9 would have been empty from a bad split.
> > What I do is delete the pointers that don't reference any storefiles.
> > Then you can clear the unassigned folder in zkCli. Finally, run an
> > unassign on the RITs. This way there is no downtime and you don't have
> > to drop any tables.
> >
> > On Sat, Feb 23, 2013 at 6:14 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> >
> > > Hi Kevin,
> > >
> > > Thanks for taking the time to reply.
> > >
> > > Here is a bigger extract of the logs. I don't see another path in the
> > > logs.
> > >
> > > http://pastebin.com/uMxGyjKm
> > >
> > > I can send you the entire log if you want (42 MB).
> > >
> > > What I did is: I merged many regions together, then altered the table to
> > > set the max_filesize and started a major_compaction to get the table
> > > split.
> > >
> > > To fix the issue I had to drop one working table and run -repair
> > > multiple times. Now it's fixed, but I still have the logs.
> > >
> > > I'm redoing all the steps I did. Maybe I will face the issue again. If
> > > I'm able to reproduce it, we might be able to figure out where the
> > > issue is...
> > >
> > > JM
> > >
> > > 2013/2/23 Kevin O'dell <kevin.od...@cloudera.com>
> > >
> > > > JM,
> > > >
> > > > How are you doing today? Right before the "file does not exist" there
> > > > should be another path. Can you let me know if in that path there are
> > > > pointers from a split to 2ebfef593a3d715b59b85670909182c9? The
> > > > directory may already exist. I have seen this a couple of times now
> > > > and am trying to ferret out a root cause to open a JIRA with. I
> > > > suspect we have a split code bug in .92+.
> > > >
> > > > On Sat, Feb 23, 2013 at 4:10 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I have 2 regions transitioning from server to server for 15 minutes
> > > > > now.
> > > > >
> > > > > I have nothing in the master logs about those 2 regions, but in the
> > > > > region server logs I have some "file not found" errors:
> > > > >
> > > > > 2013-02-23 16:02:07,347 ERROR
> > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Failed open of
> > > > > region=entry,theykey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435.,
> > > > > starting to roll back the global memstore size.
> > > > > java.io.IOException: java.io.IOException: java.io.FileNotFoundException:
> > > > > File does not exist:
> > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion.initializeRegionInternals(HRegion.java:597)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion.initialize(HRegion.java:510)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4177)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion.openHRegion(HRegion.java:4125)
> > > > >   at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.openRegion(OpenRegionHandler.java:328)
> > > > >   at org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler.process(OpenRegionHandler.java:100)
> > > > >   at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:169)
> > > > >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> > > > >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> > > > >   at java.lang.Thread.run(Thread.java:722)
> > > > > Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist:
> > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > >   at org.apache.hadoop.hbase.regionserver.Store.loadStoreFiles(Store.java:433)
> > > > >   at org.apache.hadoop.hbase.regionserver.Store.<init>(Store.java:240)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion.instantiateHStore(HRegion.java:3141)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:572)
> > > > >   at org.apache.hadoop.hbase.regionserver.HRegion$1.call(HRegion.java:570)
> > > > >   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > > >   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > > >   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> > > > >   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> > > > >   at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> > > > >   ... 3 more
> > > > > Caused by: java.io.FileNotFoundException: File does not exist:
> > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > >   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1843)
> > > > >   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1834)
> > > > >   at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:578)
> > > > >   at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:154)
> > > > >   at org.apache.hadoop.fs.FilterFileSystem.open(FilterFileSystem.java:108)
> > > > >   at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:427)
> > > > >   at org.apache.hadoop.hbase.io.hfile.HFile.createReaderWithEncoding(HFile.java:573)
> > > > >   at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.<init>(StoreFile.java:1261)
> > > > >   at org.apache.hadoop.hbase.io.HalfStoreFileReader.<init>(HalfStoreFileReader.java:70)
> > > > >   at org.apache.hadoop.hbase.regionserver.StoreFile.open(StoreFile.java:508)
> > > > >   at org.apache.hadoop.hbase.regionserver.StoreFile.createReader(StoreFile.java:603)
> > > > >   at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:409)
> > > > >   at org.apache.hadoop.hbase.regionserver.Store$1.call(Store.java:404)
> > > > >   ... 8 more
> > > > > 2013-02-23 16:02:07,370 WARN org.apache.hadoop.hbase.zookeeper.ZKAssign:
> > > > > regionserver:60020-0x13d07ec012501fc Attempt to transition the unassigned
> > > > > node for 6dd77bc9ff91e0e6d413f74e670ab435 from RS_ZK_REGION_OPENING to
> > > > > RS_ZK_REGION_FAILED_OPEN failed, the node existed but was version 6586,
> > > > > not the expected version 6585
> > > > >
> > > > > If I try hbck -fix, it brings the master down:
> > > > >
> > > > > 2013-02-23 16:03:01,419 INFO org.apache.hadoop.hbase.master.HMaster: BalanceSwitch=false
> > > > > 2013-02-23 16:03:03,067 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> > > > > 2013-02-23 16:03:03,068 FATAL org.apache.hadoop.hbase.master.HMaster: Unexpected state :
> > > > > entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435. state=PENDING_OPEN,
> > > > > ts=1361653383067, server=node2,60020,1361653023303 .. Cannot transit it to OFFLINE.
> > > > > java.lang.IllegalStateException: Unexpected state :
> > > > > entry,thekey,1361651769136.6dd77bc9ff91e0e6d413f74e670ab435. state=PENDING_OPEN,
> > > > > ts=1361653383067, server=node2,60020,1361653023303 .. Cannot transit it to OFFLINE.
> > > > >   at org.apache.hadoop.hbase.master.AssignmentManager.setOfflineInZooKeeper(AssignmentManager.java:1813)
> > > > >   at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1658)
> > > > >   at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1423)
> > > > >   at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1398)
> > > > >   at org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:1393)
> > > > >   at org.apache.hadoop.hbase.master.HMaster.assignRegion(HMaster.java:1740)
> > > > >   at org.apache.hadoop.hbase.master.HMaster.assign(HMaster.java:1731)
> > > > >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > > > >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> > > > >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > >   at java.lang.reflect.Method.invoke(Method.java:601)
> > > > >   at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:320)
> > > > >   at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1426)
> > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster: Aborting
> > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.ipc.HBaseServer: Stopping server on 60000
> > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.CatalogJanitor: node3,60000,1361653064588-CatalogJanitor exiting
> > > > > 2013-02-23 16:03:03,069 INFO org.apache.hadoop.hbase.master.HMaster$2:
> > > > > node3,60000,1361653064588-BalancerChore exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 5 on 60000: exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 4 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60000: exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.cleaner.HFileCleaner: master-node3,60000,1361653064588.archivedHFileCleaner exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.cleaner.LogCleaner: master-node3,60000,1361653064588.oldLogCleaner exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.hbase.master.HMaster: Stopping infoServer
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 1 on 60000: exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 2 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call isMasterRunning(), rpc version=1, client version=29, methodsFingerPrint=891823089 from 192.168.23.7:43381: output error
> > > > > 2013-02-23 16:03:03,071 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60000 caught a ClosedChannelException, this means that the server was processing a request but the client went away. The error message was: null
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60000: exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 1 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 INFO org.mortbay.log: Stopped SelectChannelConnector@0.0.0.0:60010
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server Responder
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 6 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 7 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 0 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 2 on 60000: exiting
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: Stopping IPC Server listener on 60000
> > > > > 2013-02-23 16:03:03,071 INFO org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60000: exiting
> > > > > 2013-02-23 16:03:03,070 INFO org.apache.hadoop.ipc.HBaseServer: REPL IPC Server handler 0 on 60000: exiting
> > > > > 2013-02-23 16:03:03,287 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Closed zookeeper sessionid=0x33d07f1130301fe
> > > > > 2013-02-23 16:03:03,453 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimerUpdater: node3,60000,1361653064588.timerUpdater exiting
> > > > > 2013-02-23 16:03:03,453 INFO org.apache.hadoop.hbase.master.AssignmentManager$TimeoutMonitor:
> > > > > node3,60000,1361653064588.timeoutMonitor exiting
> > > > > 2013-02-23 16:03:03,453 INFO org.apache.hadoop.hbase.master.SplitLogManager$TimeoutMonitor: node3,60000,1361653064588.splitLogManagerTimeoutMonitor exiting
> > > > > 2013-02-23 16:03:03,468 INFO org.apache.hadoop.hbase.master.HMaster: HMaster main thread exiting
> > > > > 2013-02-23 16:03:03,469 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> > > > > java.lang.RuntimeException: HMaster Aborted
> > > > >   at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:160)
> > > > >   at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:104)
> > > > >   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > > >   at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76)
> > > > >   at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1927)
> > > > >
> > > > > I'm running 0.94.5 plus
> > > > > HBASE-7824 <https://issues.apache.org/jira/browse/HBASE-7824> and
> > > > > HBASE-7865 <https://issues.apache.org/jira/browse/HBASE-7865>. I don't
> > > > > think the two patches are related to this issue.
> > > > >
> > > > > Hadoop fsck reports "The filesystem under path '/' is HEALTHY" without
> > > > > any issue.
> > > > >
> > > > > /hbase/entry/2ebfef593a3d715b59b85670909182c9/a/62b0aae45d59408dbcfc513954efabc7
> > > > > does exist in the FS.
> > > > >
> > > > > What I don't understand is why the master is going down. And how can I
> > > > > fix that?
> > > > >
> > > > > I will try to create the missing directory and see the results...
> > > > > Thanks,
> > > > >
> > > > > JM

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
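A footnote on JM's last debugging step: one quick way to cross-check every path the region server complained about is to scrape the "File does not exist:" messages out of the log, then test each extracted path against HDFS (e.g. with `hadoop fs -ls <path>`). A minimal, hypothetical helper — `missing_paths` is a sketch for this thread, not part of hbck or any HBase tool:

```python
import re

# Matches the HDFS path in lines like:
#   java.io.FileNotFoundException: File does not exist: /hbase/entry/.../file
FNF_RE = re.compile(r'File does not exist:\s*(\S+)')

def missing_paths(log_text):
    """Return the distinct paths reported as missing, in first-seen order,
    so each can then be checked against HDFS."""
    seen, out = set(), []
    for path in FNF_RE.findall(log_text):
        if path not in seen:
            seen.add(path)
            out.append(path)
    return out
```

Since the same FileNotFoundException shows up in every nested "Caused by", de-duplicating keeps the list short even on a 42 MB log.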