Had the table already been disabled when you hit ctrl-c?

2011/3/8, M.Deniz OKTAR <deniz.ok...@gmail.com>:
> Something new came up!
>
> I tried to truncate the 'usertable', which had ~12M entries.
>
> The shell stayed at "disabling table" for a long time. The process was still there, but there were no requests, so I quit by hitting ctrl-c.
>
> I then tried count 'usertable' to see if any data remained; the shell gave an error and one of the regionservers had a log such as the one below.
>
> The master logs were also similar (I tried to disable again, and the master log below is from that trial).
>
>
> Regionserver 2:
>
> 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>
> Masterserver:
> .
> .
> . (same thing)
> 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>
>
> --
> deniz
>
>
> On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>
>> Hi all,
>>
>> Thanks for the support. I've been trying to replicate the problem since this morning. Before doing that, I played with the configuration. I used to have only one user and had set all the permissions accordingly. Now I've followed the Cloudera manuals and set permissions for the hdfs and mapred users (and changed hbase-env.sh).
>>
>> I had 2 trials; in both, the Yahoo test failed because it was receiving lots of "0"s, but the region servers didn't die. At some points in the test (including when it failed), the HBase master gave exceptions about not being able to reach one of the servers. I also lost the ssh connection to that server, but after a while it recovered (as did the HMaster). The last thing in the regionserver logs was that it was going for a flush.
>>
>> I'll go over the tests again and provide you with clean log files from all servers (hadoop, hbase, namenode, masternode logs).
>>
>> If you have any suggestions or directions to help me better diagnose the problem, that would be lovely.
>>
>> btw: these servers do not have ECC memory, but I do not see any corruption in the data.
>>
>> Thanks!
>>
>> --
>> deniz
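A note on the interrupted truncate at the top of this thread: the shell's truncate command is essentially disable + drop + re-create, so killing it mid-way can leave 'usertable' with its regions already closed but the table never dropped. The NotServingRegionException and UnknownScannerException that count ran into are what a scan against closed regions looks like; they do not by themselves mean data was lost. A minimal recovery sketch, assuming a 0.90-era shell and the table name from this thread (illustrative commands, not a transcript):

    # In the HBase shell (bin/hbase shell):
    disable 'usertable'    # re-run the interrupted disable so the remaining regions finish closing
    truncate 'usertable'   # or drop and re-create by hand once the disable completes
    count 'usertable'      # only meaningful once the table is back online

    # From the OS shell: consistency report of .META. vs. actual region assignments
    bin/hbase hbck

If a region stays stuck in PENDING_CLOSE, the master log above shows it keeps forcing the unassign on its own; giving it time, or restarting the regionserver that owned the stuck region, is the usual next step on this code line.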
>>
>> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>>> Along with a bigger portion of the log, it might be good to check if there's anything in the .out file that looks like a JVM error.
>>>
>>> J-D
>>>
>>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> > I ran every kind of benchmark I could find on those machines and they seemed to work fine. I did memory/disk tests too.
>>> >
>>> > The master node and the other nodes log some information and exceptions about not being able to reach the dead node.
>>> >
>>> > Btw, sometimes the process does not die but loses the connection.
>>> >
>>> > --
>>> > deniz
>>> >
>>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <st...@duboce.net> wrote:
>>> >
>>> >> I'm stumped. I have nothing to go on when there are no death throes or complaints. Is this hardware healthy for sure? Does other stuff run w/o issue?
>>> >> St.Ack
>>> >>
>>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> >> > I don't know if it's normal, but I see a lot of '0's in the test results when it tends to fail, such as:
>>> >> >
>>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>>> >> >
>>> >> > --
>>> >> > deniz
>>> >> >
>>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> >> >
>>> >> >> Hi,
>>> >> >>
>>> >> >> Thanks for the effort, answers below:
>>> >> >>
>>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <st...@duboce.net> wrote:
>>> >> >>
>>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> >> >>> > We have a 5-node cluster, 4 of them being region servers. I am running a custom workload with YCSB, and while the data is loading (heavy insert) at least one of the region servers dies after about 600000 operations.
>>> >> >>>
>>> >> >>> Tell us the character of your 'custom workload' please.
>>> >> >>>
>>> >> >> The workload is below; the part that fails is the loading part (-load), which inserts all the records first:
>>> >> >>
>>> >> >> recordcount=10000000
>>> >> >> operationcount=3000000
>>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>>> >> >>
>>> >> >> readallfields=true
>>> >> >>
>>> >> >> readproportion=0.5
>>> >> >> updateproportion=0.1
>>> >> >> scanproportion=0
>>> >> >> insertproportion=0.35
>>> >> >> readmodifywriteproportion=0.05
>>> >> >>
>>> >> >> requestdistribution=zipfian
>>> >> >>
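For reference, this is roughly how the load phase of a workload like the one above was launched with the 2011-era YCSB client. The classpath, HBase jar version, column family, and thread count are assumptions for illustration, not details taken from the thread; during -load only inserts run, which is the phase that was killing regionservers here:

    # Hypothetical YCSB load invocation (paths and jar names are assumptions)
    java -cp "build/ycsb.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.1.jar:$HBASE_HOME/lib/*" \
         com.yahoo.ycsb.Client -load \
         -db com.yahoo.ycsb.db.HBaseClient \
         -P workloads/custom_workload \
         -p columnfamily=family \
         -threads 10 -s > load.dat

The "1196 sec: 7394901 operations; 0 current ops/sec" lines quoted earlier come from the -s status reporting: the client is still alive but saw no acknowledged operations in that interval, which points at the regionserver it was talking to being unreachable rather than at a client-side failure.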
>>> >> >>> > There are no abnormalities in the logs as far as I can see; the only common point is that all of them (in different trials, different region servers fail) request a flush as their last log entries, given below. The .out files are empty. I am looking at the /var/log/hbase folder for logs. Running the latest Sun Java 6. I couldn't find any logs that indicate a problem with Java. I tried the tests with OpenJDK and had the same results.
>>> >> >>>
>>> >> >>> It's strange that flush is the last thing in your log. The process is dead? We are exiting w/o a note in the logs? That's unusual. We usually scream loudly when dying.
>>> >> >>>
>>> >> >> Yes, that's the strange part. The last line is a flush, as if the process never failed. Yes, the process is dead and HBase cannot see the node.
>>> >> >>
>>> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users and am certain that they are correct.
>>> >> >>>
>>> >> >>> The first line in an hbase log prints out the ulimit it sees. You might check that the hbase process for sure is picking up your ulimit setting.
>>> >> >>>
>>> >> >> That was a mistake I made a couple of days ago; I checked it with cat /proc/<pid of regionserver>/limits, and all related users like 'hbase' have those limits. Checked the logs:
>>> >> >>
>>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>>> >> >> ulimit -n 52768
>>> >> >>
>>> >> >>> > Also in the kernel logs, there are no apparent problems.
>>> >> >>>
>>> >> >>> (The mystery compounds)
>>> >> >>>
>>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
>>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
>>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
>>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>>> >> >>> > -end of log file-
>>> >> >>> > ---
>>> >> >>>
>>> >> >>> Nothing more?
>>> >> >>>
>>> >> >> No, nothing after that. But there are quite a lot of logs before that; I can send them if you'd like.
>>> >> >>
>>> >> >>> Thanks,
>>> >> >>> St.Ack
>>> >> >>
>>> >> >> Thanks a lot!
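Since the regionserver exits with nothing in its .log and an empty .out, the usual suspects for this kind of silent death are a JVM crash (which leaves an hs_err_pid*.log in the process's working directory) and the kernel OOM killer (which logs only to the kernel ring buffer / syslog). A short checklist as shell commands; the paths and the Hadoop property name reflect common 2011-era layouts and are assumptions rather than quotes from the thread:

    # File-descriptor limit the running regionserver actually has
    cat /proc/$(pgrep -f HRegionServer)/limits | grep -i 'open files'

    # The xceiver limit has to be raised on the datanodes
    # (note the historically misspelled property name)
    grep -A1 dfs.datanode.max.xcievers $HADOOP_HOME/conf/hdfs-site.xml

    # Evidence of a silent JVM death
    ls -l /var/log/hbase/*.out                          # J-D's suggestion: any JVM error text?
    find / -name 'hs_err_pid*.log' -mtime -1 2>/dev/null   # recent HotSpot crash reports
    dmesg | grep -i -E 'out of memory|killed process'   # kernel OOM killer

If dmesg shows the OOM killer taking out the java process, the dropped ssh sessions under heavy load described earlier would fit too, and shrinking the regionserver heap or the YCSB client pressure is the usual remedy; if an hs_err file turns up instead, it names the crashing frame.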
--
Sent from my mobile device

Thanks & Best regards
jiajun