Had the table already been disabled when you hit ctrl-c?

2011/3/8, M.Deniz OKTAR <deniz.ok...@gmail.com>:
> Something new came up!
>
> I tried to truncate the 'usertable', which had ~12M entries.
>
> The shell stayed at "disabling table" for a long time. The process was still there, but there were no requests, so I quit by hitting ctrl-c.
>
> I then tried count 'usertable' to see if any data remained; the shell gave an error and one of the regionservers had a log such as the one below.
>
> The master logs were also similar (I tried to disable again, and the master log below is from that trial).
>
>
> Regionserver 2:
>
> 2011-03-08 16:47:24,852 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:27,765 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=39.63 MB, free=4.65 GB, max=4.68 GB, blocks=35, accesses=376070, hits=12035, hitRatio=3.20%%, cachingAccesses=12070, cachingHits=12035, cachingHitsRatio=99.71%%, evictions=0, evicted=0, evictedPerRun=NaN
> 2011-03-08 16:47:28,863 DEBUG org.apache.hadoop.hbase.regionserver.HRegionServer: NotServingRegionException; Region is not online: usertable,,1299593459085.d37bb124feaf8f5d08e51064a36596f8.
> 2011-03-08 16:47:28,865 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
> org.apache.hadoop.hbase.UnknownScannerException: Name: -1
>         at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:1795)
>         at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>
>
> Masterserver:
> .
> .
> . (same thing)
> 2011-03-08 16:51:34,679 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=usertable,user1948102037,1299592536693.d5bae6bbe54aa182e1215ab626e0011e.
>
>
> --
> deniz
>
>
> On Tue, Mar 8, 2011 at 4:34 PM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>
>> Hi all,
>>
>> Thanks for the support. I've been trying to replicate the problem since this morning. Before doing that, I played with the configuration. I used to have only one user and had set all the permissions accordingly. Now I've followed the Cloudera manuals and set permissions for the hdfs and mapred users (and changed hbase-env.sh).
>>
>> I had 2 trials; in both, the Yahoo test failed because it was receiving lots of "0"s, but the region servers didn't die. At some points in the test (including when it failed), the HBase master gave exceptions about not being able to reach one of the servers. I also lost the ssh connection to that server, but after a while it recovered (as did the HMaster). The last thing in the regionserver logs was that it was going for a flush.
>>
>> I'll go over the tests again and provide you with clean log files from all servers (hadoop, hbase, namenode, masternode logs).
>>
>> If you have any suggestions or directions to help me better diagnose the problem, that would be lovely.
>>
>> btw: these servers do not have ECC memory, but I do not see any corruption in the data.
>>
>> Thanks!
>>
>> --
>> deniz
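A note on the interrupted truncate at the top of this thread: the shell's truncate command is essentially disable + drop + re-create, so killing it mid-way can leave 'usertable' with its regions already closed but the table never dropped. The NotServingRegionException and UnknownScannerException that count ran into are what a scan against closed regions looks like; they do not by themselves mean data was lost. A minimal recovery sketch, assuming a 0.90-era shell and the table name from this thread (illustrative commands, not a transcript):

    # In the HBase shell (bin/hbase shell):
    disable 'usertable'    # re-run the interrupted disable so the remaining regions finish closing
    truncate 'usertable'   # or drop and re-create by hand once the disable completes
    count 'usertable'      # only meaningful once the table is back online

    # From the OS shell: consistency report of .META. vs. actual region assignments
    bin/hbase hbck

If a region stays stuck in PENDING_CLOSE, the master log above shows it keeps forcing the unassign on its own; giving it time, or restarting the regionserver that owned the stuck region, is the usual next step on this code line.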
>>
>> On Mon, Mar 7, 2011 at 7:47 PM, Jean-Daniel Cryans <jdcry...@apache.org> wrote:
>>
>>> Along with a bigger portion of the log, it might be good to check if there's anything in the .out file that looks like a JVM error.
>>>
>>> J-D
>>>
>>> On Mon, Mar 7, 2011 at 9:22 AM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> > I ran every kind of benchmark I could find on those machines and they seemed to work fine. I did memory/disk tests too.
>>> >
>>> > The master node and the other nodes log some information and exceptions about not being able to reach the dead node.
>>> >
>>> > Btw, sometimes the process does not die but loses the connection.
>>> >
>>> > --
>>> > deniz
>>> >
>>> > On Mon, Mar 7, 2011 at 7:19 PM, Stack <st...@duboce.net> wrote:
>>> >
>>> >> I'm stumped. I have nothing to go on when there are no death throes or complaints. Is this hardware healthy for sure? Does other stuff run w/o issue?
>>> >> St.Ack
>>> >>
>>> >> On Mon, Mar 7, 2011 at 8:48 AM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> >> > I don't know if it's normal, but I see a lot of '0's in the test results when it tends to fail, such as:
>>> >> >
>>> >> > 1196 sec: 7394901 operations; 0 current ops/sec;
>>> >> >
>>> >> > --
>>> >> > deniz
>>> >> >
>>> >> > On Mon, Mar 7, 2011 at 6:46 PM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> >> >
>>> >> >> Hi,
>>> >> >>
>>> >> >> Thanks for the effort, answers below:
>>> >> >>
>>> >> >> On Mon, Mar 7, 2011 at 6:08 PM, Stack <st...@duboce.net> wrote:
>>> >> >>
>>> >> >>> On Mon, Mar 7, 2011 at 5:43 AM, M.Deniz OKTAR <deniz.ok...@gmail.com> wrote:
>>> >> >>> > We have a 5-node cluster, 4 of them being region servers. I am running a custom workload with YCSB, and while the data is loading (heavy insert) at least one of the region servers dies after about 600000 operations.
>>> >> >>>
>>> >> >>> Tell us the character of your 'custom workload' please.
>>> >> >>>
>>> >> >> The workload is below; the part that fails is the loading part (-load), which inserts all the records first:
>>> >> >>
>>> >> >> recordcount=10000000
>>> >> >> operationcount=3000000
>>> >> >> workload=com.yahoo.ycsb.workloads.CoreWorkload
>>> >> >>
>>> >> >> readallfields=true
>>> >> >>
>>> >> >> readproportion=0.5
>>> >> >> updateproportion=0.1
>>> >> >> scanproportion=0
>>> >> >> insertproportion=0.35
>>> >> >> readmodifywriteproportion=0.05
>>> >> >>
>>> >> >> requestdistribution=zipfian
>>> >> >>
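For reference, this is roughly how the load phase of a workload like the one above was launched with the 2011-era YCSB client. The classpath, HBase jar version, column family, and thread count are assumptions for illustration, not details taken from the thread; during -load only inserts run, which is the phase that was killing regionservers here:

    # Hypothetical YCSB load invocation (paths and jar names are assumptions)
    java -cp "build/ycsb.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.1.jar:$HBASE_HOME/lib/*" \
         com.yahoo.ycsb.Client -load \
         -db com.yahoo.ycsb.db.HBaseClient \
         -P workloads/custom_workload \
         -p columnfamily=family \
         -threads 10 -s > load.dat

The "1196 sec: 7394901 operations; 0 current ops/sec" lines quoted earlier come from the -s status reporting: the client is still alive but saw no acknowledged operations in that interval, which points at the regionserver it was talking to being unreachable rather than at a client-side failure.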
>>> >> >>> > There are no abnormalities in the logs as far as I can see; the only common point is that all of them (in different trials, different region servers fail) request a flush as their last log entries, given below. The .out files are empty. I am looking at the /var/log/hbase folder for logs. Running the latest Sun Java 6. I couldn't find any logs that indicate a problem with Java. I tried the tests with OpenJDK and had the same results.
>>> >> >>>
>>> >> >>> It's strange that flush is the last thing in your log. The process is dead? We are exiting w/o a note in the logs? That's unusual. We usually scream loudly when dying.
>>> >> >>>
>>> >> >> Yes, that's the strange part. The last line is a flush, as if the process never failed. Yes, the process is dead and HBase cannot see the node.
>>> >> >>
>>> >> >>> > I have set ulimits (50000) and xceivers (20000) for multiple users and am certain that they are correct.
>>> >> >>>
>>> >> >>> The first line in an hbase log prints out the ulimit it sees. You might check that the hbase process for sure is picking up your ulimit setting.
>>> >> >>>
>>> >> >> That was a mistake I made a couple of days ago; I checked it with cat /proc/<pid of regionserver>/limits, and all related users like 'hbase' have those limits. Checked the logs:
>>> >> >>
>>> >> >> Mon Mar 7 06:41:15 EET 2011 Starting regionserver on test-1
>>> >> >> ulimit -n 52768
>>> >> >>
>>> >> >>> > Also in the kernel logs, there are no apparent problems.
>>> >> >>>
>>> >> >>> (The mystery compounds)
>>> >> >>>
>>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction requested for usertable,user1030079237,1299502934627.257739740f58da96d5c5ef51a7d3efc3. because regionserver60020.cacheFlusher; priority=3, compaction queue size=18
>>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: NOT flushing memstore for region usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc., flushing=false, writesEnabled=false
>>> >> >>> > 2011-03-07 15:07:58,301 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Started memstore flush for usertable,user1662209069,1299502135191.9fa929e6fb439843cffb604dea3f88f6., current region memstore size 68.6m
>>> >> >>> > 2011-03-07 15:07:58,310 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Flush requested on usertable,user1601881548,1299502135191.f8efb9aa0922fa8a6a53fc49b8155ebc.
>>> >> >>> > -end of log file-
>>> >> >>> > ---
>>> >> >>>
>>> >> >>> Nothing more?
>>> >> >>>
>>> >> >> No, nothing after that. But there are quite a lot of logs before that; I can send them if you'd like.
>>> >> >>
>>> >> >>> Thanks,
>>> >> >>> St.Ack
>>> >> >>
>>> >> >> Thanks a lot!
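Since the regionserver exits with nothing in its .log and an empty .out, the usual suspects for this kind of silent death are a JVM crash (which leaves an hs_err_pid*.log in the process's working directory) and the kernel OOM killer (which logs only to the kernel ring buffer / syslog). A short checklist as shell commands; the paths and the Hadoop property name reflect common 2011-era layouts and are assumptions rather than quotes from the thread:

    # File-descriptor limit the running regionserver actually has
    cat /proc/$(pgrep -f HRegionServer)/limits | grep -i 'open files'

    # The xceiver limit has to be raised on the datanodes
    # (note the historically misspelled property name)
    grep -A1 dfs.datanode.max.xcievers $HADOOP_HOME/conf/hdfs-site.xml

    # Evidence of a silent JVM death
    ls -l /var/log/hbase/*.out                          # J-D's suggestion: any JVM error text?
    find / -name 'hs_err_pid*.log' -mtime -1 2>/dev/null   # recent HotSpot crash reports
    dmesg | grep -i -E 'out of memory|killed process'   # kernel OOM killer

If dmesg shows the OOM killer taking out the java process, the dropped ssh sessions under heavy load described earlier would fit too, and shrinking the regionserver heap or the YCSB client pressure is the usual remedy; if an hs_err file turns up instead, it names the crashing frame.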
--
Sent from my mobile device

Thanks & Best regards
jiajun