I took a quick look, Slava (thanks for sending the files). Here are a few notes:

+ The logs are from after the damage is done; the transition from good to bad is missing. If I could see that, it would help.

+ What does seem plain is that your HDFS is very sick. See this from the head of one of the regionserver logs:

2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
java.io.IOException: Could not get block locations. Aborting...


If HDFS is ailing, HBase is too. In fact, the regionservers will shut themselves down to protect against damaging or losing data:

2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart

So, what's up with your HDFS? Not enough space allotted? What happens if you run "./bin/hadoop fsck /"? Does that give you a clue as to what happened? Dig into the datanode and namenode logs and look for where the exceptions start.
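For example, the checks might look like this from the Hadoop home directory (the log paths and naming below assume a default 0.18-era layout; adjust to your install):

```shell
# Run an HDFS filesystem check; it reports missing, corrupt,
# and under-replicated blocks for the whole namespace.
./bin/hadoop fsck /

# Find where the trouble starts in the daemon logs: the first
# exception is usually more telling than the later cascade.
grep -n "Exception" logs/hadoop-*-datanode-*.log | head
grep -n "Exception" logs/hadoop-*-namenode-*.log | head
```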

+ The suse regionserver log had garbage in it.

St.Ack


Slava Gorelik wrote:
Hi.
My happiness was very short :-( After I successfully added 1M rows (50k each row) I tried to add 10M rows, and after 3-4 working hours it started dying. First one region server died, then another, and eventually the whole cluster was dead.

I attached log files (relevant parts, archived) from the region servers and from the master.

Best Regards.



On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

    Hi.
    So far so good: after changing the file descriptors limit and the
    dfs.datanode.socket.write.timeout and dfs.datanode.max.xcievers
    settings, my cluster works stably.
    Thank You and Best Regards.

    P.S. Regarding the missing delete-multiple-columns functionality, I
    filed a JIRA: https://issues.apache.org/jira/browse/HBASE-961



    On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:

        Slava Gorelik wrote:

            Hi. I haven't tried them yet; I'll try tomorrow morning. In
            general the cluster is working well; the problems begin when
            I try to add 10M rows. It happened after 1.2M.

        Anything else running besides the regionservers or datanodes
        that would suck resources?  When datanodes begin to slow, we
        begin to see the issue Jean-Adrien's configurations address.
        Are you uploading using MapReduce?  Are TTs running on the same
        nodes as the datanode and regionserver?  How are you doing the
        upload?  Describe what your uploader looks like (sorry if
        you've already done this).


            I already changed the limit on file descriptors,

        Good.
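(For anyone following the thread, a quick way to verify that change took effect; the `hadoop` user name in the limits.conf example is hypothetical:)

```shell
# Show the current per-process open-file limit for this shell/user.
# Run it as the user that launches the HBase/Hadoop daemons.
ulimit -n

# To raise the limit persistently on Linux, add lines like these to
# /etc/security/limits.conf (user name and value are examples):
#   hadoop  soft  nofile  32768
#   hadoop  hard  nofile  32768
```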


             I'll try
            to change the properties:
             <property> <name>dfs.datanode.socket.write.timeout</name>
             <value>0</value>
            </property>

            <property>
             <name>dfs.datanode.max.xcievers</name>
             <value>1023</value>
            </property>

        Yeah, try it.


            And I'll let you know. Are there any other prescriptions? Did
            I miss something?

            BTW, off topic, but I sent an e-mail to the list recently and
            I can't see it: is it possible to delete multiple columns by
            regex, for example column_name_* ?
        Not that I know of.  If it's not in the API, it should be.
        Mind filing a JIRA?

        Thanks Slava.
        St.Ack



