Hi. HDFS capacity is about 800 GB (8 datanodes) and current usage is about 30 GB. This is after a total re-format of HDFS that was done an hour before.
BTW, the logs I sent are from the first exception that I found in them.
Best Regards.

On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:
> I took a quick look Slava (thanks for sending the files). Here are a few
> notes:
>
> + The logs are from after the damage is done; the transition from good to
>   bad is missing. If I could see that, that would help.
> + But what seems to be plain is that your HDFS is very sick. See this
>   from the head of one of the regionserver logs:
>
> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
>         at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)
>
> 2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
> 2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
> java.io.IOException: Could not get block locations. Aborting...
>
> If HDFS is ailing, hbase is too. In fact, the regionservers will shut
> themselves down to protect themselves against damaging or losing data:
>
> 2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart
>
> So, what's up with your HDFS? Not enough space allotted? What happens if
> you run "./bin/hadoop fsck /"? Does that give you a clue as to what
> happened? Dig in the datanode and namenode logs. Look for where the
> exceptions start. It might give you a clue.
>
> + The suse regionserver log had garbage in it.
>
> St.Ack
>
>
> Slava Gorelik wrote:
>
>> Hi.
>> My happiness was very short :-( After I successfully added 1M rows (50k
>> each row), I tried to add 10M rows. After 3-4 working hours it started
>> dying. First one regionserver died, then another one, and eventually the
>> whole cluster was dead.
>>
>> I attached log files (the relevant parts, archived) from the region
>> servers and from the master.
>>
>> Best Regards.
>>
>>
>> On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:
>>
>> Hi.
>> So far so good: after changing the file descriptor limit
>> and dfs.datanode.socket.write.timeout, dfs.datanode.max.xcievers,
>> my cluster works stably.
>> Thank You and Best Regards.
>>
>> P.S. Regarding the missing delete-multiple-columns functionality, I
>> filed a JIRA: https://issues.apache.org/jira/browse/HBASE-961
>>
>>
>> On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:
>>
>> Slava Gorelik wrote:
>>
>> Hi. Haven't tried them yet; I'll try tomorrow morning. In general the
>> cluster is working well; the problems begin if I'm trying to add 10M
>> rows. It happened after 1.2M.
>>
>> Anything else running beside the regionserver or datanodes that would
>> suck resources? When datanodes begin to slow, we begin to see the
>> issue Jean-Adrien's configurations address. Are you uploading using
>> MapReduce? Are TTs running on the same nodes as the datanode and
>> regionserver? How are you doing the upload? Describe what your
>> uploader looks like (sorry if you've already done this).
>>
>> I already changed the limit of file descriptors.
>>
>> Good.
>>
>> I'll try to change the properties:
>>
>> <property>
>>   <name>dfs.datanode.socket.write.timeout</name>
>>   <value>0</value>
>> </property>
>>
>> <property>
>>   <name>dfs.datanode.max.xcievers</name>
>>   <value>1023</value>
>> </property>
>>
>> Yeah, try it.
>>
>> And let you know. Are there any other prescriptions? Did I miss
>> something?
>>
>> BTW, off topic, but I sent an e-mail recently to the list and I can't
>> see it: is it possible to delete multiple columns in any way by regex,
>> for example column_name_* ?
>>
>> Not that I know of. If it's not in the API, it should be. Mind filing
>> a JIRA?
>>
>> Thanks Slava.
>> St.Ack
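
For reference, the HDFS health check stack suggests above can be run as below. This is a sketch against a live 0.18-era Hadoop install (paths and flags are the standard ones for that version; the extra flags are optional but make missing or corrupt blocks visible per file):

```shell
# From the Hadoop home directory: walk the namespace and report health.
# -files/-blocks/-locations print per-file block placement, so
# under-replicated or missing blocks show up explicitly.
./bin/hadoop fsck / -files -blocks -locations

# Cluster-wide capacity and per-datanode usage summary; useful for
# confirming whether space ran out on individual datanodes.
./bin/hadoop dfsadmin -report
```

Both commands need a running cluster and only read metadata, so they are safe to run while diagnosing.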
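
Since the thread turned on the file descriptor limit, a quick sanity check that the raised limit actually took effect can be done from the shell the daemons are started from. This is a sketch; the 32768 figure and the "hbase" user name are common examples, not values from this thread:

```shell
# Print the open-file limit for the current shell and its children.
# HBase regionservers and HDFS datanodes should see the raised value
# (e.g. 32768), not the common default of 1024.
ulimit -n

# The limit is typically raised per-user in /etc/security/limits.conf:
#   hbase  -  nofile  32768
# ("hbase" here is an assumed service user; adjust to your setup.)
```

Run it as the same user that launches the daemons; a root shell can show a different limit than the service account gets.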
