Re: Regionserver fails to serve region

stack Thu, 30 Oct 2008 13:26:28 -0700

Can you put them someplace that I can pull them?

I took another look at your logs. I see that a region is missing files. That means it will never open and just keep trying. Grep your logs for FileNotFound. You'll see this:

hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: File does not exist: hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906/data
hbase-clmanager-regionserver-ILREDHAT012.log:java.io.FileNotFoundException: File does not exist: hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637/data
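As a rough sketch of the grep suggested here, the helper below pulls the distinct missing paths out of one or more region server logs. The function name `missing_files` is made up for illustration, and it assumes the log lines carry the "File does not exist:" text shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: list each HDFS path that the logs report as missing.
# Usage: missing_files hbase-*-regionserver-*.log
missing_files() {
  # -h drops the file-name prefix, -o prints only the matched text.
  grep -h -o 'File does not exist: *[^ ]*' "$@" \
    | sed 's/^File does not exist: *//' \
    | sort -u
}
```

Each path is printed once, however many times the region server retried the open.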

Try shutting down and removing these files. Remove the following directories:


hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/647541142630058906
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/mapfiles/2243545870343537637
hdfs://X:9000/hbase/BizDB/735893330/BusinessObject/info/2243545870343537637

Then retry restarting.
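A minimal sketch of the removal step, assuming a Hadoop release of this period where the recursive delete is `fs -rmr` (later releases spell it `fs -rm -r`). The helper name `removal_commands` is hypothetical. It only prints the commands so they can be reviewed before anything is deleted; run them with HBase shut down, as described above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the removal commands for the
# mapfiles/ and info/ directories of each broken store file id.
# Usage: removal_commands STORE_DIR FILE_ID...
removal_commands() {
  store="$1"
  shift
  for id in "$@"; do
    printf './bin/hadoop fs -rmr %s/mapfiles/%s\n' "$store" "$id"
    printf './bin/hadoop fs -rmr %s/info/%s\n' "$store" "$id"
  done
}

# The two store file ids named in this thread.
removal_commands hdfs://X:9000/hbase/BizDB/735893330/BusinessObject \
  647541142630058906 2243545870343537637
```

Printing first is a deliberate choice: a mistyped id in a recursive delete is hard to undo.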

You can try to figure out how these files got lost by going back through your history.


St.Ack



Slava Gorelik wrote:

Michael, I still have the problem, but the log files are very big (50MB each);
even compressed they are bigger than the limit for this mailing list.
Most of the problems happen during compaction (I see it in the log). Maybe
I can send some parts of the logs?

Best Regards.

On Thu, Oct 30, 2008 at 8:49 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

Sorry, my mistake, I did it for the wrong user name. Thanks, updating now;
I will try again soon.


On Thu, Oct 30, 2008 at 8:39 PM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

Hi. Very strange, I see in limits.conf that it's upped.
I attached the limits.conf; please have a look, maybe I did it wrong.

Best Regards.


On Thu, Oct 30, 2008 at 7:52 PM, stack <[EMAIL PROTECTED]> wrote:

Thanks for the logs, Slava. I notice that you have not upped the ulimit
on your cluster. See the head of your logs where we print out the ulimit.
It's 1024. This could be one cause of your grief, especially when you
seemingly have many regions (>1000). Please try upping it.
St.Ack
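A small sketch of checking the open-file limit. It has to be run as the same user that starts Hadoop and HBase, since a limit raised for a different user name has no effect on the daemons. The helper name `check_nofile` and the threshold 32768 are illustrative assumptions, not values taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: compare an open-file limit against a minimum.
# Usage: check_nofile CURRENT_LIMIT MINIMUM
check_nofile() {
  limit="$1"
  min="$2"
  if [ "$limit" = "unlimited" ] || [ "$limit" -ge "$min" ]; then
    echo "ok"
  else
    echo "too low: $limit < $min"
  fi
}

# Check the limit of the current shell user (threshold is an assumption).
check_nofile "$(ulimit -n)" 32768

# A matching /etc/security/limits.conf entry would look like the line
# below, where "hadoop" stands in for whichever user runs the daemons:
#   hadoop  -  nofile  32768
```

After editing limits.conf the user has to log in again before `ulimit -n` reflects the new value.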




Slava Gorelik wrote:

Hi.
I enabled DEBUG log level and now I'm sending all logs (archived),
including the fsck run result.
Today my program starts to fail a couple of minutes after it begins. It's
very easy to reproduce the problem; the cluster has become very unstable.

Best Regards.

On Tue, Oct 28, 2008 at 11:05 PM, stack <[EMAIL PROTECTED]> wrote:

See http://wiki.apache.org/hadoop/Hbase/FAQ#5

St.Ack

Slava Gorelik wrote:

Hi. First of all, I want to say thank you for your assistance!

DEBUG on hadoop or hbase? And how can I enable it?
fsck said that HDFS is healthy.

Best Regards and Thank You

On Tue, Oct 28, 2008 at 8:45 PM, stack <[EMAIL PROTECTED]> wrote:

Slava Gorelik wrote:

Hi. HDFS capacity is about 800GB (8 datanodes) and the current usage is
about 30GB. This is after a total re-format of the HDFS that was made an
hour before.

BTW, the logs I sent start from the first exception that I found in them.
Best Regards.

Please enable DEBUG and retry. Send me all logs. What does the fsck on
HDFS say? There is something seriously wrong with your cluster that you are
having so much trouble getting it running. Let's try and figure it out.

St.Ack

On Tue, Oct 28, 2008 at 7:12 PM, stack <[EMAIL PROTECTED]> wrote:

I took a quick look, Slava (thanks for sending the files). Here are a few
notes:

+ The logs are from after the damage is done; the transition from good to
bad is missing. If I could see that, it would help.
+ But what seems plain is that your HDFS is very sick. See this from the
head of one of the regionserver logs:

2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2349)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1735)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1912)

2008-10-27 23:41:12,682 WARN org.apache.hadoop.dfs.DFSClient: Error Recovery for block blk_-5188192041705782716_60000 bad datanode[0]
2008-10-27 23:41:12,685 ERROR org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction/Split failed for region BizDB,1.1.PerfBO1.f2188a42-5eb7-4a6a-82ef-2da0d0ea4ce0,1225136351518
java.io.IOException: Could not get block locations. Aborting...

If HDFS is ailing, hbase is too. In fact, the regionservers will shut
themselves down to protect against damaging or losing data:

2008-10-27 23:41:12,688 FATAL org.apache.hadoop.hbase.regionserver.Flusher: Replay of hlog required. Forcing server restart

So, what's up with your HDFS? Not enough space allotted? What happens if
you run "./bin/hadoop fsck /"? Does that give you a clue as to what
happened? Dig in the datanode and namenode logs. Look for where the
exceptions start. It might give you a clue.

+ The suse regionserver log had garbage in it.

St.Ack
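One possible way to "look for where the exceptions start" across a set of datanode, namenode, and regionserver logs is sketched below. The helper name `first_exception` is made up for illustration, and matching on the literal word "Exception" is an assumption about what counts as the start of trouble.

```shell
#!/bin/sh
# Hypothetical helper: for each log, print the file name, line number,
# and text of the first line that mentions an exception.
# Usage: first_exception LOGFILE...
first_exception() {
  for f in "$@"; do
    # -n prefixes the line number, -m 1 stops after the first match.
    if hit=$(grep -n -m 1 'Exception' "$f"); then
      printf '%s:%s\n' "$f" "$hit"
    fi
  done
}

# The file system check itself is the command quoted in the thread:
#   ./bin/hadoop fsck /
```

Comparing the timestamps of the first hit in each log shows which daemon started failing first.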

Slava Gorelik wrote:

Hi.
My happiness was very short :-( After I successfully added 1M rows (50k
each row) I tried to add 10M rows.
After 3-4 working hours it started dying. First one region server
died, then another one, and eventually the whole cluster was dead.

I attached log files (relevant part, archived) from the region servers and
from the master.

Best Regards.

On Mon, Oct 27, 2008 at 11:19 AM, Slava Gorelik <[EMAIL PROTECTED]> wrote:

Hi.
So far so good: after changing the file descriptors,
dfs.datanode.socket.write.timeout, and dfs.datanode.max.xcievers,
my cluster is stable.
Thank You and Best Regards.

P.S. Regarding the missing functionality for deleting multiple columns, I
filed a JIRA:
https://issues.apache.org/jira/browse/HBASE-961

On Sun, Oct 26, 2008 at 12:58 AM, Michael Stack <[EMAIL PROTECTED]> wrote:

Slava Gorelik wrote:

    Hi. I haven't tried them yet; I'll try tomorrow morning. In general the
    cluster is working well. The problems begin when I try to add 10M
    rows; it happened after 1.2M.

Anything else running besides the regionserver or datanodes that would
suck resources? When datanodes begin to slow, we begin to see the issue
Jean-Adrien's configurations address. Are you uploading using MapReduce?
Are TTs running on the same nodes as the datanode and regionserver? How
are you doing the upload? Describe what your uploader looks like (sorry
if you've already done this).

    I already changed the limit of file descriptors,

Good.

    I'll try to change the properties:
    <property>
      <name>dfs.datanode.socket.write.timeout</name>
      <value>0</value>
    </property>

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>1023</value>
    </property>

Yeah, try it.

    and let you know. Are there any other prescriptions? Did I miss
    something?

    BTW, off topic, but I sent an e-mail recently to the list and I can't
    see it:
    Is it possible to delete multiple columns in any way by regex, for
    example colum_name_*?

Not that I know of. If it's not in the API, it should be. Mind filing a
JIRA?

Thanks Slava.
St.Ack
