Now I am using an EC2 large instance, and I applied the xceiver settings
suggested by some of you. Things were fine for 4 days, but today one of
the datanodes shut down, and the region server on another node failed as
well.
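(For reference, the xceiver setting in question is the
dfs.datanode.max.xcievers property in hadoop-site.xml; 2047 below is just
the value commonly suggested on this list, as an illustration:)

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>2047</value>
  </property>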
The logs are as follows. This is for the region server going down:
/***** regionserver logs ****************/
2009-04-14 04:56:16,930 ERROR org.apache.hadoop.hbase.regionserver.StoreFileScanner: [...@2eb38825 closing scanner
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
        at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.close(DFSClient.java:1573)
        at java.io.FilterInputStream.close(FilterInputStream.java:155)
        at org.apache.hadoop.hbase.io.SequenceFile$Reader.close(SequenceFile.java:1598)
        at org.apache.hadoop.hbase.io.MapFile$Reader.close(MapFile.java:586)
        at org.apache.hadoop.hbase.regionserver.StoreFileScanner.close(StoreFileScanner.java:356)
        at org.apache.hadoop.hbase.regionserver.HStoreScanner.closeScanner(HStoreScanner.java:289)
        at org.apache.hadoop.hbase.regionserver.HStoreScanner.doClose(HStoreScanner.java:309)
        at org.apache.hadoop.hbase.regionserver.HStoreScanner.close(HStoreScanner.java:303)
        at org.apache.hadoop.hbase.regionserver.HRegion$HScanner.closeScanner(HRegion.java:2119)
        at org.apache.hadoop.hbase.regionserver.HRegion$HScanner.close(HRegion.java:2139)
        at org.apache.hadoop.hbase.regionserver.HRegionServer$ScannerListener.leaseExpired(HRegionServer.java:1759)
        at org.apache.hadoop.hbase.Leases.run(Leases.java:95)
2009-04-14 04:56:16,930 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020.leaseChecker closing leases
2009-04-14 04:56:16,930 INFO org.apache.hadoop.hbase.Leases: regionserver/0.0.0.0:60020.leaseChecker closed leases
2009-04-14 04:56:16,931 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Starting shutdown thread.
2009-04-14 04:56:16,932 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Shutdown thread complete
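(The leaseExpired frame in the trace above suggests a scanner lease timed
out after the filesystem was already closed. In case long pauses between
scanner calls are involved, I understand the lease period can be raised
in hbase-site.xml, something like the following, where 120000 ms is just
an illustrative value:)

  <property>
    <name>hbase.regionserver.lease.period</name>
    <value>120000</value>
  </property>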
/***** tasktracker logs ****************/
INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0040/attempt_200904120910_0040_r_000000_1/output/file.out in any of the configured local directories
2009-04-14 05:13:46,701 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0040/attempt_200904120910_0040_r_000000_1/output/file.out in any of the configured local directories
2009-04-14 05:13:51,741 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0040/attempt_200904120910_0040_r_000000_1/output/file.out in any of the configured local directories
2009-04-14 05:13:54,042 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_200904120910_0040_r_804408609 exited. Number of tasks it ran: 0
2009-04-14 05:13:56,751 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0040/attempt_200904120910_0040_r_000000_1/output/file.out in any of the configured local directories
2009-04-14 05:13:57,070 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200904120910_0040_r_000000_1 done; removing files.
2009-04-14 05:13:57,072 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 3
The datanode logs looked fine, and the datanode and tasktracker are
still alive on this node.
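(In case it helps in diagnosing: the DiskChecker error above means the
tracker could not find the task's intermediate output under any directory
listed in mapred.local.dir, so that setting, and the free space on the
disks it points to, may be worth checking; the path below is only a
placeholder:)

  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hadoop/mapred/local</value>
  </property>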
For the other node, the one that had shut down, the datanode logs also
looked fine, with no issue there, but at the tasktracker I get the
following:
2009-04-13 07:53:12,291 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0005/attempt_200904120910_0005_m_000000_0/output/file.out in any of the configured local directories
2009-04-13 07:53:13,592 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904120910_0005_m_000000_0 0.0% Starting Analysis...
2009-04-13 07:53:16,621 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904120910_0005_m_000000_0 0.0% Starting Analysis...
2009-04-13 07:53:17,352 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0005/attempt_200904120910_0005_m_000000_0/output/file.out in any of the configured local directories
2009-04-13 07:53:19,651 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904120910_0005_m_000000_0 0.0% Starting Analysis...
2009-04-13 07:53:22,361 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0005/attempt_200904120910_0005_m_000000_0/output/file.out in any of the configured local directories
2009-04-13 07:53:22,716 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904120910_0005_m_000000_0 0.0% Starting Analysis...
2009-04-13 07:53:25,722 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904120910_0005_m_000000_0 0.0% Starting Analysis...
2009-04-13 07:53:27,371 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_200904120910_0005/attempt_200904120910_0005_m_000000_0/output/file.out in any of the configured local directories
2009-04-13 07:53:28,734 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200904120910_0005_m_000000_0 0.0% Starting Analysis...
1. What could be the issue here?
2. Moreover, when I restart the datanode/region server, I am not able to
access all the tables. For one of the tables, after fetching 6000 rows, I
get the following exception:
java.io.IOException: Call to /10.254.74.127:60020 failed on local exception: java.io.EOFException
java.io.IOException: Call to /10.254.74.127:60020 failed on local exception: java.io.EOFException
java.io.IOException: Call to /10.254.74.127:60020 failed on local exception: java.io.EOFException
java.io.IOException: Call to /10.254.74.127:60020 failed on local exception: java.io.EOFException
java.io.IOException: Call to /10.254.74.127:60020 failed on local exception: java.io.EOFException
    from org/apache/hadoop/hbase/client/HTable.java:1704:in `hasNext'
    from sun.reflect.GeneratedMethodAccessor3:-1:in `invoke'
    from sun/reflect/DelegatingMethodAccessorImpl.java:25:in `invoke'
    from java/lang/reflect/Method.java:597:in `invoke'
    from org/jruby/javasupport/JavaMethod.java:250:in `invokeWithExceptionHandling'
    from org/jruby/javasupport/JavaMethod.java:219:in `invoke'
    from org/jruby/javasupport/JavaClass.java:416:in `execute'
    from org/jruby/internal/runtime/methods/SimpleCallbackMethod.java:67:in `call'
    from org/jruby/internal/runtime/methods/DynamicMethod.java:70:in `call'
    from org/jruby/runtime/CallSite.java:295:in `call'
    from org/jruby/evaluator/ASTInterpreter.java:646:in `callNode'
    from org/jruby/evaluator/ASTInterpreter.java:324:in `evalInternal'
    from org/jruby/evaluator/ASTInterpreter.java:1790:in `whileNode'
    from org/jruby/evaluator/ASTInterpreter.java:505:in `evalInternal'
    from org/jruby/evaluator/ASTInterpreter.java:620:in `blockNode'
    from org/jruby/evaluator/ASTInterpreter.java:318:in `evalInternal'
This happens even after restarting the region servers; it's as if I can't
access this table at all anymore.
Will increasing the DFS replication help? What should I do to avoid such
a thing?
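(If replication does turn out to matter, it is set via the
dfs.replication property in hadoop-site.xml; 3 is the usual default, and
as I understand it the value only applies to files written after the
change:)

  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>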
Thanks,
Raakhi Khatwani.
On Wed, Apr 8, 2009 at 10:50 PM, Andrew Purtell <[email protected]> wrote:
>
> I think you are confusing HDFS block balancing with HBase
> region deployment balancing.
>
> You ran 'hadoop balancer', correct? This does not have
> anything to do with HBase. It will move file blocks around
> on HDFS underneath HBase.
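> (Invoked from the Hadoop installation as, for example,
> "bin/hadoop balancer"; the optional -threshold argument
> controls how far from the cluster average a node's disk
> usage may drift before blocks are moved.)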
>
> Due to problems with caching of block locations done by
> HBase's HDFS client, it is not advisable to move blocks
> around under a running HBase cluster at this time.
>
> All load balancing functions of HBase are automatic. But
> they depend on having enough data in your table to make
> splits and then deploy the various regions around in a
> balanced manner.
>
> Hope this helps,
>
> - Andy
>
> > From: Rakhi Khatwani <[email protected]>
> > Subject: Re: Region Servers going down frequently
> > To: [email protected]
> > Date: Wednesday, April 8, 2009, 12:29 AM
> > Thanks, Amandeep
> >
> > One more question: I have mailed it earlier and I have
> > attached the snapshot along with that email.
> > I have noticed that all my requests are handled by one
> > region server...
> > Is there any way to balance the load?
> > And will balancing the load improve the performance?
> >
> > PS: I have tried using Hadoop load balancing, but after some
> > time some of my region servers shut down... I have even
> > gone through the archives and someone did report an
> > unstable cluster due to load balancing, so I really
> > don't know if I should turn load balancing on.
>