Do you have a GC log? What were you doing during the crash? And what are your GC options?
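
If GC logging is not already enabled, the usual JVM flags in conf/hbase-env.sh will produce one; the line below is only a sketch, and the log path is a placeholder:

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/path/to/gc-hbase.log"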

For the DN error, that is generally a network issue, because the DN received an
incomplete packet.

--Sent from my Sony mobile.
On Jun 5, 2013 8:10 PM, "Vimal Jain" <vkj...@gmail.com> wrote:

> Yes, that's true.
> There are errors in all three logs during the same period, i.e. the datanode,
> master, and region server logs.
> But I am unable to deduce the exact cause of the error.
> Can you please help me find the problem?
>
> So far I suspect the following:
> I have the default 1 GB heap allocated to each of the 3 processes, i.e.
> Master, RegionServer, and ZooKeeper.
> Both the Master and the RegionServer spent a long time in GC (as inferred from
> log lines like "slept more time than configured" etc.).
> Because of this, the ZooKeeper sessions of both the Master and the RegionServer
> timed out, and hence both went down.
>
> I am a newbie to HBase, so my findings may not be correct.
> I want to be 100% sure before increasing the heap space for both the Master and
> the RegionServer (to around 2 GB each) to solve this.
> At present I have restarted the cluster with the default heap space only (1 GB).
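> The change I am thinking of would go in conf/hbase-env.sh, roughly along these
> lines (illustrative values only, not applied yet):
>
>   # heap size for the HBase daemons, in MB by default
>   export HBASE_HEAPSIZE=2000
>   # or per daemon:
>   export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -Xmx2g"
>   export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -Xmx2g"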
>
>
>
> On Wed, Jun 5, 2013 at 5:23 PM, Azuryy Yu <azury...@gmail.com> wrote:
>
> > There are errors in your datanode log, and the error times match the error
> > times in the RS log.
> >
> > --Sent from my Sony mobile.
> > On Jun 5, 2013 5:06 PM, "Vimal Jain" <vkj...@gmail.com> wrote:
> >
> > > I don't think so, as I don't find any issues in the datanode logs.
> > > Also, there are a lot of exceptions like "session expired" and "slept more
> > > than configured time". What are these?
> > >
> > >
> > > On Wed, Jun 5, 2013 at 2:27 PM, Azuryy Yu <azury...@gmail.com> wrote:
> > >
> > > > Because your datanode 192.168.20.30 broke down, which led to the RS going
> > > > down.
> > > >
> > > >
> > > > On Wed, Jun 5, 2013 at 3:19 PM, Vimal Jain <vkj...@gmail.com> wrote:
> > > >
> > > > > Here are the complete logs:
> > > > >
> > > > > http://bin.cakephp.org/saved/103001 - HRegionServer
> > > > > http://bin.cakephp.org/saved/103000 - HMaster
> > > > > http://bin.cakephp.org/saved/103002 - Datanode
> > > > >
> > > > >
> > > > > On Wed, Jun 5, 2013 at 11:58 AM, Vimal Jain <vkj...@gmail.com> wrote:
> > > > >
> > > > > > Hi,
> > > > > > I have set up HBase in pseudo-distributed mode.
> > > > > > It was working fine for 6 days, but suddenly this morning both the
> > > > > > HMaster and HRegionServer processes went down.
> > > > > > I checked the logs of both Hadoop and HBase.
> > > > > > Please help here.
> > > > > > Here are the snippets:
> > > > > >
> > > > > > *Datanode logs:*
> > > > > > 2013-06-05 05:12:51,436 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_1597245478875608321_2818 java.io.EOFException: while trying to read 2347 bytes
> > > > > > 2013-06-05 05:12:51,442 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_1597245478875608321_2818 received exception java.io.EOFException: while trying to read 2347 bytes
> > > > > > 2013-06-05 05:12:51,442 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.20.30:50010, storageID=DS-1816106352-192.168.20.30-50010-1369314076237, infoPort=50075, ipcPort=50020):DataXceiver
> > > > > > java.io.EOFException: while trying to read 2347 bytes
> > > > > >
> > > > > >
> > > > > > *HRegion logs:*
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694929ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:51,045 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_1597245478875608321_2818 java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.20.30:44333 remote=/192.168.20.30:50010]
> > > > > > 2013-06-05 05:12:51,046 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 11695345ms instead of 10000000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:51,048 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1597245478875608321_2818 bad datanode[0] 192.168.20.30:50010
> > > > > > 2013-06-05 05:12:51,075 WARN org.apache.hadoop.hdfs.DFSClient: Error while syncing
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> > > > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > > 2013-06-05 05:12:51,110 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,180 FATAL org.apache.hadoop.hbase.regionserver.wal.HLog: Could not sync. Requesting close of hlog
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,183 ERROR org.apache.hadoop.hbase.regionserver.wal.HLog: Failed close of HLog writer
> > > > > > java.io.IOException: Reflection
> > > > > > Caused by: java.lang.reflect.InvocationTargetException
> > > > > > Caused by: java.io.IOException: DFSOutputStream is closed
> > > > > > 2013-06-05 05:12:51,184 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: Riding over HLog close failure! error count=1
> > > > > > 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server hbase.rummycircle.com,60020,1369877672964: regionserver:60020-0x13ef31264d00001 regionserver:60020-0x13ef31264d00001 received expired from ZooKeeper, aborting
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > > > > > 2013-06-05 05:12:52,557 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
> > > > > > 2013-06-05 05:12:52,621 INFO org.apache.hadoop.hbase.regionserver.SplitLogWorker: SplitLogWorker interrupted while waiting for task, exiting: java.lang.InterruptedException
> > > > > > java.io.InterruptedIOException: Aborting compaction of store cfp_info in region event_data,244630,1369879570539.3ebddcd11a3c22585a690bf40911cb1e. because user requested stop.
> > > > > > 2013-06-05 05:12:53,425 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:12:55,426 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:12:59,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:13:07,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > > 2013-06-05 05:13:07,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper delete failed after 3 retries
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase/rs/hbase.rummycircle.com,60020,1369877672964
> > > > > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)
> > > > > >     at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
> > > > > > 2013-06-05 05:13:07,436 ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file /hbase/.logs/hbase.rummycircle.com,60020,1369877672964/hbase.rummycircle.com%2C60020%2C1369877672964.1370382721642 : java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> > > > > > java.io.IOException: All datanodes 192.168.20.30:50010 are bad. Aborting...
> > > > > >     at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3096)
> > > > > >
> > > > > >
> > > > > > *HMaster logs:*
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4702394ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988731ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4988726ms instead of 300000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,701 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4698291ms instead of 10000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,711 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694502ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,714 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4694492ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:50,715 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 4695589ms instead of 60000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> > > > > > 2013-06-05 05:12:52,263 FATAL org.apache.hadoop.hbase.master.HMaster: Master server abort: loaded coprocessors are: []
> > > > > > 2013-06-05 05:12:52,465 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 0 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:52,561 ERROR org.apache.hadoop.hbase.master.HMaster: Region server hbase.rummycircle.com,60020,1369877672964 reported a fatal error:
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > > > > > 2013-06-05 05:12:53,970 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 1506 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:55,476 INFO org.apache.hadoop.hbase.master.ServerManager: Waiting for region servers count to settle; currently checked in 1, slept for 3012 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
> > > > > > 2013-06-05 05:12:56,981 INFO org.apache.hadoop.hbase.master.ServerManager: Finished waiting for region servers count to settle; checked in 1, slept for 4517 ms, expecting minimum of 1, maximum of 2147483647, master is running.
> > > > > > 2013-06-05 05:12:57,019 INFO org.apache.hadoop.hbase.catalog.CatalogTracker: Failed verification of -ROOT-,,0 at address=hbase.rummycircle.com,60020,1369877672964; java.io.EOFException
> > > > > > 2013-06-05 05:17:52,302 WARN org.apache.hadoop.hbase.master.SplitLogManager: error while splitting logs in [hdfs://192.168.20.30:9000/hbase/.logs/hbase.rummycircle.com,60020,1369877672964-splitting] installed = 19 but only 0 done
> > > > > > 2013-06-05 05:17:52,321 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x13ef31264d00000 master:60000-0x13ef31264d00000 received expired from ZooKeeper, aborting
> > > > > > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
> > > > > > java.io.IOException: Giving up after tries=1
> > > > > > Caused by: java.lang.InterruptedException: sleep interrupted
> > > > > > 2013-06-05 05:17:52,381 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master
> > > > > > java.lang.RuntimeException: HMaster Aborted
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Thanks and Regards,
> > > > > > Vimal Jain
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Thanks and Regards,
> > > > > Vimal Jain
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks and Regards,
> > > Vimal Jain
> > >
> >
>
>
>
> --
> Thanks and Regards,
> Vimal Jain
>
