Can you pastebin more of the ZooKeeper and master logs, so that we have more context?
Cheers

On Thu, Mar 23, 2017 at 12:04 AM, Margus Roo <mar...@roo.ee> wrote:

> At the same time, in the ZooKeeper log:
>
> 2017-03-23 02:01:33,004 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@357] - caught end of stream exception
> EndOfStreamException: Unable to read additional data from client sessionid 0x35af577e0ac0000, likely client has closed socket
>     at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
>     at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
>     at java.lang.Thread.run(Thread.java:745)
> 2017-03-23 02:01:35,482 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1007] - Closed socket connection for client /192.168.80.51:44456 which had sessionid 0x35af577e0ac0000
>
> Margus (margusja) Roo
> http://margus.roo.ee
> skype: margusja
> https://www.facebook.com/allan.tuuring
> +372 51 48 780
>
> On 23/03/2017 08:43, Ted Yu wrote:
>
>> Have you checked the ZooKeeper logs for clues?
>>
>> Cheers
>>
>> On Mar 22, 2017, at 11:30 PM, Margus Roo <mar...@roo.ee> wrote:
>>>
>>> Hi
>>>
>>> Almost every night the HBase master shuts down.
>>> In the error log I can see:
>>>
>>> gc.log:
>>> 2017-03-23T01:59:27.239+0200: 41752.366: [GC (Allocation Failure) 2017-03-23T01:59:27.239+0200: 41752.366: [ParNew: 159203K->11611K(166464K), 0.0115189 secs] 177260K->29669K(536512K), 0.0117362 secs] [Times: user=0.08 sys=0.00, real=0.01 secs]
>>> Heap
>>>  par new generation total 166464K, used 137930K [0x00000000c0000000, 0x00000000cb4a0000, 0x00000000d5550000)
>>>   eden space 147968K, 85% used [0x00000000c0000000, 0x00000000c7b5b8b8, 0x00000000c9080000)
>>>   from space 18496K, 62% used [0x00000000ca290000, 0x00000000cade6fa8, 0x00000000cb4a0000)
>>>   to space 18496K, 0% used [0x00000000c9080000, 0x00000000c9080000, 0x00000000ca290000)
>>>  concurrent mark-sweep generation total 370048K, used 18057K [0x00000000d5550000, 0x00000000ebeb0000, 0x0000000100000000)
>>>  Metaspace used 55061K, capacity 56096K, committed 56400K, reserved 1099776K
>>>   class space used 5899K, capacity 6255K, committed 6264K, reserved 1048576K
>>>
>>> In master.log:
>>> 2017-03-23 02:02:09,178 WARN [master/nn3/192.168.80.51:16000-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>>>     at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>>> 2017-03-23 02:02:10,579 FATAL [main-EventThread] master.HMaster: Master server abort: loaded coprocessors are: [org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor, org.apache.hadoop.hbase.backup.master.BackupController, org.apache.hadoop.hbase.security.visibility.VisibilityController]
>>> 2017-03-23 02:02:10,857 FATAL [main-EventThread] master.HMaster: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>>>     at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>>> 2017-03-23 02:02:10,090 INFO [main-SendThread(nn3:2181)] zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x15adbb9b9db078a has expired, closing socket connection
>>> 2017-03-23 02:02:09,181 WARN [nn3:16000.activeMasterManager-EventThread] client.ConnectionManager$HConnectionImplementation: This client just lost it's session with ZooKeeper, closing it. It will be recreated next time someone needs it
>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:585)
>>>     at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:517)
>>>     at org.apache.hadoop.hbase.zookeeper.PendingWatcher.process(PendingWatcher.java:40)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:534)
>>>     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:510)
>>> 2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25adbb9ba62075d
>>> 2017-03-23 02:02:10,894 INFO [nn3:16000.activeMasterManager-EventThread] zookeeper.ClientCnxn: EventThread shut down
>>> 2017-03-23 02:02:10,876 INFO [master/nn3/192.168.80.51:16000-EventThread] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x25adbb9ba62075c
>>> 2017-03-23 02:02:10,897 INFO [master/nn3/192.168.80.51:16000-EventThread] zookeeper.ClientCnxn: EventThread shut down
>>> 2017-03-23 02:02:10,925 INFO [main-EventThread] regionserver.HRegionServer: STOPPED: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure master:16000-0x15adbb9b9db078a received expired from ZooKeeper, aborting
>>> 2017-03-23 02:02:10,935 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
>>> 2017-03-23 02:02:11,005 INFO [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: Stopping infoServer
>>> 2017-03-23 02:02:11,624 INFO [nn3,16000,1490185417271_splitLogManager__ChoreService_1] master.SplitLogManager$TimeoutMonitor: Chore: SplitLogManager Timeout Monitor was stopped
>>> 2017-03-23 02:02:11,628 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>>> 2017-03-23 02:02:12,104 INFO [master/nn3/192.168.80.51:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
>>> 2017-03-23 02:02:11,628 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>>> 2017-03-23 02:02:12,104 INFO [master/nn3/192.168.80.51:16000] mortbay.log: Stopped SelectChannelConnector@0.0.0.0:16010
>>> 2017-03-23 02:02:12,286 INFO [master/nn3/192.168.80.51:16000] procedure2.ProcedureExecutor: Stopping the procedure executor
>>> 2017-03-23 02:02:12,336 INFO [master/nn3/192.168.80.51:16000] wal.WALProcedureStore: Stopping the WAL Procedure Store
>>> 2017-03-23 02:02:13,044 WARN [nn3,16000,1490185417271_ChoreService_1] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/backup-masters
>>> 2017-03-23 02:02:14,497 INFO [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: stopping server nn3,16000,1490185417271
>>> 2017-03-23 02:02:14,514 INFO [master/nn3/192.168.80.51:16000] regionserver.HRegionServer: stopping server nn3,16000,1490185417271; all regions closed.
>>> 2017-03-23 02:02:14,532 INFO [master/nn3/192.168.80.51:16000] hbase.ChoreService: Chore service for: nn3,16000,1490185417271 had [[ScheduledChore: Name: CatalogJanitor-nn3:16000 Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: LogsCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-ExpiredMobFileCleanerChore Period: 86400 Unit: SECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-MobCompactionChore Period: 604800 Unit: SECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-ClusterStatusChore Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-BalancerChore Period: 300000 Unit: MILLISECONDS], [ScheduledChore: Name: HFileCleaner Period: 60000 Unit: MILLISECONDS], [ScheduledChore: Name: nn3,16000,1490185417271-RegionNormalizerChore Period: 1800000 Unit: MILLISECONDS]] on shutdown
>>> 2017-03-23 02:02:14,630 INFO [master/nn3/192.168.80.51:16000] master.MasterMobCompactionThread: Waiting for Mob Compaction Thread to finish...
>>> 2017-03-23 02:02:14,644 INFO [master/nn3/192.168.80.51:16000] master.MasterMobCompactionThread: Waiting for Region Server Mob Compaction Thread to finish...
>>> 2017-03-23 02:02:14,671 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:02:15,684 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:02:17,684 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:02:21,685 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:02:29,685 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:02:45,686 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:03:17,686 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:04:21,686 WARN [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, exception=org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> 2017-03-23 02:04:21,687 ERROR [master/nn3/192.168.80.51:16000] zookeeper.RecoverableZooKeeper: ZooKeeper getData failed after 7 attempts
>>> 2017-03-23 02:04:21,687 WARN [master/nn3/192.168.80.51:16000] zookeeper.ZKUtil: master:16000-0x15adbb9b9db078a, quorum=bigdata33:2181,bigdata36:2181,nn3:2181, baseZNode=/hbase-unsecure Unable to get data of znode /hbase-unsecure/master
>>> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /hbase-unsecure/master
>>> ...
>>>
>>> hbase-site.xml:
>>> <configuration>
>>>   <property>
>>>     <name>dfs.client.read.shortcircuit</name>
>>>     <value>true</value>
>>>   </property>
>>>   <property>
>>>     <name>dfs.domain.socket.path</name>
>>>     <value>/var/lib/hadoop-hdfs/dn_socket</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.bulkload.staging.dir</name>
>>>     <value>/apps/hbase/staging</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.client.keyvalue.maxsize</name>
>>>     <value>1048576</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.client.retries.number</name>
>>>     <value>35</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.client.scanner.caching</name>
>>>     <value>100</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.client.scanner.timeout.period</name>
>>>     <value>600000</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.cluster.distributed</name>
>>>     <value>true</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.coprocessor.master.classes</name>
>>>     <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.coprocessor.region.classes</name>
>>>     <value>org.apache.hadoop.hbase.security.visibility.VisibilityController,org.apache.hadoop.hbase.security.access.SecureBulkLoadEndpoint,org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.coprocessor.regionserver.classes</name>
>>>     <value>org.apache.ranger.authorization.hbase.RangerAuthorizationCoprocessor</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hregion.majorcompaction</name>
>>>     <value>604800000</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hregion.majorcompaction.jitter</name>
>>>     <value>0.50</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hregion.max.filesize</name>
>>>     <value>10737418240</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hregion.memstore.block.multiplier</name>
>>>     <value>4</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hregion.memstore.flush.size</name>
>>>     <value>134217728</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hregion.memstore.mslab.enabled</name>
>>>     <value>true</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hstore.blockingStoreFiles</name>
>>>     <value>10</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hstore.compaction.max</name>
>>>     <value>10</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.hstore.compactionThreshold</name>
>>>     <value>3</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.local.dir</name>
>>>     <value>${hbase.tmp.dir}/local</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.master.info.bindAddress</name>
>>>     <value>0.0.0.0</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.master.info.port</name>
>>>     <value>16010</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.master.loadbalance.bytable</name>
>>>     <value>true</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.master.port</name>
>>>     <value>16000</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.master.ui.readonly</name>
>>>     <value>false</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.regionserver.global.memstore.size</name>
>>>     <value>0.4</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.regionserver.handler.count</name>
>>>     <value>30</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.regionserver.info.port</name>
>>>     <value>16030</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.regionserver.port</name>
>>>     <value>16020</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.regionserver.wal.codec</name>
>>>     <value>org.apache.hadoop.hbase.regionserver.wal.WALCellCodec</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.rootdir</name>
>>>     <value>hdfs://nn3:8020/apps/hbase/data</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.rpc.protection</name>
>>>     <value>authentication</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.rpc.timeout</name>
>>>     <value>90000</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.security.authentication</name>
>>>     <value>simple</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.security.authorization</name>
>>>     <value>true</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.superuser</name>
>>>     <value>hbase</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.tmp.dir</name>
>>>     <value>/tmp/hbase-${user.name}</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.zookeeper.property.clientPort</name>
>>>     <value>2181</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.zookeeper.quorum</name>
>>>     <value>bigdata33,bigdata36,nn3</value>
>>>   </property>
>>>   <property>
>>>     <name>hbase.zookeeper.useMulti</name>
>>>     <value>true</value>
>>>   </property>
>>>   <property>
>>>     <name>hfile.block.cache.size</name>
>>>     <value>0.4</value>
>>>   </property>
>>>   <property>
>>>     <name>hfile.format.version</name>
>>>     <value>3</value>
>>>   </property>
>>>   <property>
>>>     <name>phoenix.query.timeoutMs</name>
>>>     <value>60000</value>
>>>   </property>
>>>   <property>
>>>     <name>replication.executor.workers</name>
>>>     <value>2</value>
>>>   </property>
>>>   <property>
>>>     <name>replication.sleep.before.failover</name>
>>>     <value>60000</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.recovery.retry</name>
>>>     <value>6</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.session.timeout</name>
>>>     <value>90000</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.znode.parent</name>
>>>     <value>/hbase-unsecure</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.znode.replication</name>
>>>     <value>replication</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.znode.replication.peers</name>
>>>     <value>peers</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.znode.replication.peers.state</name>
>>>     <value>peer-state</value>
>>>   </property>
>>>   <property>
>>>     <name>zookeeper.znode.replication.rs</name>
>>>     <value>rs</value>
>>>   </property>
>>> </configuration>
>>>
>>> Any hints?