> Oh one other thought. It could have been a client of another process
> holding that port, say hdfs. You could use the linux command lsof to
> verify that is the problem

Unfortunately we did not take the lsof dump at that time. It could have clearly pinpointed the problem. But our controller servers were the only ones connecting to 40020 & we restarted them too.

> I'm guessing if the process was in an OOM state and when you killed the
> proc it turned into a zombie process for a moment. I have seen that happen
> with java processes before. There's not a lot that can be done except wait
> for the process to be terminated

I think this is what happened. The load-average was quite high & the system froze up for 10 mins. It was natural that restarts failed.

But what worries me is that for nearly 45 minutes after the shard-process quit (well after the load-average dropped to < 1) the thrift sockets were still in TIME_WAIT & did not close cleanly, & as a result multiple restart attempts failed.

Even more puzzling, ThriftServer has SOCKET_REUSE_ADDR set to true by default, which is meant specifically to address thrift ports stuck in TIME_WAIT. But restarts still failed.

--
Ravi

On Wed, Apr 27, 2016 at 7:24 AM, Tim Williams <[email protected]> wrote:

> On Tue, Apr 26, 2016 at 2:50 AM, Ravikumar Govindarajan
> <[email protected]> wrote:
> > A shard-server was heavily loaded yesterday & ultimately crashed with an
> > OOM.
> >
> > I tried to restart the shard-server but it quit with the following error
> >
> > INFO 20160425_05:10:02:556_PDT [main] thrift.ThriftBlurShardServer: Setting up Shard Server
> > INFO 20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit: core file size (blocks, -c) unlimited
> > INFO 20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit: data seg size (kbytes, -d) unlimited
> > INFO 20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit: scheduling priority (-e) 0
> > INFO 20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit: file size (blocks, -f) unlimited
> > INFO 20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit: pending signals (-i) 1031474
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: max locked memory (kbytes, -l) 64
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: max memory size (kbytes, -m) unlimited
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: open files (-n) 65536
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: pipe size (512 bytes, -p) 8
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: POSIX message queues (bytes, -q) 819200
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: real-time priority (-r) 0
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: stack size (kbytes, -s) 10240
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: cpu time (seconds, -t) unlimited
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: max user processes (-u) 3047
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: virtual memory (kbytes, -v) unlimited
> > INFO 20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit: file locks (-x) unlimited
> > INFO 20160425_05:10:02:686_PDT [main] utils.GCWatcherJdk7: GCWatcherJdk7 was setup.
> > ERROR 20160425_05:10:02:719_PDT [main] concurrent.SimpleUncaughtExceptionHandler: Unknown error in thread [Thread[main,5,main]]
> > org.apache.blur.thirdparty.thrift_0_9_0.transport.TTransportException: Could not create ServerSocket on address /0.0.0.0:40020.
> >   at org.apache.blur.thirdparty.thrift_0_9_0.transport.TNonblockingServerSocket.<init>(TNonblockingServerSocket.java:91)
> >   at org.apache.blur.thirdparty.thrift_0_9_0.transport.TNonblockingServerSocket.<init>(TNonblockingServerSocket.java:73)
> >   at org.apache.blur.thrift.ThriftServer.getTNonblockingServerSocket(ThriftServer.java:246)
> >   at org.apache.blur.thrift.ThriftBlurShardServer.createServer(ThriftBlurShardServer.java:155)
> >   at org.apache.blur.thrift.ThriftBlurShardServer.main(ThriftBlurShardServer.java:139)
> >
> > The shard-server process was killed but the thrift port seemed to be hanging on
> > (guess it was in TIME_WAIT) & was not released.
> >
> > I also saw HttpJettyServer reporting the same issue when I attempted a
> > restart for a second time...
> >
> > *Caused by: java.net.BindException: Address already in use*
> >   at java.net.PlainSocketImpl.socketBind(Native Method)
> >   at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
> >   at java.net.ServerSocket.bind(ServerSocket.java:376)
> >   at java.net.ServerSocket.<init>(ServerSocket.java:237)
> >   at java.net.ServerSocket.<init>(ServerSocket.java:181)
> >   at org.mortbay.jetty.bio.SocketConnector.newServerSocket(SocketConnector.java:80)
> >   at org.mortbay.jetty.bio.SocketConnector.open(SocketConnector.java:73)
> >   at org.mortbay.jetty.AbstractConnector.doStart(AbstractConnector.java:283)
> >   at org.mortbay.jetty.bio.SocketConnector.doStart(SocketConnector.java:147)
> >   at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >   at org.mortbay.jetty.Server.doStart(Server.java:235)
> >   at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >   at org.apache.blur.gui.HttpJettyServer.<init>(HttpJettyServer.java:93)
>
> How much time between killing and starting it? Did you check the
> network state (e.g. netstat -pan | grep 40020)?
>
> --tim
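
For reference, here is a minimal Java sketch of the bind-with-reuse pattern this thread keeps coming back to. The class and method names (ReuseAddrBind, bindWithReuse) are made up for illustration and are not Blur or Thrift code, and the default port 40020 is simply the one from the logs above. It relies on the documented java.net.ServerSocket behaviour that SO_REUSEADDR must be enabled on an unbound socket before bind() to have a defined effect. As far as I know the option only helps with leftover connections in a timeout state; if some live process still holds the port, bind is expected to fail with "Address already in use" regardless.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class ReuseAddrBind {

    // Hypothetical helper: create an unbound socket, enable SO_REUSEADDR,
    // and only then bind. Setting the option after bind is not defined.
    static ServerSocket bindWithReuse(int port) throws IOException {
        ServerSocket server = new ServerSocket(); // no-arg constructor leaves it unbound
        try {
            server.setReuseAddress(true);
            server.bind(new InetSocketAddress(port)); // wildcard address, given port
            return server;
        } catch (IOException e) {
            server.close(); // do not leak the descriptor when bind fails
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Port is an assumption for illustration; pass another one as the first argument.
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 40020;
        try (ServerSocket server = bindWithReuse(port)) {
            System.out.println("Bound to port " + server.getLocalPort()
                    + ", SO_REUSEADDR=" + server.getReuseAddress());
        } catch (BindException e) {
            // Something other than a timed-out connection is still holding the port.
            System.out.println("Port " + port + " is still held: " + e.getMessage());
        }
    }
}
```

If a bind like this still fails long after the old process is gone, that would point towards a live holder of the port (which lsof or netstat can confirm) rather than plain TIME_WAIT leftovers.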
