>
> Oh one other thought.  It could have been a client of another process
> holding that port, say hdfs.  You could use the linux command lsof to
> verify that is the problem
>

Unfortunately we did not take an lsof dump at that time. It could have
clearly pinpointed the problem. But our controller servers were the only
ones connecting to 40020 & we restarted them too.

> I'm guessing the process was in an OOM state and when you killed it the
> proc turned into a zombie process for a moment.  I have seen that happen
> with java processes before.  There's not a lot that can be done except wait
> for the process to be terminated


I think this is what happened. The load average was quite high & the system
froze up for 10 mins. It was natural that restarts failed.

But what worries me is that, for nearly 45 minutes after the shard process
quit (well after the load average dropped below 1), the thrift sockets were
still in TIME_WAIT & did not close cleanly. As a result, multiple restart
attempts failed.

Even more puzzling, ThriftServer has SOCKET_REUSE_ADDR set to true by
default, which is meant specifically to address thrift ports stuck in
TIME_WAIT. But restarts still failed.
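
For what it's worth, here is a minimal sketch of the address-reuse behaviour
in plain java.net (not Blur's or Thrift's actual code; the class and method
names are made up for illustration). On Linux, SO_REUSEADDR lets a bind
succeed over old connections sitting in TIME_WAIT, but as far as I know it
does not help while another socket is still listening on the port. So a
failed bind despite the flag may hint that a listener was still alive.

```java
import java.io.IOException;
import java.net.BindException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical demo class, not part of Blur or Thrift.
class ReuseAddrSketch {

    // Open a listening socket with SO_REUSEADDR set before bind().
    static ServerSocket listen(int port) throws IOException {
        ServerSocket socket = new ServerSocket();      // created unbound
        socket.setReuseAddress(true);                  // must precede bind()
        try {
            socket.bind(new InetSocketAddress(port));  // port 0 = any free port
        } catch (IOException e) {
            socket.close();                            // don't leak the descriptor
            throw e;
        }
        return socket;
    }

    // True if a second bind to the same port fails while the first
    // listener is still open, even though both set SO_REUSEADDR.
    static boolean secondBindFails() {
        try (ServerSocket first = listen(0)) {
            try (ServerSocket second = listen(first.getLocalPort())) {
                return false;
            } catch (BindException e) {
                return true;
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second bind fails: " + secondBindFails());
    }
}
```

If that holds, checking for a process still in LISTEN on 40020 during the
next incident would tell us more than the TIME_WAIT entries do.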

--
Ravi

On Wed, Apr 27, 2016 at 7:24 AM, Tim Williams <[email protected]> wrote:

> On Tue, Apr 26, 2016 at 2:50 AM, Ravikumar Govindarajan
> <[email protected]> wrote:
> > A shard-server was heavily loaded yesterday & ultimately crashed with an
> > OOM.
> >
> > I tried to restart the shard-server but it quit with the following error
> >
> > INFO  20160425_05:10:02:556_PDT [main] thrift.ThriftBlurShardServer:
> > Setting up Shard Server
> > INFO  20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit:
> > core file size          (blocks, -c) unlimited
> > INFO  20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit:
> > data seg size           (kbytes, -d) unlimited
> > INFO  20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit:
> > scheduling priority             (-e) 0
> > INFO  20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit:
> > file size               (blocks, -f) unlimited
> > INFO  20160425_05:10:02:581_PDT [main] thrift.ThriftServer: ulimit:
> > pending signals                 (-i) 1031474
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > max locked memory       (kbytes, -l) 64
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > max memory size         (kbytes, -m) unlimited
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > open files                      (-n) 65536
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > pipe size            (512 bytes, -p) 8
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > POSIX message queues     (bytes, -q) 819200
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > real-time priority              (-r) 0
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > stack size              (kbytes, -s) 10240
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > cpu time               (seconds, -t) unlimited
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > max user processes              (-u) 3047
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > virtual memory          (kbytes, -v) unlimited
> > INFO  20160425_05:10:02:582_PDT [main] thrift.ThriftServer: ulimit:
> > file locks                      (-x) unlimited
> > INFO  20160425_05:10:02:686_PDT [main] utils.GCWatcherJdk7:
> > GCWatcherJdk7 was setup.
> > ERROR 20160425_05:10:02:719_PDT [main]
> > concurrent.SimpleUncaughtExceptionHandler: Unknown error in thread
> > [Thread[main,5,main]]
> > org.apache.blur.thirdparty.thrift_0_9_0.transport.TTransportException:
> > Could not create ServerSocket on address /0.0.0.0:40020.
> >         at org.apache.blur.thirdparty.thrift_0_9_0.transport.TNonblockingServerSocket.<init>(TNonblockingServerSocket.java:91)
> >         at org.apache.blur.thirdparty.thrift_0_9_0.transport.TNonblockingServerSocket.<init>(TNonblockingServerSocket.java:73)
> >         at org.apache.blur.thrift.ThriftServer.getTNonblockingServerSocket(ThriftServer.java:246)
> >         at org.apache.blur.thrift.ThriftBlurShardServer.createServer(ThriftBlurShardServer.java:155)
> >         at org.apache.blur.thrift.ThriftBlurShardServer.main(ThriftBlurShardServer.java:139)
> >
> > The shard-server process was killed but the thrift port seemed to be
> > hanging on (guess it was in TIME_WAIT) & not released.
> >
> > I also saw HttpJettyServer reporting the same issue when I attempted
> > restart for a second time...
> >
> > *Caused by: java.net.BindException: Address already in use*
> > at java.net.PlainSocketImpl.socketBind(Native Method)
> > at java.net.AbstractPlainSocketImpl.bind(AbstractPlainSocketImpl.java:376)
> > at java.net.ServerSocket.bind(ServerSocket.java:376)
> > at java.net.ServerSocket.<init>(ServerSocket.java:237)
> > at java.net.ServerSocket.<init>(ServerSocket.java:181)
> > at org.mortbay.jetty.bio.SocketConnector.newServerSocket(SocketConnector.java:80)
> > at org.mortbay.jetty.bio.SocketConnector.open(SocketConnector.java:73)
> > at org.mortbay.jetty.AbstractConnector.doStart(AbstractConnector.java:283)
> > at org.mortbay.jetty.bio.SocketConnector.doStart(SocketConnector.java:147)
> > at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> > at org.mortbay.jetty.Server.doStart(Server.java:235)
> > at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> > at org.apache.blur.gui.HttpJettyServer.<init>(HttpJettyServer.java:93)
>
> How much time passed between killing and starting it?  Did you check the
> network state (e.g. netstat -pan | grep 40020)?
>
> --tim
>
