I checked the tablet servers across all six of our environments, and the
CLOSE_WAIT issue appears to be present in all of them, with some servers
holding upwards of 73k connections.
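(For anyone reproducing the check: the per-server count was roughly the
lines below. This is a minimal sketch -- the jps-based PID lookup is an
assumption and may need adjusting for how your tservers are launched.)

    # Find the tserver PID, then count its sockets stuck in CLOSE_WAIT
    TSERVER_PID=$(jps -m | awk '/tserver/ {print $1}')
    lsof -n -i -a -p "$TSERVER_PID" | grep -c CLOSE_WAIT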
I disabled replication in our dev cluster and restarted the tablet
servers. Left it running overnight and checked the connections -- a
reasonable number, in the single or double digits. Enabling replication
led to a quick climb in CLOSE_WAIT connections to a couple thousand, which
makes me think it is some lingering connection reading a WAL file from
HDFS. I've opened ACCUMULO-4787
<https://issues.apache.org/jira/browse/ACCUMULO-4787> to track this and we
can move discussion over there.

--Adam

On Thu, Jan 25, 2018 at 12:23 PM, Christopher <ctubb...@apache.org> wrote:

> Interesting. It's possible we're mishandling an IOException from
> DFSClient or something... but it's also possible there's a bug in
> DFSClient somewhere. I found a few similar issues from the past... some
> might still not be fully resolved:
>
> https://issues.apache.org/jira/browse/HDFS-1836
> https://issues.apache.org/jira/browse/HDFS-2028
> https://issues.apache.org/jira/browse/HDFS-6973
> https://issues.apache.org/jira/browse/HBASE-9393
>
> The HBASE issue is interesting, because it points to a new HDFS feature
> in 2.6.4 to clear readahead buffers/sockets
> (https://issues.apache.org/jira/browse/HDFS-7694). That might be a
> feature we're not yet utilizing, but it would only work on a newer
> version of HDFS.
>
> I would also try to grab some jstacks of the tserver to figure out which
> HDFS client code paths are being taken and where the leak might be
> occurring. Also, if you have any debug logs for the tserver, those might
> help. There might be some DEBUG or WARN items that indicate retries or
> other failures that are occurring but perhaps handled improperly.
>
> It's probably less likely, but it could also be a Java or Linux issue. I
> wouldn't even know where to begin debugging at that level, though, other
> than to check for OS updates. What JVM are you running?
>
> It's possible it's not a leak... and these are just getting cleaned up
> too slowly. That might be something that can be tuned with sysctl.
>
> On Thu, Jan 25, 2018 at 11:27 AM Adam J. Shook <adamjsh...@gmail.com>
> wrote:
>
>> We're running Ubuntu 14.04, HDFS 2.6.0, ZooKeeper 3.4.6, and Accumulo
>> 1.8.1. I'm using `lsof -i` and grepping for the tserver PID to list all
>> the connections. Just now there are ~25k connections for this one
>> tserver, of which 99.9% are writing to various DataNodes on port 50010.
>> It's split about 50/50 between connections that are CLOSE_WAIT and ones
>> that are ESTABLISHED. No special RPC configuration.
>>
>> On Wed, Jan 24, 2018 at 7:53 PM, Josh Elser <josh.el...@gmail.com>
>> wrote:
>>
>>> +1 to looking at the remote end of the socket and seeing where they're
>>> going/coming to/from. I've seen a few HDFS JIRA issues filed about
>>> sockets left in CLOSE_WAIT.
>>>
>>> Lucky you, this is a fun Linux rabbit hole to go down :)
>>>
>>> (https://blog.cloudflare.com/this-is-strictly-a-violation-of-the-tcp-specification/
>>> covers some of the technical details)
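(As a concrete version of Josh's suggestion, a sketch like the one below
summarizes the remote ends of the tserver's CLOSE_WAIT sockets. The awk
column positions assume Linux netstat output, and TSERVER_PID is the
variable from the earlier snippet.)

    # Tally remote endpoints of CLOSE_WAIT sockets held by the tserver;
    # a pile of DataNode addresses on port 50010 would implicate DFSClient
    netstat -tnp 2>/dev/null \
      | awk -v pid="$TSERVER_PID" '$6 == "CLOSE_WAIT" && $7 ~ pid {print $5}' \
      | sort | uniq -c | sort -rn | head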
>>> On 1/24/18 6:37 PM, Christopher wrote:
>>>
>>>> I haven't seen that, but I'm curious what OS, Hadoop, ZooKeeper, and
>>>> Accumulo versions you're running. I'm assuming you verified that it
>>>> was the TabletServer process holding these TCP sockets open, using
>>>> `netstat -p` and cross-referencing the PID with `jps -ml` (or
>>>> similar)? Are you able to confirm based on the port number that these
>>>> were Thrift connections, or could they be ZooKeeper or Hadoop
>>>> connections? Do you have any special non-default Accumulo RPC
>>>> configuration (SSL or SASL)?
>>>>
>>>> On Wed, Jan 24, 2018 at 3:46 PM Adam J. Shook <adamjsh...@gmail.com>
>>>> wrote:
>>>>
>>>>     Hello all,
>>>>
>>>>     Has anyone come across an issue with a TabletServer occupying a
>>>>     large number of ports in a CLOSE_WAIT state? A 'normal' number of
>>>>     used ports on our 12-node cluster is around 12,000 to 20,000. In
>>>>     one instance there were over 68k, and it was preventing other
>>>>     applications from getting a free port -- they would fail to start
>>>>     (which is how we found this in the first place).
>>>>
>>>>     Thank you,
>>>>     --Adam
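(For reference, grabbing the jstacks Christopher suggested can be as
simple as the loop below. This is a rough sketch: the count, interval, and
file names are arbitrary, and TSERVER_PID comes from the first snippet.)

    # Take a handful of thread dumps to catch in-flight HDFS client calls
    for i in 1 2 3 4 5; do
        jstack "$TSERVER_PID" > "tserver-jstack-$i.txt"
        sleep 10
    done
    # Then look for DFSClient / DataStreamer frames across the dumps
    grep -l 'org.apache.hadoop.hdfs' tserver-jstack-*.txt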