Hey everyone,

We are seeing a file descriptor leak that only affects the nodes in our
cluster running 5.7.0, not those still running 5.3.8. Running lsof against
an affected regionserver shows 10k+ unix sockets listed simply as
"socket", as well as another 10k+ entries of the form
"/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>". The
two seem related, given how closely the counts track each other.

We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
handled the namenode upgrade separately). The 5.3.8 nodes *do not*
experience this issue; the 5.7.0 nodes *do*. We are holding off on
upgrading more regionservers until we can figure this out. I'm not sure
whether any intermediate versions between the two have the issue.

We traced the root cause to a Hadoop job running against a basic table:

'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
MEMSTORE_FLUSHSIZE => '67108864'}}, {NAME => '0', VERSIONS => '50',
BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
{'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}

This is very similar to all of our other tables (we have many). However,
its regions are fairly large, 40+ GB per region, compressed. That has not
been a problem for us previously.

The Hadoop job is a simple TableMapper job with no special parameters,
though we haven't yet updated our client to the latest version (we'll do
that once we finish the server side). The job runs on a separate Hadoop
cluster and accesses the HBase cluster remotely. It does no reads or
writes other than the TableMapper scans.
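
For reference, here is a minimal sketch of the kind of job involved (class
and job names are placeholders, not our actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class MyTableScanJob {

  // Mapper only reads rows; it emits nothing and writes nothing back.
  static class MyMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value,
        Context context) {
      // per-row processing happens here
    }
  }

  public static void main(String[] args) throws Exception {
    // Picks up the hbase-site.xml that points at the remote HBase cluster.
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "my-table-1 scan");
    job.setJarByClass(MyTableScanJob.class);

    Scan scan = new Scan();
    scan.setCacheBlocks(false);  // standard setting for MR scans

    TableMapReduceUtil.initTableMapperJob(
        "my-table-1", scan, MyMapper.class,
        NullWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);  // map-only, no writes to HBase

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}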

Moving the regions off an affected server, or killing the Hadoop job,
causes the file descriptor count to gradually drop back to normal.

Any ideas?

Thanks,

Bryan
