On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> Hey everyone,
>
> We are noticing a file descriptor leak that is only affecting nodes in our
> cluster running 5.7.0, not those still running 5.3.8.


Translation: roughly hbase-1.2.0+hadoop-2.6.0 vs hbase-0.98.6+hadoop-2.5.0.


> I ran an lsof against
> an affected regionserver, and noticed that there were 10k+ unix sockets
> that are just called "socket", as well as another 10k+ of the form
> "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>". The
> two seem related based on how closely the counts match.
>
> We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
> handled the namenode upgrade separately).  The 5.3.8 nodes *do not*
> experience this issue. The 5.7.0 nodes *do*. We are holding off upgrading
> more regionservers until we can figure this out. I'm not sure if any
> intermediate versions between the two have the issue.
>
> We traced the root cause to a hadoop job running against a basic table:
>
> 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
> MEMSTORE_FLUSHSIZE => '67108864'}, {NAME => '0', VERSIONS => '50',
> BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
> {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
>
> This is very similar to all of our other tables (we have many).


You are doing MR against some of these other tables also? They have different
schemas? No leaks with those?



> However,
> its regions are getting up there in size, 40+GB per region, compressed.
> This has not been an issue for us previously.
>
> The hadoop job is a simple TableMapper job with no special parameters,
> though we haven't yet updated our client to the latest version (we will do
> that once we finish the server side). The hadoop job runs on a separate
> hadoop cluster, remotely accessing the HBase cluster. It does not do any
> other reads or writes outside of the TableMapper scans.
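
For reference, a minimal sketch of such a scan-only job is below. This is not
Bryan's actual code; the class names, table name, caching values, and the
assumption that the remote cluster's hbase.zookeeper.quorum is already in the
client configuration are all illustrative:

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
  import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
  import org.apache.hadoop.hbase.mapreduce.TableMapper;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class MyTableScanJob {

    // Map-only scan of the table; no writes back to HBase.
    static class ScanMapper extends TableMapper<NullWritable, NullWritable> {
      @Override
      protected void map(ImmutableBytesWritable row, Result value, Context context)
          throws IOException, InterruptedException {
        // process each row here
      }
    }

    public static void main(String[] args) throws Exception {
      // Assumes hbase-site.xml pointing at the remote HBase cluster is on the classpath.
      Configuration conf = HBaseConfiguration.create();
      Job job = Job.getInstance(conf, "my-table-1 scan");
      job.setJarByClass(MyTableScanJob.class);

      Scan scan = new Scan();
      scan.setCaching(500);        // example scanner caching for MR scans (assumption)
      scan.setCacheBlocks(false);  // don't pollute the block cache on full-table scans

      TableMapReduceUtil.initTableMapperJob(
          "my-table-1", scan, ScanMapper.class,
          NullWritable.class, NullWritable.class, job);

      job.setOutputFormatClass(NullOutputFormat.class);
      job.setNumReduceTasks(0);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }
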
>
> Moving the regions off of an affected server, or killing the hadoop job,
> causes the file descriptors to gradually go back down to normal.
>
>
> Any ideas?
>
>
Is it just the FD cache running 'normally'? 10k seems like a lot though.
256 seems to be the default in HDFS, but maybe it is different in CM or in
HBase?
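
A quick way to dump the effective client-side values would be something like
the sketch below. It assumes the cluster's hdfs-site.xml is on the classpath,
and the fallback defaults (256 streams, 5 minute expiry) are what I believe
stock HDFS ships with:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hdfs.HdfsConfiguration;

  public class ShortCircuitCacheCheck {
    public static void main(String[] args) {
      // HdfsConfiguration pulls in hdfs-default.xml and hdfs-site.xml from the classpath.
      Configuration conf = new HdfsConfiguration();
      // 256 is believed to be the stock HDFS default for the stream cache size.
      int cacheSize = conf.getInt(
          "dfs.client.read.shortcircuit.streams.cache.size", 256);
      // Cached streams are held open until this expiry (5 minutes by default, I believe).
      long expiryMs = conf.getLong(
          "dfs.client.read.shortcircuit.streams.cache.expiry.ms", 300000L);
      System.out.println("streams.cache.size      = " + cacheSize);
      System.out.println("streams.cache.expiry.ms = " + expiryMs);
    }
  }
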

What is your dfs.client.read.shortcircuit.streams.cache.size set to?
St.Ack



> Thanks,
>
> Bryan
>
