For reference, the Scan backing the job is pretty basic:

Scan scan = new Scan();
scan.setCaching(500); // probably too small for the data size we're dealing with
scan.setCacheBlocks(false);
scan.setScanMetricsEnabled(true);
scan.setMaxVersions(1);
scan.setTimeRange(startTime, stopTime);

Otherwise it is using the out-of-the-box TableInputFormat.
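
In case it helps, here is roughly what that wiring looks like end to end -- a
sketch only (throwaway class names, a trivial row-counting mapper, time range
taken from args), not our actual job, just the standard TableMapReduceUtil +
TableInputFormat pattern:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MyTableScanJob {

  // Stand-in mapper: just counts rows via a counter; the real job's logic differs.
  public static class RowCountMapper extends TableMapper<NullWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable rowKey, Result row, Context context) {
      context.getCounter("scan", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    long startTime = Long.parseLong(args[0]);
    long stopTime = Long.parseLong(args[1]);

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false);       // don't churn the block cache with a full scan
    scan.setScanMetricsEnabled(true);
    scan.setMaxVersions(1);
    scan.setTimeRange(startTime, stopTime);

    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "scan my-table-1");
    job.setJarByClass(MyTableScanJob.class);

    // Uses the stock TableInputFormat under the hood.
    TableMapReduceUtil.initTableMapperJob(
        "my-table-1", scan, RowCountMapper.class,
        NullWritable.class, NullWritable.class, job);

    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}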

On Mon, May 23, 2016 at 3:13 PM Bryan Beaudreault <bbeaudrea...@hubspot.com>
wrote:

> I've forced the issue to happen again. netstat takes a while to run on
> this host while it's happening, but I do not see an abnormal number of
> sockets in CLOSE_WAIT (compared to other hosts).
>
> I forced more than the usual number of regions for the affected table onto the
> host to speed up the process. File descriptors are now growing quite
> rapidly, about 8-10 per second.
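>
> As an aside, for anyone who wants to graph this from inside a JVM rather than
> by polling lsof, a generic sketch (not something from our deploy) that samples
> a process's own open-FD count once a second via the HotSpot MXBean; run inside
> the regionserver it reports that JVM, run standalone it only reports itself:
>
> import java.lang.management.ManagementFactory;
> import com.sun.management.UnixOperatingSystemMXBean;
>
> public class FdWatcher {
>   public static void main(String[] args) throws InterruptedException {
>     // On HotSpot/OpenJDK on Linux the OS bean is a UnixOperatingSystemMXBean.
>     UnixOperatingSystemMXBean os =
>         (UnixOperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
>     long previous = os.getOpenFileDescriptorCount();
>     while (true) {
>       Thread.sleep(1000L);
>       long current = os.getOpenFileDescriptorCount();
>       // Print the current count, the per-second delta, and the process FD limit.
>       System.out.printf("open fds: %d (%+d/s, limit %d)%n",
>           current, current - previous, os.getMaxFileDescriptorCount());
>       previous = current;
>     }
>   }
> }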
>
> This is what lsof looks like, multiplied by a couple thousand:
>
> COMMAND   PID  USER   FD      TYPE             DEVICE    SIZE/OFF NODE NAME
> java    23180 hbase  DEL    REG               0,16             3848784656 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1702253823
> java    23180 hbase  DEL    REG               0,16             3847643924 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1614925966
> java    23180 hbase  DEL    REG               0,16             3847614191 /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_888427288
>
> The only thing that varies is the int at the end.
>
> > Anything about the job itself that is holding open references or
> throwing away files w/o closing them?
>
> The MR job does a TableMapper directly against HBase, which as far as I
> know uses the HBase RPC and does not hit HDFS directly at all. Is it
> possible that a long-running scan (one with many, many next() calls) could
> keep some references to HDFS open for the duration of the overall scan?
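>
> For reference, the scanning the mapper's record reader does boils down to the
> plain client pattern below -- a rough sketch, not the actual job code: one
> long-lived ResultScanner issuing many next() calls, all over the HBase RPC, so
> any HDFS streams behind it would presumably be held open on the regionserver
> side rather than by the client.
>
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Connection;
> import org.apache.hadoop.hbase.client.ConnectionFactory;
> import org.apache.hadoop.hbase.client.Result;
> import org.apache.hadoop.hbase.client.ResultScanner;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.client.Table;
>
> public class LongScanSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = HBaseConfiguration.create();
>     Scan scan = new Scan();
>     scan.setCaching(500);
>     scan.setCacheBlocks(false);
>     scan.setMaxVersions(1);
>
>     long rows = 0;
>     try (Connection conn = ConnectionFactory.createConnection(conf);
>          Table table = conn.getTable(TableName.valueOf("my-table-1"));
>          ResultScanner scanner = table.getScanner(scan)) {
>       // One scanner, many next() calls; each next() either goes out as an RPC
>       // or is served from the client-side cache filled by a prior RPC (caching=500).
>       for (Result row = scanner.next(); row != null; row = scanner.next()) {
>         rows++;
>       }
>     }
>     System.out.println("rows scanned: " + rows);
>   }
> }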
>
>
> On Mon, May 23, 2016 at 2:19 PM Bryan Beaudreault <
> bbeaudrea...@hubspot.com> wrote:
>
>> We run MR against many tables in all of our clusters; they mostly have
>> similar schema definitions, though they vary in terms of key length, number of
>> columns, etc. This is the only cluster and only table where we've seen the leak
>> so far. It's probably the table with the biggest regions that we MR against,
>> though that's hard to verify (anyone in engineering can run such a job).
>>
>> dfs.client.read.shortcircuit.streams.cache.size = 256
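>>
>> If anyone wants to double-check the effective value on a given node, here is a
>> quick hypothetical snippet that just reads it back from whatever hdfs-site.xml
>> is on the classpath:
>>
>> import org.apache.hadoop.conf.Configuration;
>>
>> public class ShortCircuitCacheCheck {
>>   public static void main(String[] args) {
>>     Configuration conf = new Configuration();
>>     conf.addResource("hdfs-site.xml");  // pick up any site-level overrides on the classpath
>>     // Fallback values below are the stock hdfs-default.xml defaults.
>>     System.out.println("dfs.client.read.shortcircuit.streams.cache.size = "
>>         + conf.getInt("dfs.client.read.shortcircuit.streams.cache.size", 256));
>>     System.out.println("dfs.client.read.shortcircuit.streams.cache.expiry.ms = "
>>         + conf.getLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 300000L));
>>   }
>> }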
>>
>> Our typical FD count is around 3000. When this hadoop job runs, it
>> can climb up to our limit of over 30k if we don't act -- it is a gradual
>> buildup over the course of a couple of hours. When we move the regions off or
>> kill the job, the FDs gradually go back down at roughly the same pace.
>> The graph forms the shape of a pyramid.
>>
>> We don't use CM; we mostly use the default *-site.xml. We haven't
>> overridden anything related to this. The configs between CDH5.3.8 and 5.7.0
>> are identical for us.
>>
>> On Mon, May 23, 2016 at 2:03 PM Stack <st...@duboce.net> wrote:
>>
>>> On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault <
>>> bbeaudrea...@hubspot.com
>>> > wrote:
>>>
>>> > Hey everyone,
>>> >
>>> > We are noticing a file descriptor leak that is only affecting nodes in
>>> our
>>> > cluster running 5.7.0, not those still running 5.3.8.
>>>
>>>
>>> Translation: roughly hbase-1.2.0+hadoop-2.6.0 vs
>>> hbase-0.98.6+hadoop-2.5.0.
>>>
>>>
>>> > I ran an lsof against
>>> > an affected regionserver, and noticed that there were 10k+ unix sockets
>>> > that are just called "socket", as well as another 10k+ of the form
>>> > "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>". The
>>> > two seem related based on how closely the counts match.
>>> >
>>> > We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0 (we
>>> > handled the namenode upgrade separately).  The 5.3.8 nodes *do not*
>>> > experience this issue. The 5.7.0 nodes *do*. We are holding off upgrading
>>> > more regionservers until we can figure this out. I'm not sure if any
>>> > intermediate versions between the 2 have the issue.
>>> >
>>> > We traced the root cause to a hadoop job running against a basic table:
>>> >
>>> > 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
>>> > MEMSTORE_FLUSHSIZE => '67108864'}}, {NAME => '0', VERSIONS => '50',
>>> > BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
>>> > {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
>>> >
>>> > This is very similar to all of our other tables (we have many).
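>>> >
>>> > (For reference, the same schema expressed through the 1.x admin API -- a
>>> > sketch only, not how the table was actually created:)
>>> >
>>> > import org.apache.hadoop.hbase.HBaseConfiguration;
>>> > import org.apache.hadoop.hbase.HColumnDescriptor;
>>> > import org.apache.hadoop.hbase.HTableDescriptor;
>>> > import org.apache.hadoop.hbase.TableName;
>>> > import org.apache.hadoop.hbase.client.Admin;
>>> > import org.apache.hadoop.hbase.client.Connection;
>>> > import org.apache.hadoop.hbase.client.ConnectionFactory;
>>> > import org.apache.hadoop.hbase.io.compress.Compression;
>>> > import org.apache.hadoop.hbase.regionserver.BloomType;
>>> >
>>> > public class CreateMyTableSketch {
>>> >   public static void main(String[] args) throws Exception {
>>> >     HTableDescriptor table = new HTableDescriptor(TableName.valueOf("my-table-1"));
>>> >     table.setMaxFileSize(107374182400L);    // MAX_FILESIZE: split regions at ~100 GB
>>> >     table.setMemStoreFlushSize(67108864L);  // MEMSTORE_FLUSHSIZE: 64 MB
>>> >
>>> >     HColumnDescriptor family = new HColumnDescriptor("0");
>>> >     family.setMaxVersions(50);
>>> >     family.setBloomFilterType(BloomType.NONE);
>>> >     family.setCompressionType(Compression.Algorithm.LZO);
>>> >     family.setCompactionCompressionType(Compression.Algorithm.LZO); // COMPRESSION_COMPACT
>>> >     table.addFamily(family);
>>> >
>>> >     try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
>>> >          Admin admin = conn.getAdmin()) {
>>> >       admin.createTable(table);
>>> >     }
>>> >   }
>>> > }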
>>>
>>>
>>> You are doing MR against some of these also? They have different schemas?
>>> No leaks here?
>>>
>>>
>>>
>>> > However,
>>> > its regions are getting up there in size, 40+GB per region, compressed.
>>> > This has not been an issue for us previously.
>>> >
>>> > The hadoop job is a simple TableMapper job with no special parameters,
>>> > though we haven't updated our client yet to the latest (will do that
>>> once
>>> > we finish the server side). The hadoop job runs on a separate hadoop
>>> > cluster, remotely accessing the HBase cluster. It does not do any other
>>> > reads or writes, outside of the TableMapper scans.
>>> >
>>> > Moving the regions off of an affected server, or killing the hadoop
>>> job,
>>> > causes the file descriptors to gradually go back down to normal.
>>> >
>>> >
>>> > Any ideas?
>>> >
>>> >
>>> Is it just the FD cache running 'normally'? 10k seems like a lot though.
>>> 256 seems to be the default in HDFS, but maybe it is different in CM or in
>>> HBase?
>>>
>>> What is your dfs.client.read.shortcircuit.streams.cache.size set to?
>>> St.Ack
>>>
>>>
>>>
>>> > Thanks,
>>> >
>>> > Bryan
>>> >
>>>
>>
