For reference, the Scan backing the job is pretty basic:

Scan scan = new Scan();
scan.setCaching(500); // probably too small for the data size we're dealing with
scan.setCacheBlocks(false);
scan.setScanMetricsEnabled(true);
scan.setMaxVersions(1);
scan.setTimeRange(startTime, stopTime);
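For context only, here is a minimal sketch (not the actual job from this thread; the mapper class, output types, and job name are hypothetical placeholders) of how a Scan like the one above is typically handed to a TableMapper job via TableMapReduceUtil, which wires in TableInputFormat under the hood:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;

public class ExampleScanJob {
  // Hypothetical mapper; the real job's mapper is not shown in the thread.
  static class ExampleMapper extends TableMapper<ImmutableBytesWritable, NullWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context context) {
      // per-row work would go here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "example-scan-job");
    job.setJarByClass(ExampleScanJob.class);

    Scan scan = new Scan();
    scan.setCaching(500);
    scan.setCacheBlocks(false); // generally recommended for MR scans

    // Hands the scan to the stock TableInputFormat for the mapper.
    TableMapReduceUtil.initTableMapperJob(
        "my-table-1", scan, ExampleMapper.class,
        ImmutableBytesWritable.class, NullWritable.class, job);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}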
Otherwise it is using the out-of-the-box TableInputFormat.

On Mon, May 23, 2016 at 3:13 PM Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:

> I've forced the issue to happen again. netstat takes a while to run on
> this host while it's happening, but I do not see an abnormal amount of
> CLOSE_WAIT (compared to other hosts).
>
> I forced more than the usual number of regions for the affected table onto
> the host to speed up the process. File descriptors are now growing quite
> rapidly, about 8-10 per second.
>
> This is what lsof looks like, multiplied by a couple thousand:
>
> COMMAND  PID   USER  FD   TYPE  DEVICE  SIZE/OFF  NODE        NAME
> java     23180 hbase DEL  REG   0,16              3848784656  /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1702253823
> java     23180 hbase DEL  REG   0,16              3847643924  /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_1614925966
> java     23180 hbase DEL  REG   0,16              3847614191  /dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_377023711_1_888427288
>
> The only thing that varies is the last int on the end.
>
> > Anything about the job itself that is holding open references or
> > throwing away files w/o closing them?
>
> The MR job does a TableMapper directly against HBase, which as far as I
> know uses the HBase RPC and does not hit HDFS directly at all. Is it
> possible that a long-running scan (one with many, many next() calls) could
> keep some references to HDFS open for the duration of the overall scan?
>
> On Mon, May 23, 2016 at 2:19 PM Bryan Beaudreault <bbeaudrea...@hubspot.com> wrote:
>
>> We run MR against many tables in all of our clusters; they mostly have
>> similar schema definitions, though they vary in terms of key length,
>> # columns, etc. This is the only cluster and only table we've seen leak
>> so far. It's probably the table with the biggest regions which we MR
>> against, though it's hard to verify that (anyone in engineering can run
>> such a job).
>>
>> dfs.client.read.shortcircuit.streams.cache.size = 256
>>
>> Our typical FD count is around 3000. When this hadoop job runs, that
>> can climb up to our limit of over 30k if we don't act -- it is a gradual
>> build-up over the course of a couple hours. When we move the regions off
>> or kill the job, the FDs will gradually go back down at roughly the same
>> pace. It forms a graph in the shape of a pyramid.
>>
>> We don't use CM, we use mostly the default *-site.xml. We haven't
>> overridden anything related to this. The configs between CDH5.3.8 and
>> 5.7.0 are identical for us.
>>
>> On Mon, May 23, 2016 at 2:03 PM Stack <st...@duboce.net> wrote:
>>
>>> On Mon, May 23, 2016 at 9:55 AM, Bryan Beaudreault
>>> <bbeaudrea...@hubspot.com> wrote:
>>>
>>> > Hey everyone,
>>> >
>>> > We are noticing a file descriptor leak that is only affecting nodes
>>> > in our cluster running 5.7.0, not those still running 5.3.8.
>>>
>>> Translation: roughly hbase-1.2.0+hadoop-2.6.0 vs
>>> hbase-0.98.6+hadoop-2.5.0.
>>>
>>> > I ran an lsof against an affected regionserver, and noticed that there
>>> > were 10k+ unix sockets that are just called "socket", as well as
>>> > another 10k+ of the form
>>> > "/dev/shm/HadoopShortCircuitShm_DFSClient_NONMAPREDUCE_-<int>_1_<int>".
>>> > The 2 seem related based on how closely the counts match.
>>> >
>>> > We are in the middle of a rolling upgrade from CDH5.3.8 to CDH5.7.0
>>> > (we handled the namenode upgrade separately). The 5.3.8 nodes *do not*
>>> > experience this issue. The 5.7.0 nodes *do*. We are holding off
>>> > upgrading more regionservers until we can figure this out. I'm not
>>> > sure if any intermediate versions between the 2 have the issue.
>>> >
>>> > We traced the root cause to a hadoop job running against a basic table:
>>> >
>>> > 'my-table-1', {TABLE_ATTRIBUTES => {MAX_FILESIZE => '107374182400',
>>> > MEMSTORE_FLUSHSIZE => '67108864'}, {NAME => '0', VERSIONS => '50',
>>> > BLOOMFILTER => 'NONE', COMPRESSION => 'LZO', METADATA =>
>>> > {'COMPRESSION_COMPACT' => 'LZO', 'ENCODE_ON_DISK' => 'true'}}
>>> >
>>> > This is very similar to all of our other tables (we have many).
>>>
>>> You are doing MR against some of these also? They have different
>>> schemas? No leaks here?
>>>
>>> > However, its regions are getting up there in size, 40+ GB per region,
>>> > compressed. This has not been an issue for us previously.
>>> >
>>> > The hadoop job is a simple TableMapper job with no special parameters,
>>> > though we haven't updated our client yet to the latest (will do that
>>> > once we finish the server side). The hadoop job runs on a separate
>>> > hadoop cluster, remotely accessing the HBase cluster. It does not do
>>> > any other reads or writes, outside of the TableMapper scans.
>>> >
>>> > Moving the regions off of an affected server, or killing the hadoop
>>> > job, causes the file descriptors to gradually go back down to normal.
>>> >
>>> > Any ideas?
>>>
>>> Is it just the FD cache running 'normally'? 10k seems like a lot, though.
>>> 256 seems to be the default in hdfs but maybe it is different in CM or
>>> in hbase?
>>>
>>> What is your dfs.client.read.shortcircuit.streams.cache.size set to?
>>> St.Ack
>>>
>>> > Thanks,
>>> >
>>> > Bryan
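As an aside on the dfs.client.read.shortcircuit.streams.cache.size setting discussed above, a minimal sketch of checking what value a given client or regionserver configuration actually resolves to (the key is the standard HDFS one; the fallback of 256 matches the stock HDFS default referenced in the thread):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ShortCircuitCacheCheck {
  public static void main(String[] args) {
    // Loads hbase-site.xml plus any hdfs/core *-site.xml on the classpath.
    Configuration conf = HBaseConfiguration.create();
    // Falls back to 256, the stock HDFS default, if nothing overrides it.
    int cacheSize = conf.getInt("dfs.client.read.shortcircuit.streams.cache.size", 256);
    System.out.println("short-circuit streams cache size = " + cacheSize);
  }
}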