[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15251110#comment-15251110
 ] 

Karthik Kambatla commented on MAPREDUCE-6684:
---------------------------------------------

Was looking into this along with [~haibochen]. The two tarballs of jstacks 
attached here correspond to a monitoring process turned on/off; the monitoring 
process has threads connecting to the JHS and asking for details about a job. 
We have seen a few problems:
# JHS continues to process a request (scanning directories and fetching files) 
even after the requesting connection has terminated. 
# As Haibo mentioned, each thread scans the intermediate directory (and moves 
any available files to the done dir). This scan is protected by a lock on the 
intermediate directory. When there are multiple threads, all threads but one 
are blocked. A couple of problems with this:
## Threads that are trying to load a file that is already in the done dir also 
seem to be blocked on scanning the intermediate dir.
## Even though the thread holding the lock might process files required by 
other threads, those threads redo the scan once they acquire the lock 
(potentially moving other files they don't need). This leads to a longer 
turnaround time per thread and maxes out the number of threads (default of 256). 
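
To make the two sub-problems above concrete, here is a minimal sketch of one way to avoid them: check the done dir before taking the lock, and re-check it after acquiring the lock so that a waiting thread reuses the work of the previous lock holder. This is not the actual HistoryFileManager code, nor a proposed patch. The class and method names are hypothetical, and in-memory maps stand in for the intermediate and done directories.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical illustration only; none of these names exist in Hadoop.
public class ScanCoalescingSketch {
    // In-memory stand-ins for the intermediate and done directories (jobId -> file).
    private final Map<String, String> intermediate = new ConcurrentHashMap<>();
    private final Map<String, String> done = new ConcurrentHashMap<>();
    private final ReentrantLock scanLock = new ReentrantLock();
    private final AtomicInteger scansRun = new AtomicInteger();

    // Simulates a finished job writing its history file to the intermediate dir.
    public void addIntermediate(String jobId, String file) {
        intermediate.put(jobId, file);
    }

    // Returns the file for jobId, or null if it is in neither directory.
    public String lookup(String jobId) {
        // Fast path: a file already in the done dir needs no scan and no lock.
        String hit = done.get(jobId);
        if (hit != null) {
            return hit;
        }
        scanLock.lock();
        try {
            // Re-check: the previous lock holder's scan may have moved our file.
            hit = done.get(jobId);
            if (hit != null) {
                return hit;
            }
            scanIntermediate();
            return done.get(jobId);
        } finally {
            scanLock.unlock();
        }
    }

    // Moves every file found in the intermediate dir to the done dir.
    // Only called while holding scanLock.
    private void scanIntermediate() {
        scansRun.incrementAndGet();
        for (Map.Entry<String, String> e : intermediate.entrySet()) {
            done.put(e.getKey(), e.getValue());
            intermediate.remove(e.getKey());
        }
    }

    // Number of full scans performed so far (for illustration and testing).
    public int scansRun() {
        return scansRun.get();
    }
}
```

In this sketch a request for a job that exists in neither directory still triggers a full scan every time, so it only addresses the two sub-problems listed above, not the cost of the scan itself.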

> High contention on scanning of user directory under immediate_done in Job 
> History Server
> ----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6684
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.7.0
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>         Attachments: jhs-jstacks-service-monitor-running.tar.gz, 
> jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each 
> user directory it tries to scan (moving or deleting files under the user 
> directory as necessary). This method is called by a thread in JobHistory that 
> periodically scans the intermediate directory, and can also be called by web 
> server threads for each Web API call made by a JHS client. When there are 
> many concurrent Web API calls/connections to JHS, all but one thread are 
> blocked on the lock on the user directory. Eventually, the client 
> connections time out, but the threads in JHS are not killed, which leaves 
> a lot of TCP connections in CLOSE_WAIT state. 
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp        0      0 10.17.202.19:10020          0.0.0.0:*                   
> LISTEN      63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.198.30:33010          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.200.30:33980          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.202.10:59625          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.202.13:35765          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10033          0.0.0.0:*                   
> LISTEN      63729/java          
> tcp        0      0 10.17.202.19:19888          0.0.0.0:*                   
> LISTEN      63729/java          
> tcp        0      0 10.17.202.19:19888          10.17.198.30:35103          
> ESTABLISHED 63729/java          
> tcp      277      0 10.17.202.19:19888          10.17.198.30:43670          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:19888          10.17.198.30:45453          
> ESTABLISHED 63729/java          
> tcp      277      0 10.17.202.19:19888          10.17.198.30:49184          
> ESTABLISHED 63729/java          
> tcp        1      0 10.17.202.19:19888          10.17.202.13:49992          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52703          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52707          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52708          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52710          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52714          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52723          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52726          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52727          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52739          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52749          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52753          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52757          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52760          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52820          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52827          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52829          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52831          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52833          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52836          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52839          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52841          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52843          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52850          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52860          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52876          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52879          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52881          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52884          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52886          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52888          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52891          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52893          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52896          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52898          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52899          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52902          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52909          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52910          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52912          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52923          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52925          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52927          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52930          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52937          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52939          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52945          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52947          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52969          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52972          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52975          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53004          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53007          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53009          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53011          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53052          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53058          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53059          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53063          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53071          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53084          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53093          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53095          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53097          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53101          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53104          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53106          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53108          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53110          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53112          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53114          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53115          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53117          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53121          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53123          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53125          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53127          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53129          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53131          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53134          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53138          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53140          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53153          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53155          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53157          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53159          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53173          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53176          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53177          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53178          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53179          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53181          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53183          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53201          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53204          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53218          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53267          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53270          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53275          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53278          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53280          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53283          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53293          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53296          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53299          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53309          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53312          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53314          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53317          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53320          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53322          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53338          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53340          
> CLOSE_WAIT  63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53364          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53366          
> ESTABLISHED 63729/java          
> tcp      260      0 10.17.202.19:19888          10.17.202.13:53367          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53380          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53382          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53386          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53390          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53392          
> ESTABLISHED 63729/java          
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45301          
> CLOSE_WAIT  63729/java          
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45303          
> CLOSE_WAIT  63729/java          
> tcp     1277      0 10.17.202.19:19888          10.17.202.18:45306          
> ESTABLISHED 63729/java 
> {noformat}



