[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258190#comment-15258190
 ] 

Jason Lowe commented on MAPREDUCE-6684:
---------------------------------------

We need to get to the bottom of why the lock is being held so long. The move 
itself is done via an executor pool so that shouldn't be slow.

According to the jstacks, every single one has this stacktrace for the thread 
that has the lock on the user's intermediate directory:
{noformat}
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.didMoveFail(HistoryFileManager.java:356)
        - waiting to lock <0x000000056106dd80> (a 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
        at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:873)
        at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.access$200(HistoryFileManager.java:82)
        at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir.scanIfNeeded(HistoryFileManager.java:315)
        - locked <0x00000004fae96ee0> (a 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$UserLogDir)
        at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.scanIntermediateDirectory(HistoryFileManager.java:839)
        at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.getFileInfo(HistoryFileManager.java:978)
        at 
org.apache.hadoop.mapreduce.v2.hs.CachedHistoryStorage.loadJob(CachedHistoryStorage.java:174)
[...]
{noformat}

So the thread has the lock, but it's waiting a long time on some other thread 
which isn't ideal.  Every jstack shows the other thread doing this:
{noformat}
"473394034@qtp-1617102129-245" daemon prio=10 tid=0x00007f66c8185000 nid=0x6437 
runnable [0x00007f6693b6c000]
   java.lang.Thread.State: RUNNABLE
      at java.util.HashMap.put(HashMap.java:494)
      at 
org.apache.avro.util.WeakIdentityHashMap.put(WeakIdentityHashMap.java:108)
      at 
org.apache.avro.LogicalTypes.fromSchemaIgnoreInvalid(LogicalTypes.java:62)
      at org.apache.avro.Schema.parse(Schema.java:1318)
      at org.apache.avro.Schema.parse(Schema.java:1260)
      at org.apache.avro.Schema.parse(Schema.java:1331)
      at org.apache.avro.Schema.parse(Schema.java:1260)
      at org.apache.avro.Schema$Parser.parse(Schema.java:1024)
      at org.apache.avro.Schema$Parser.parse(Schema.java:1012)
      at org.apache.avro.Schema.parse(Schema.java:1064)
      at 
org.apache.hadoop.mapreduce.jobhistory.EventReader.<init>(EventReader.java:71)
      at 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser.parse(JobHistoryParser.java:139)
      - locked <0x00000005a05fd4e8> (a 
org.apache.hadoop.mapreduce.jobhistory.JobHistoryParser)
      at 
org.apache.hadoop.mapreduce.v2.hs.CompletedJob.loadFullHistoryData(CompletedJob.java:339)
      - locked <0x0000000577bd9b98> (a 
org.apache.hadoop.mapreduce.v2.hs.CompletedJob)
      at 
org.apache.hadoop.mapreduce.v2.hs.CompletedJob.<init>(CompletedJob.java:101)
      at 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo.loadJob(HistoryFileManager.java:451)
      - locked <0x000000056106dd80> (a 
org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager$HistoryFileInfo)
[...]
{noformat}

So I don't think any of the above three proposals will fix the issue 
completely.  The core issue is the scan being held up for an arbitrarily long 
jhist parse.  I suspect things will get a _lot_ better if we can decouple these 
two processes.  I haven't gone through all the race conditions, but I'm 
wondering if we can simply make the state field volatile and allow the 
didMoveFail and isMovePending methods to avoid grabbing the lock.

> High contention on scanning of user directory under immediate_done in Job 
> History Server
> ----------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6684
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6684
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobhistoryserver
>    Affects Versions: 2.7.0
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>         Attachments: jhs-jstacks-service-monitor-running.tar.gz, 
> jhs-jstacks-service-monitor-stopped.tar.gz
>
>
> HistoryFileManager.scanIntermediateDirectory() in JHS acquires a lock on each 
> user directory it tries to scan (move or delete files under the user 
> directory as necessary). This method is called in a thread in JobHistory that 
> performs periodical scanning of intermediate directory, and can also be 
> called by web server threads for each Web API call made by a JHS client. In 
> cases where there are many concurrent Web API calls/connections to JHS, all 
> but one thread are blocked on the lock on the user directory. Eventually, 
> client connects will time out, but the threads in JHS will not be killed and 
> leave a lot of TCP connections in CLOSE_WAIT state. 
> {noformat}
> [systest@vb1120 ~]$ sudo netstat -nap | grep 63729 | sort -k 4
> tcp        0      0 10.17.202.19:10020          0.0.0.0:*                   
> LISTEN      63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.198.30:33010          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.200.30:33980          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.202.10:59625          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10020          10.17.202.13:35765          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:10033          0.0.0.0:*                   
> LISTEN      63729/java          
> tcp        0      0 10.17.202.19:19888          0.0.0.0:*                   
> LISTEN      63729/java          
> tcp        0      0 10.17.202.19:19888          10.17.198.30:35103          
> ESTABLISHED 63729/java          
> tcp      277      0 10.17.202.19:19888          10.17.198.30:43670          
> ESTABLISHED 63729/java          
> tcp        0      0 10.17.202.19:19888          10.17.198.30:45453          
> ESTABLISHED 63729/java          
> tcp      277      0 10.17.202.19:19888          10.17.198.30:49184          
> ESTABLISHED 63729/java          
> tcp        1      0 10.17.202.19:19888          10.17.202.13:49992          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52703          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52707          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52708          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52710          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52714          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52723          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52726          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52727          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52739          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52749          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52753          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52757          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52760          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52820          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52827          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52829          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52831          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52833          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52836          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52839          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52841          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52843          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52850          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52860          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52876          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52879          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52881          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52884          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52886          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52888          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52891          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52893          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52896          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52898          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52899          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52902          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52909          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52910          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52912          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52923          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52925          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52927          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52930          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52937          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52939          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52945          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52947          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52969          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:52972          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:52975          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53004          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53007          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53009          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53011          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53052          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53058          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53059          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53063          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53071          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53084          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53093          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53095          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53097          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53101          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53104          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53106          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53108          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53110          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53112          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53114          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53115          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53117          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53121          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53123          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53125          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53127          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53129          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53131          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53134          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53138          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53140          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53153          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53155          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53157          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53159          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53173          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53176          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53177          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53178          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53179          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53181          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53183          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53201          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53204          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53218          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53267          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53270          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53275          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53278          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53280          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53283          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53293          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53296          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53299          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53309          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53312          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53314          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53317          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53320          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53322          
> CLOSE_WAIT  63729/java          
> tcp      256      0 10.17.202.19:19888          10.17.202.13:53338          
> CLOSE_WAIT  63729/java          
> tcp      261      0 10.17.202.19:19888          10.17.202.13:53340          
> CLOSE_WAIT  63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53364          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53366          
> ESTABLISHED 63729/java          
> tcp      260      0 10.17.202.19:19888          10.17.202.13:53367          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53380          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53382          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53386          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53390          
> ESTABLISHED 63729/java          
> tcp      255      0 10.17.202.19:19888          10.17.202.13:53392          
> ESTABLISHED 63729/java          
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45301          
> CLOSE_WAIT  63729/java          
> tcp     1278      0 10.17.202.19:19888          10.17.202.18:45303          
> CLOSE_WAIT  63729/java          
> tcp     1277      0 10.17.202.19:19888          10.17.202.18:45306          
> ESTABLISHED 63729/java 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to