[ https://issues.apache.org/jira/browse/YARN-10207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17071542#comment-17071542 ]

Siddharth Ahuja commented on YARN-10207:
----------------------------------------

Hi [~adam.antal], thanks for your comments.

The leak happens when AggregatedLogFormat.LogReader fails during instantiation in 
AggregatedLogFormat.java, specifically when the TFile.Reader created inside 
AggregatedLogFormat.LogReader's constructor fails because a corrupted file was 
passed in (see the stacktrace above).

The FSDataInputStream that the constructor has already opened is never closed in 
that case, and that is what causes the leak.

The caller, TFileAggregatedLogsBlock.render(…), does try to clean up the reader 
in its finally clause (see 
https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L153), 
but that only works if the reader was created successfully. In our case the 
reader never gets created because its constructor fails on the corrupted log, so 
there is nothing for that finally clause to close.
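To illustrate, here is a simplified sketch of the constructor (not the exact code; the point is that the stream is opened before the TFile.Reader is built):

{code}
// Simplified sketch of AggregatedLogFormat.LogReader's constructor:
public LogReader(Configuration conf, Path remoteAppLogFile) throws IOException {
  FileContext fileContext =
      FileContext.getFileContext(remoteAppLogFile.toUri(), conf);
  // The stream (and the underlying socket to the DataNode) is opened here...
  this.fsDataIStream = fileContext.open(remoteAppLogFile);
  // ...but a corrupted file makes this throw "Not a valid BCFile.", so the
  // constructor never completes, the caller's "reader" reference stays null,
  // and the caller's finally{} has nothing to close -> CLOSE_WAIT leak.
  reader = new TFile.Reader(this.fsDataIStream,
      fileContext.getFileStatus(remoteAppLogFile).getLen(), conf);
  this.scanner = reader.createScanner();
}
{code}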

The fix, therefore, is to catch any IOException within the 
AggregatedLogFormat.LogReader constructor itself, close all the relevant 
resources (including the FSDataInputStream) if we do hit one, and rethrow the 
exception to the caller (TFileAggregatedLogsBlock.render) so that it can catch 
and log it 
(https://github.com/apache/hadoop/blob/460ba7fb14114f44e14a660f533f32c54e504478/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/tfile/TFileAggregatedLogsBlock.java#L150).

This ensures that we don't leak connections whenever the reader fails to 
instantiate (i.e. new AggregatedLogFormat.LogReader throws).
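In other words, the constructor becomes self-cleaning on failure. A rough sketch of the shape of the change (the attached patch has the authoritative version):

{code}
// Sketch only - see the attached patch for the actual change.
public LogReader(Configuration conf, Path remoteAppLogFile) throws IOException {
  try {
    FileContext fileContext =
        FileContext.getFileContext(remoteAppLogFile.toUri(), conf);
    this.fsDataIStream = fileContext.open(remoteAppLogFile);
    reader = new TFile.Reader(this.fsDataIStream,
        fileContext.getFileStatus(remoteAppLogFile).getLen(), conf);
    this.scanner = reader.createScanner();
  } catch (IOException ioe) {
    // Close scanner, reader and fsDataIStream - whichever were already opened -
    // before propagating the failure to the caller.
    close();
    throw new IOException("Error in creating LogReader", ioe);
  }
}
{code}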

Based on your feedback, I performed functional testing with IndexedFormat 
(IFile) by setting the following properties inside yarn-site.xml:
{code}
    <property>
        <name>yarn.log-aggregation.file-formats</name>
        <value>IndexedFormat</value>
    </property>
    <property>
        <name>yarn.log-aggregation.file-controller.IndexedFormat.class</name>
        <value>org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController</value>
    </property>
    <property>
        <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir</name>
        <value>/tmp/ifilelogs</value>
    </property>
    <property>
        <name>yarn.log-aggregation.IndexedFormat.remote-app-log-dir-suffix</name>
        <value>ifilelogs</value>
    </property>
{code}

Like the earlier scenario, I corrupted the IFile (the aggregated log in HDFS) and 
tried to render it in the JHS Web UI; however, no leaks were found in this case.

The call happens in this fashion:

IndexedFileAggregatedLogsBlock.render() -> LogAggregationIndexedFileController.loadIndexedLogsMeta(…)

An IOException is encountered inside the try block there; however, notice the 
finally clause -> 
https://github.com/apache/hadoop/blob/4af2556b48e01150851c7f273a254a16324ba843/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/LogAggregationIndexedFileController.java#L900. 
It cleans up the socket connection by closing the FSDataInputStream.
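Roughly, the IFile reading code follows this shape (simplified sketch, not the exact method body):

{code}
// Simplified shape of loadIndexedLogsMeta(...) - the stream is closed in the
// finally{} even when parsing the (corrupted) file throws an IOException.
FSDataInputStream fsDataIStream = null;
try {
  fsDataIStream = fileContext.open(remoteLogPath);
  // ... seek to the index section and deserialize the log metadata;
  // a corrupted file makes this throw an IOException ...
} finally {
  IOUtils.closeQuietly(fsDataIStream);
}
{code}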
 

You will notice that this is a different call stack to the TFile case, as there 
is no call to AggregatedLogFormat.LogReader, i.e. it is coded differently. 
Regardless, thanks to that finally clause it does clean up the connection, and 
there are no CLOSE_WAIT leaks when a corrupted log file is encountered. (The bad 
thing here is that only a WARN log is presented to the user in the JHS logs when 
rendering fails for IFile logs, and the exception's stacktrace is not logged - 
https://github.com/apache/hadoop/blob/c24af4b0d6fc32938b076161b5a8c86d38e3e0a1/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/logaggregation/filecontroller/ifile/IndexedFileAggregatedLogsBlock.java#L136 
- because the exception is swallowed inside the catch clause. This may warrant a 
separate JIRA.)
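(For that separate JIRA, the follow-up would presumably just be to pass the caught exception into the log call so its stacktrace is preserved; illustrative snippet only, the actual message and variable names in IndexedFileAggregatedLogsBlock differ:)

{code}
// Illustrative only - not the exact code:
} catch (IOException ex) {
  // Passing "ex" as the last argument makes the logger print the stacktrace
  // instead of silently swallowing it.
  LOG.warn("Could not load log meta from " + nodeFile.getPath(), ex);
}
{code}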

As part of this fix, I looked for any other occurrences of "new TFile.Reader" 
that may cause connection leaks elsewhere. I found two:
# TFileDumper, see https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L103, and
# FileSystemApplicationHistoryStore, see https://github.com/apache/hadoop/blob/7dac7e1d13eaf0eac04fe805c7502dcecd597979/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/applicationhistoryservice/FileSystemApplicationHistoryStore.java#L691

Item 1 is not an issue because the FSDataInputStream is closed inside the finally 
clause here: 
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/file/tfile/TFileDumper.java#L293

I had concerns regarding item 2, so I performed functional testing by adding the 
settings below to yarn-site.xml:

{code}
<property>
  <name>yarn.timeline-service.generic-application-history.enabled</name>
  <value>true</value>
</property>

<property>
   <name>yarn.timeline-service.generic-application-history.store-class</name>
   <value>org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore</value>
</property>
{code}

The ApplicationHistoryServer started successfully. I corrupted a log file in HDFS 
and got errors similar to the one below:

{code}
2020-03-30 23:20:58,405 ERROR applicationhistoryservice.ApplicationHistoryClientService (ApplicationHistoryClientService.java:getApplicationReport(178)) - Not a valid BCFile.
java.io.IOException: Not a valid BCFile.
        at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
        at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
        at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore$HistoryFileReader.<init>(FileSystemApplicationHistoryStore.java:693)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileReader(FileSystemApplicationHistoryStore.java:669)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getApplication(FileSystemApplicationHistoryStore.java:152)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryManagerImpl.getApplication(ApplicationHistoryManagerImpl.java:118)
        at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryClientService.getApplicationReport(ApplicationHistoryClientService.java:175)
…
{code}

However, I did not find any CLOSE_WAITs on the ApplicationHistoryServer process. 
From code inspection it wasn't obvious to me why there weren't any leaks, but 
then again I am not well-versed with AHS, so this could be taken up in a separate 
JIRA by someone who has more experience with this component.

For the specific issue discussed in this JIRA, however, the attached patch fixes 
it. I will let you review it.


> CLOSE_WAIT socket connection leaks during rendering of (corrupted) aggregated logs on the JobHistoryServer Web UI
> -----------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-10207
>                 URL: https://issues.apache.org/jira/browse/YARN-10207
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>            Reporter: Siddharth Ahuja
>            Assignee: Siddharth Ahuja
>            Priority: Major
>         Attachments: YARN-10207.001.patch
>
>
> File descriptor leaks are observed coming from the JobHistoryServer process 
> while it tries to render a "corrupted" aggregated log on the JHS Web UI.
> Issue reproduced using the following steps:
> # Ran a sample Hadoop MR Pi job, it had the id - 
> application_1582676649923_0026.
> # Copied an aggregated log file from HDFS to local FS:
> {code}
> hdfs dfs -get /tmp/logs/systest/logs/application_1582676649923_0026/<nm_hostname>_8041
> {code}
> # Updated the TFile metadata at the bottom of this file with some junk to 
> corrupt the file :
> *Before:*
> {code}
>                       
> ^@^GVERSION*(^@&container_1582676649923_0026_01_000003^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáP
> {code}
> *After:*
> {code}                
>                       
> ^@^GVERSION*(^@&container_1582676649923_0026_01_000003^F^Dnone^A^Pª5²ª5²^C^Qdata:BCFile.index^Dnoneª5þ^M^M^Pdata:TFile.index^Dnoneª5È66^Odata:TFile.meta^Dnoneª5Â^F^F^@^@^@^@^@^B6^K^@^A^@^@Ñ^QÓh<91>µ×¶9ßA@<92>ºáPblah
> {code}
> Notice "blah" (junk) added at the very end.
> # Remove the existing aggregated log file that will need to be replaced by 
> our modified copy from step 3 (as otherwise HDFS will prevent it from placing 
> the file with the same name as it already exists):
> {code}
> hdfs dfs -rm -r -f /tmp/logs/systest/logs/application_1582676649923_0026/<nm_hostname>_8041
> {code}
> # Upload the corrupted aggregated file back to HDFS:
> {code}
> hdfs dfs -put <nm_hostname>_8041 /tmp/logs/systest/logs/application_1582676649923_0026
> {code}
> # Visit HistoryServer Web UI
> # Click on job_1582676649923_0026
> # Click on "logs" link against the AM (assuming the AM ran on nm_hostname)
> # Review the JHS logs, following exception will be seen:
> {code}
>       2020-03-24 20:03:48,484 ERROR org.apache.hadoop.yarn.webapp.View: Error getting logs for job_1582676649923_0026
>       java.io.IOException: Not a valid BCFile.
>               at org.apache.hadoop.io.file.tfile.BCFile$Magic.readAndVerify(BCFile.java:927)
>               at org.apache.hadoop.io.file.tfile.BCFile$Reader.<init>(BCFile.java:628)
>               at org.apache.hadoop.io.file.tfile.TFile$Reader.<init>(TFile.java:804)
>               at org.apache.hadoop.yarn.logaggregation.AggregatedLogFormat$LogReader.<init>(AggregatedLogFormat.java:588)
>               at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.TFileAggregatedLogsBlock.render(TFileAggregatedLogsBlock.java:111)
>               at org.apache.hadoop.yarn.logaggregation.filecontroller.tfile.LogAggregationTFileController.renderAggregatedLogsBlock(LogAggregationTFileController.java:341)
>               at org.apache.hadoop.yarn.webapp.log.AggregatedLogsBlock.render(AggregatedLogsBlock.java:117)
>               at org.apache.hadoop.yarn.webapp.view.HtmlBlock.render(HtmlBlock.java:69)
>               at org.apache.hadoop.yarn.webapp.view.HtmlBlock.renderPartial(HtmlBlock.java:79)
>               at org.apache.hadoop.yarn.webapp.View.render(View.java:235)
>               at org.apache.hadoop.yarn.webapp.view.HtmlPage$Page.subView(HtmlPage.java:49)
>               at org.apache.hadoop.yarn.webapp.hamlet2.HamletImpl$EImp._v(HamletImpl.java:117)
>               at org.apache.hadoop.yarn.webapp.hamlet2.Hamlet$TD.__(Hamlet.java:848)
>               at org.apache.hadoop.yarn.webapp.view.TwoColumnLayout.render(TwoColumnLayout.java:71)
>               at org.apache.hadoop.yarn.webapp.view.HtmlPage.render(HtmlPage.java:82)
>               at org.apache.hadoop.yarn.webapp.Controller.render(Controller.java:212)
>               at org.apache.hadoop.mapreduce.v2.hs.webapp.HsController.logs(HsController.java:202)
>               at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>               at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>               at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>               at java.lang.reflect.Method.invoke(Method.java:498)
>               at org.apache.hadoop.yarn.webapp.Dispatcher.service(Dispatcher.java:162)
>               at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>               at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:287)
>               at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:277)
>               at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:182)
>               at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>               at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
>               at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:941)
>               at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:875)
>               at com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:829)
>               at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
>               at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:119)
>               at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:133)
>               at com.google.inject.servlet.GuiceFilter$1.call(GuiceFilter.java:130)
>               at com.google.inject.servlet.GuiceFilter$Context.call(GuiceFilter.java:203)
>               at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:130)
>               at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
>               at org.apache.hadoop.security.http.XFrameOptionsFilter.doFilter(XFrameOptionsFilter.java:57)
>               at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
>               at org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:110)
>               at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
>               at org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1553)
>               at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
>               at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
>               at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
>               at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
>               at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>               at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
>               at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
>               at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1180)
>               at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:513)
>               at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>               at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1112)
>               at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>               at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:119)
>               at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
>               at org.eclipse.jetty.server.Server.handle(Server.java:539)
>               at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:333)
>               at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:251)
>               at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>               at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>               at org.eclipse.jetty.io.ssl.SslConnection.onFillable(SslConnection.java:259)
>               at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:283)
>               at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:108)
>               at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
>               at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
>               at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
>               at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
>               at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
>               at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
>               at java.lang.Thread.run(Thread.java:748)
> {code}
> # Run lsof -p 13224 | grep CLOSE_WAIT
> {code}
> java    13224 mapred  231u     IPv4          138429719       0t0       TCP nm_hostname:53040-><datanode_hostname>:9866 (CLOSE_WAIT)
> {code}
> where 9866 is the Datanode port in my cluster and 13224 is the 
> JobHistoryServer process id.
>       
> As long as I repeat steps 6, 7 and 8 (in order, collectively), the CLOSE_WAIT 
> instances keep increasing.


