[ https://issues.apache.org/jira/browse/HIVE-24569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Stamatis Zampetakis reassigned HIVE-24569: ------------------------------------------ > LLAP daemon leaks file descriptors/log4j appenders > -------------------------------------------------- > > Key: HIVE-24569 > URL: https://issues.apache.org/jira/browse/HIVE-24569 > Project: Hive > Issue Type: Bug > Components: llap > Affects Versions: 2.2.0 > Reporter: Stamatis Zampetakis > Assignee: Stamatis Zampetakis > Priority: Major > Fix For: 4.0.0 > > Attachments: llap-appender-gc-roots.png > > > With HIVE-9756 query logs in LLAP are directed to different files (file per > query) using a Log4j2 routing appender. Without a purge policy in place, > appenders are created dynamically by the routing appender, one for each > query, and remain in memory forever. The dynamic appenders write to files so > each appender holds to a file descriptor. > Further work HIVE-14224 has mitigated the issue by introducing a custom > purging policy (LlapRoutingAppenderPurgePolicy) which deletes the dynamic > appenders (and closes the respective files) when the query is completed > (org.apache.hadoop.hive.llap.daemon.impl.QueryTracker#handleLogOnQueryCompletion). > > However, in the presence of multiple threads appending to the logs there are > race conditions. In an internal Hive cluster the number of file descriptors > started going up approx one descriptor leaking per query. After some > debugging it turns out that one thread (running the > QueryTracker#handleLogOnQueryCompletion) signals that the query has finished > and thus the purge policy should get rid of the respective appender (and > close the file) while another (Task-Executor-0) attempts to append another > log message for the same query. The initial appender is closed after the > request from the query tracker but a new one is created to accomodate the > message from the task executor and the latter is never removed thus creating > a leak. > Similar leaks have been identified and fixed for HS2 with the most similar > one being that described > [here|https://issues.apache.org/jira/browse/HIVE-22753?focusedCommentId=17021041&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021041]. > > The problem relies on the timing of threads so it may not manifestate in all > versions between 2.2.0 and 4.0.0. Usually the leak can be seen either via > lsof (or other similar command) with the following output: > {noformat} > # 1494391 is the PID of the LLAP daemon process > ls -ltr /proc/1494391/fd > ... > lrwx------ 1 hive hadoop 64 Dec 24 12:08 978 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121724_66ce273d-54a9-4dcd-a9fb-20cb5691cef7-dag_1608659125567_0008_194.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 977 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121804_ce53eeb5-c73f-4999-b7a4-b4dd04d4e4de-dag_1608659125567_0008_197.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 974 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224122002_1693bd7d-2f0e-4673-a8d1-b7cb14a02204-dag_1608659125567_0008_204.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 989 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121909_6a56218f-06c7-4906-9907-4b6dd824b100-dag_1608659125567_0008_201.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 984 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121754_78ef49a0-bc23-478f-9a16-87fa25e7a287-dag_1608659125567_0008_196.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 983 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121855_e65b9ebf-b2ec-4159-9570-1904442b7048-dag_1608659125567_0008_200.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 981 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121818_e9051ae3-1316-46af-aabb-22c53ed2fda7-dag_1608659125567_0008_198.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 980 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121744_fcf37921-4351-4368-95ee-b5be2592d89a-dag_1608659125567_0008_195.log > lrwx------ 1 hive hadoop 64 Dec 24 12:08 979 -> > /hadoop/yarn/log/application_1608659125567_0006/container_e04_1608659125567_0006_01_000002/hive_20201224121837_e80c0024-f6bc-4b3c-85ed-5c0c85c55787-dag_1608659125567_0008_199.log > {noformat} > or in the heap dump with many appenders (in my case {{LlapWrappedAppender}}) > holding indirectly open file descriptors. -- This message was sent by Atlassian Jira (v8.3.4#803005)