[ 
https://issues.apache.org/jira/browse/HADOOP-18091?focusedWorklogId=718543&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-718543
 ]

ASF GitHub Bot logged work on HADOOP-18091:
-------------------------------------------

                Author: ASF GitHub Bot
            Created on: 01/Feb/22 10:29
            Start Date: 01/Feb/22 10:29
    Worklog Time Spent: 10m 
      Work Description: mukund-thakur commented on a change in pull request 
#3930:
URL: https://github.com/apache/hadoop/pull/3930#discussion_r796426549



##########
File path: 
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/impl/TestActiveAuditManagerThreadLeakage.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.audit.impl;
+
+import java.io.IOException;
+import java.lang.ref.WeakReference;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Consumer;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.impl.WeakReferenceThreadMap;
+import org.apache.hadoop.fs.s3a.audit.AuditSpanS3A;
+import org.apache.hadoop.fs.s3a.audit.MemoryHungryAuditor;
+import org.apache.hadoop.test.AbstractHadoopTestBase;
+
+import static 
org.apache.hadoop.fs.s3a.audit.S3AAuditConstants.AUDIT_SERVICE_CLASSNAME;
+import static 
org.apache.hadoop.fs.s3a.audit.impl.ActiveAuditManagerS3A.PRUNE_THRESHOLD;
+import static 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.emptyStatisticsStore;
+
+/**
+ * This test attempts to recreate the OOM problems of
+ * HADOOP-18091. S3A auditing leaks memory through ThreadLocal references
+ * it does this by creating a thread pool, then
+ * creates and destroys FS instances, with threads in
+ * the pool (but not the main JUnit test thread) creating
+ * audit spans.
+ *
+ * With a custom audit span with a large memory footprint,
+ * memory demands will be high, and if the closed instances
+ * don't get cleaned up, memory runs out.
+ * GCs are forced.
+ * It is critical no spans are created in the junit thread because they will
+ * last for the duration of the test JVM.
+ */
+@SuppressWarnings("ResultOfMethodCallIgnored")
+public class TestActiveAuditManagerThreadLeakage extends 
AbstractHadoopTestBase {
+  /**
+   * Logging.
+   */
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestActiveAuditManagerThreadLeakage.class);
+
+  /** how many managers to sequentially create. */
+  private static final int MANAGER_COUNT = 500;
+
+  /** size of long lived hread pool. */

Review comment:
       typo : thread pool.

##########
File path: 
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/auditing_architecture.md
##########
@@ -119,16 +119,81 @@ The auditor then creates and returns a span for the 
specific operation.
 The AuditManagerS3A will automatically activate the span returned by the 
auditor
 (i.e. assign it the thread local variable tracking the active span in each 
thread).
 
-### Memory Leakage through `ThreadLocal` use
+### Memory Leakage through `ThreadLocal` misuse
 
-This architecture contains a critical defect,
+The original implementation of the integration with the S3AFileSystem class
+contained a critical defect,
 [HADOOP-18091](https://issues.apache.org/jira/browse/HADOOP-18091) _S3A 
auditing leaks memory through ThreadLocal references_.
 
-The code was written assuming that when the `ActiveAuditManagerS3A` service is
-stopped, it's `ThreadLocal` fields would be freed.
-In fact, they are retained until the threads with references are terminated.
+The original code was written with the assumption that when the 
`ActiveAuditManagerS3A` service was
+garbage collected, references in its `ThreadLocal` field would be freed.
+In fact, they are retained until all threads with references are terminated.
+If any long-lived thread had performed an s3 operation which created a span,
+a reference back to the audit manager instance was created 
+*whose lifetime was that of the thread*
+
+In short-lived processes, and long-lived processes where a limited set of
+`S3AFileSystem` instances were reused, this had no adverse effect.
+Indeed, if the filesystem instances were retained in the cache until
+the process was shut down, there would be strong references to the owning
+`S3AFileSystem` instance anyway.
+
+Where it did have problems was when the following conditions were met
+1. Process was long-lived.
+2. Long-lived threads in the process invoked filesystem operations on `s3a://` 
URLs.
+3. Either `S3AFileSystem` instances were created repeatedly, rather than 
retrieved
+   from the cache of active instances.
+4. Or, after a query for a specific user was completed,
+   `FileSystem.closeAllForUGI(UserGroupInformation)` was invoked to remove all
+   cached FS instances of that user. 
+
+Conditions 1, 2 and 4 are exactly those which long-lived Hive services can
+encounter.
+
+_Auditing was disabled by default until a fix was implemented._
+
+The memory leak has been fixed using a new class 
`org.apache.hadoop.util.WeakReferenceMap`
+to store a map of thread IDs to active spans. When the S3A fileystem is closed,
+its audit manager service is stopped and all references to spans removed from 
the
+map of thread ID to span.
+
+Weak References are used for the map so that span references do not consume 
memory even if
+threads are terminated without resetting the span reference of that thread.
+There is therefore a theoretical risk that if a garbage collection takes place 
during
+execution of a spanned operation, the reference will be lost.
+
+This is not considered an issue as all bounded entry points into the S3A 
filesystem
+retain a strong reference to their audit span. 
+
+All entry points which return an object which can invoke s3 operations (input 
and output
+streams, list iterators, etc.) also retain a strong reference to their span, a 
reference
+they activate before performing S3 operations.
+A factory method is provided to demand-generate a new span if, somehow, these 
conditions
+are not met. The "unbounded span" is used here.
+Except in deployments where `fs.s3a.audit.reject.out.of.span.operations` is 
true,
+invoking S3 operations within the unbounded span are permitted.
+That option is set to `true` within S3A test suites.
+Therefore it is unlikely that any operations are invoked in unbounded spans 
except
+for the special case of copy operations invoked by the transfer manager 
threads. 
+Those are already ignored in the logging auditor, whose unbounded span ignores
+requests which `AWSRequestAnalyzer.isRequestNotAlwaysInSpan()` indicates
+may happen outside of a span.
+This is restricted to bucket location probes possibly performed by the SDK
+on instantiatio, and copy part/complete calls.
+
+
+```java
+  public static boolean
+      isRequestNotAlwaysInSpan(final Object request) {
+    return request instanceof CopyPartRequest
+        || request instanceof CompleteMultipartUploadRequest
+        || request instanceof GetBucketLocationRequest;
+  }
+```
+
+Auditing is still disabled by default for consistency with previous releases.

Review comment:
       I thought auditing is enabled as the default value is changed?

##########
File path: 
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/auditing_architecture.md
##########
@@ -149,6 +214,39 @@ Spans may be used on different thread from that which they 
were created.
 Spans MUST always use the values from the `currentAuditContext()` in the 
creation
 thread.
 
+#### Memory Usage of `CommonAuditContext`
+
+The `CommonAuditContext` map has a `ThreadLocal` field to store global
+information which is intended to span multiple operations across multiple
+filesystems, for example the MapReduce or Spark job ID, which is set
+in the S3A committers.
+
+Applications and Hadoop code MUST NOT attach context entries
+which directly or indirectly consumes lots of memory, as the life
+of that memory use will become that of the thread.
+
+Applications and Hadoop code SHOULD remove context entries when
+no-longer needed.
+
+If memory leakage is suspected here, set the log
+`org.apache.hadoop.fs.audit.CommonAuditContext` to `TRACE`
+to log the origin of operations which add log entries.
+
+This will produce a log entry whose strack trace will show the caller chain.f

Review comment:
       remove "f" from the end of line.

##########
File path: 
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/impl/TestActiveAuditManagerThreadLeakage.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.audit.impl;
+
+import java.io.IOException;
+import java.lang.ref.WeakReference;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Consumer;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.impl.WeakReferenceThreadMap;
+import org.apache.hadoop.fs.s3a.audit.AuditSpanS3A;
+import org.apache.hadoop.fs.s3a.audit.MemoryHungryAuditor;
+import org.apache.hadoop.test.AbstractHadoopTestBase;
+
+import static 
org.apache.hadoop.fs.s3a.audit.S3AAuditConstants.AUDIT_SERVICE_CLASSNAME;
+import static 
org.apache.hadoop.fs.s3a.audit.impl.ActiveAuditManagerS3A.PRUNE_THRESHOLD;
+import static 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.emptyStatisticsStore;
+
+/**
+ * This test attempts to recreate the OOM problems of
+ * HADOOP-18091. S3A auditing leaks memory through ThreadLocal references
+ * it does this by creating a thread pool, then
+ * creates and destroys FS instances, with threads in
+ * the pool (but not the main JUnit test thread) creating
+ * audit spans.
+ *
+ * With a custom audit span with a large memory footprint,
+ * memory demands will be high, and if the closed instances
+ * don't get cleaned up, memory runs out.
+ * GCs are forced.
+ * It is critical no spans are created in the junit thread because they will
+ * last for the duration of the test JVM.
+ */
+@SuppressWarnings("ResultOfMethodCallIgnored")
+public class TestActiveAuditManagerThreadLeakage extends 
AbstractHadoopTestBase {
+  /**
+   * Logging.
+   */
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestActiveAuditManagerThreadLeakage.class);
+
+  /** how many managers to sequentially create. */
+  private static final int MANAGER_COUNT = 500;
+
+  /** size of long lived hread pool. */
+  private static final int THREAD_COUNT = 20;
+  private ExecutorService workers;
+
+  /**
+   * count of prunings which have taken place in the manager lifecycle
+   * operations.
+   */
+  private int pruneCount;
+
+  /**
+   * As audit managers are created they are added to the list,
+   * so we can verify they get GC'd.
+   */
+  private final List<WeakReference<ActiveAuditManagerS3A>> auditManagers =
+      new ArrayList<>();
+
+  /**
+   * When the service is stopped, the span map is
+   * cleared immediately.
+   */
+  @Test
+  public void testSpanMapClearedInServiceStop() throws IOException {
+    try (ActiveAuditManagerS3A auditManager =
+             new ActiveAuditManagerS3A(emptyStatisticsStore())) {
+      auditManager.init(createMemoryHungryConfiguration());
+      auditManager.start();
+      auditManager.getActiveAuditSpan();
+      // get the span map
+      final WeakReferenceThreadMap<?> spanMap
+          = auditManager.getActiveSpanMap();
+      Assertions.assertThat(spanMap.size())
+          .describedAs("map size")
+          .isEqualTo(1);
+      auditManager.stop();
+      Assertions.assertThat(spanMap.size())
+          .describedAs("map size")
+          .isEqualTo(0);
+    }
+  }
+
+  @Test
+  public void testMemoryLeak() throws Throwable {
+    workers = Executors.newFixedThreadPool(THREAD_COUNT);
+    for (int i = 0; i < MANAGER_COUNT; i++) {
+      final long oneAuditConsumption = createAndTestOneAuditor();
+      LOG.info("manager {} memory retained {}", i, oneAuditConsumption);
+    }
+
+    // pruning must have taken place.
+    // that's somewhat implicit in the test not going OOM.
+    // but if memory allocation in test runs increase, it
+    // may cease to hold. in which case: create more
+    // audit managers
+    LOG.info("Totel prune count {}", pruneCount);
+
+    Assertions.assertThat(pruneCount)
+        .describedAs("Totel prune count")
+        .isNotZero();
+
+    // now count number of audit managers GC'd
+    // some must have been GC'd, showing that no other
+    // references are being retained internally.
+    Assertions.assertThat(auditManagers.stream()
+            .filter((r) -> r.get() == null)
+            .count())
+        .describedAs("number of audit managers garbage collected")
+        .isNotZero();
+  }
+
+  /**
+   * Create, use and then shutdown one auditor in a unique thread.
+   * @return memory consumed/released
+   */
+  private long createAndTestOneAuditor() throws Exception {
+    long original = Runtime.getRuntime().freeMemory();
+    ExecutorService factory = Executors.newSingleThreadExecutor();
+
+    pruneCount += factory.submit(this::createAuditorAndWorkers).get();
+
+    factory.shutdown();

Review comment:
       shutdown in finally block

##########
File path: 
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/impl/TestActiveAuditManagerThreadLeakage.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.audit.impl;
+
+import java.io.IOException;
+import java.lang.ref.WeakReference;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Consumer;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.impl.WeakReferenceThreadMap;
+import org.apache.hadoop.fs.s3a.audit.AuditSpanS3A;
+import org.apache.hadoop.fs.s3a.audit.MemoryHungryAuditor;
+import org.apache.hadoop.test.AbstractHadoopTestBase;
+
+import static 
org.apache.hadoop.fs.s3a.audit.S3AAuditConstants.AUDIT_SERVICE_CLASSNAME;
+import static 
org.apache.hadoop.fs.s3a.audit.impl.ActiveAuditManagerS3A.PRUNE_THRESHOLD;
+import static 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.emptyStatisticsStore;
+
+/**
+ * This test attempts to recreate the OOM problems of
+ * HADOOP-18091. S3A auditing leaks memory through ThreadLocal references
+ * it does this by creating a thread pool, then
+ * creates and destroys FS instances, with threads in
+ * the pool (but not the main JUnit test thread) creating
+ * audit spans.
+ *
+ * With a custom audit span with a large memory footprint,
+ * memory demands will be high, and if the closed instances
+ * don't get cleaned up, memory runs out.
+ * GCs are forced.
+ * It is critical no spans are created in the junit thread because they will
+ * last for the duration of the test JVM.
+ */
+@SuppressWarnings("ResultOfMethodCallIgnored")
+public class TestActiveAuditManagerThreadLeakage extends 
AbstractHadoopTestBase {
+  /**
+   * Logging.
+   */
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestActiveAuditManagerThreadLeakage.class);
+
+  /** how many managers to sequentially create. */
+  private static final int MANAGER_COUNT = 500;
+
+  /** size of long lived hread pool. */
+  private static final int THREAD_COUNT = 20;
+  private ExecutorService workers;
+
+  /**
+   * count of prunings which have taken place in the manager lifecycle
+   * operations.
+   */
+  private int pruneCount;
+
+  /**
+   * As audit managers are created they are added to the list,
+   * so we can verify they get GC'd.
+   */
+  private final List<WeakReference<ActiveAuditManagerS3A>> auditManagers =
+      new ArrayList<>();
+
+  /**
+   * When the service is stopped, the span map is
+   * cleared immediately.
+   */
+  @Test
+  public void testSpanMapClearedInServiceStop() throws IOException {
+    try (ActiveAuditManagerS3A auditManager =
+             new ActiveAuditManagerS3A(emptyStatisticsStore())) {
+      auditManager.init(createMemoryHungryConfiguration());
+      auditManager.start();
+      auditManager.getActiveAuditSpan();
+      // get the span map
+      final WeakReferenceThreadMap<?> spanMap
+          = auditManager.getActiveSpanMap();
+      Assertions.assertThat(spanMap.size())
+          .describedAs("map size")
+          .isEqualTo(1);
+      auditManager.stop();
+      Assertions.assertThat(spanMap.size())
+          .describedAs("map size")
+          .isEqualTo(0);
+    }
+  }
+
+  @Test
+  public void testMemoryLeak() throws Throwable {
+    workers = Executors.newFixedThreadPool(THREAD_COUNT);
+    for (int i = 0; i < MANAGER_COUNT; i++) {
+      final long oneAuditConsumption = createAndTestOneAuditor();
+      LOG.info("manager {} memory retained {}", i, oneAuditConsumption);
+    }
+
+    // pruning must have taken place.
+    // that's somewhat implicit in the test not going OOM.
+    // but if memory allocation in test runs increase, it
+    // may cease to hold. in which case: create more
+    // audit managers
+    LOG.info("Totel prune count {}", pruneCount);
+
+    Assertions.assertThat(pruneCount)
+        .describedAs("Totel prune count")

Review comment:
       same typo

##########
File path: 
hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/auditing_architecture.md
##########
@@ -119,16 +119,81 @@ The auditor then creates and returns a span for the 
specific operation.
 The AuditManagerS3A will automatically activate the span returned by the 
auditor
 (i.e. assign it the thread local variable tracking the active span in each 
thread).
 
-### Memory Leakage through `ThreadLocal` use
+### Memory Leakage through `ThreadLocal` misuse
 
-This architecture contains a critical defect,
+The original implementation of the integration with the S3AFileSystem class
+contained a critical defect,
 [HADOOP-18091](https://issues.apache.org/jira/browse/HADOOP-18091) _S3A 
auditing leaks memory through ThreadLocal references_.
 
-The code was written assuming that when the `ActiveAuditManagerS3A` service is
-stopped, it's `ThreadLocal` fields would be freed.
-In fact, they are retained until the threads with references are terminated.
+The original code was written with the assumption that when the 
`ActiveAuditManagerS3A` service was
+garbage collected, references in its `ThreadLocal` field would be freed.
+In fact, they are retained until all threads with references are terminated.
+If any long-lived thread had performed an s3 operation which created a span,
+a reference back to the audit manager instance was created 
+*whose lifetime was that of the thread*
+
+In short-lived processes, and long-lived processes where a limited set of
+`S3AFileSystem` instances were reused, this had no adverse effect.
+Indeed, if the filesystem instances were retained in the cache until
+the process was shut down, there would be strong references to the owning
+`S3AFileSystem` instance anyway.
+
+Where it did have problems was when the following conditions were met
+1. Process was long-lived.
+2. Long-lived threads in the process invoked filesystem operations on `s3a://` 
URLs.
+3. Either `S3AFileSystem` instances were created repeatedly, rather than 
retrieved
+   from the cache of active instances.
+4. Or, after a query for a specific user was completed,
+   `FileSystem.closeAllForUGI(UserGroupInformation)` was invoked to remove all
+   cached FS instances of that user. 
+
+Conditions 1, 2 and 4 are exactly those which long-lived Hive services can
+encounter.
+
+_Auditing was disabled by default until a fix was implemented._
+
+The memory leak has been fixed using a new class 
`org.apache.hadoop.util.WeakReferenceMap`
+to store a map of thread IDs to active spans. When the S3A fileystem is closed,
+its audit manager service is stopped and all references to spans removed from 
the
+map of thread ID to span.
+
+Weak References are used for the map so that span references do not consume 
memory even if
+threads are terminated without resetting the span reference of that thread.
+There is therefore a theoretical risk that if a garbage collection takes place 
during
+execution of a spanned operation, the reference will be lost.
+
+This is not considered an issue as all bounded entry points into the S3A 
filesystem
+retain a strong reference to their audit span. 
+
+All entry points which return an object which can invoke s3 operations (input 
and output
+streams, list iterators, etc.) also retain a strong reference to their span, a 
reference
+they activate before performing S3 operations.
+A factory method is provided to demand-generate a new span if, somehow, these 
conditions
+are not met. The "unbounded span" is used here.
+Except in deployments where `fs.s3a.audit.reject.out.of.span.operations` is 
true,
+invoking S3 operations within the unbounded span are permitted.
+That option is set to `true` within S3A test suites.
+Therefore it is unlikely that any operations are invoked in unbounded spans 
except
+for the special case of copy operations invoked by the transfer manager 
threads. 
+Those are already ignored in the logging auditor, whose unbounded span ignores
+requests which `AWSRequestAnalyzer.isRequestNotAlwaysInSpan()` indicates
+may happen outside of a span.
+This is restricted to bucket location probes possibly performed by the SDK
+on instantiatio, and copy part/complete calls.

Review comment:
       nit: typo -> instantiation

##########
File path: 
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/impl/TestActiveAuditManagerThreadLeakage.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.audit.impl;
+
+import java.io.IOException;
+import java.lang.ref.WeakReference;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Consumer;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.impl.WeakReferenceThreadMap;
+import org.apache.hadoop.fs.s3a.audit.AuditSpanS3A;
+import org.apache.hadoop.fs.s3a.audit.MemoryHungryAuditor;
+import org.apache.hadoop.test.AbstractHadoopTestBase;
+
+import static 
org.apache.hadoop.fs.s3a.audit.S3AAuditConstants.AUDIT_SERVICE_CLASSNAME;
+import static 
org.apache.hadoop.fs.s3a.audit.impl.ActiveAuditManagerS3A.PRUNE_THRESHOLD;
+import static 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.emptyStatisticsStore;
+
+/**
+ * This test attempts to recreate the OOM problems of
+ * HADOOP-18091. S3A auditing leaks memory through ThreadLocal references
+ * it does this by creating a thread pool, then
+ * creates and destroys FS instances, with threads in
+ * the pool (but not the main JUnit test thread) creating
+ * audit spans.
+ *
+ * With a custom audit span with a large memory footprint,
+ * memory demands will be high, and if the closed instances
+ * don't get cleaned up, memory runs out.
+ * GCs are forced.
+ * It is critical no spans are created in the junit thread because they will
+ * last for the duration of the test JVM.
+ */
+@SuppressWarnings("ResultOfMethodCallIgnored")
+public class TestActiveAuditManagerThreadLeakage extends 
AbstractHadoopTestBase {
+  /**
+   * Logging.
+   */
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestActiveAuditManagerThreadLeakage.class);
+
+  /** how many managers to sequentially create. */
+  private static final int MANAGER_COUNT = 500;
+
+  /** size of long lived hread pool. */
+  private static final int THREAD_COUNT = 20;
+  private ExecutorService workers;

Review comment:
       shutdown the workers?

##########
File path: 
hadoop-tools/hadoop-aws/src/test/java/org/apache/hadoop/fs/s3a/audit/impl/TestActiveAuditManagerThreadLeakage.java
##########
@@ -0,0 +1,394 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hadoop.fs.s3a.audit.impl;
+
+import java.io.IOException;
+import java.lang.ref.WeakReference;
+import java.util.ArrayList;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+import java.util.concurrent.ExecutionException;
+import java.util.concurrent.ExecutorService;
+import java.util.concurrent.Executors;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.function.Consumer;
+
+import org.assertj.core.api.Assertions;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.impl.WeakReferenceThreadMap;
+import org.apache.hadoop.fs.s3a.audit.AuditSpanS3A;
+import org.apache.hadoop.fs.s3a.audit.MemoryHungryAuditor;
+import org.apache.hadoop.test.AbstractHadoopTestBase;
+
+import static 
org.apache.hadoop.fs.s3a.audit.S3AAuditConstants.AUDIT_SERVICE_CLASSNAME;
+import static 
org.apache.hadoop.fs.s3a.audit.impl.ActiveAuditManagerS3A.PRUNE_THRESHOLD;
+import static 
org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.emptyStatisticsStore;
+
+/**
+ * This test attempts to recreate the OOM problems of
+ * HADOOP-18091. S3A auditing leaks memory through ThreadLocal references
+ * it does this by creating a thread pool, then
+ * creates and destroys FS instances, with threads in
+ * the pool (but not the main JUnit test thread) creating
+ * audit spans.
+ *
+ * With a custom audit span with a large memory footprint,
+ * memory demands will be high, and if the closed instances
+ * don't get cleaned up, memory runs out.
+ * GCs are forced.
+ * It is critical no spans are created in the junit thread because they will
+ * last for the duration of the test JVM.
+ */
+@SuppressWarnings("ResultOfMethodCallIgnored")
+public class TestActiveAuditManagerThreadLeakage extends 
AbstractHadoopTestBase {
+  /**
+   * Logging.
+   */
+  private static final Logger LOG =
+      LoggerFactory.getLogger(TestActiveAuditManagerThreadLeakage.class);
+
+  /** how many managers to sequentially create. */
+  private static final int MANAGER_COUNT = 500;
+
+  /** size of long lived hread pool. */
+  private static final int THREAD_COUNT = 20;
+  private ExecutorService workers;
+
+  /**
+   * count of prunings which have taken place in the manager lifecycle
+   * operations.
+   */
+  private int pruneCount;
+
+  /**
+   * As audit managers are created they are added to the list,
+   * so we can verify they get GC'd.
+   */
+  private final List<WeakReference<ActiveAuditManagerS3A>> auditManagers =
+      new ArrayList<>();
+
+  /**
+   * When the service is stopped, the span map is
+   * cleared immediately.
+   */
+  @Test
+  public void testSpanMapClearedInServiceStop() throws IOException {
+    try (ActiveAuditManagerS3A auditManager =
+             new ActiveAuditManagerS3A(emptyStatisticsStore())) {
+      auditManager.init(createMemoryHungryConfiguration());
+      auditManager.start();
+      auditManager.getActiveAuditSpan();
+      // get the span map
+      final WeakReferenceThreadMap<?> spanMap
+          = auditManager.getActiveSpanMap();
+      Assertions.assertThat(spanMap.size())
+          .describedAs("map size")
+          .isEqualTo(1);
+      auditManager.stop();
+      Assertions.assertThat(spanMap.size())
+          .describedAs("map size")
+          .isEqualTo(0);
+    }
+  }
+
+  @Test
+  public void testMemoryLeak() throws Throwable {
+    workers = Executors.newFixedThreadPool(THREAD_COUNT);
+    for (int i = 0; i < MANAGER_COUNT; i++) {
+      final long oneAuditConsumption = createAndTestOneAuditor();
+      LOG.info("manager {} memory retained {}", i, oneAuditConsumption);
+    }
+
+    // pruning must have taken place.
+    // that's somewhat implicit in the test not going OOM.
+    // but if memory allocation in test runs increase, it
+    // may cease to hold. in which case: create more
+    // audit managers
+    LOG.info("Totel prune count {}", pruneCount);

Review comment:
       typo in totel?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 718543)
    Time Spent: 5h 50m  (was: 5h 40m)

> S3A auditing leaks memory through ThreadLocal references
> --------------------------------------------------------
>
>                 Key: HADOOP-18091
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18091
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs/s3
>    Affects Versions: 3.3.2
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 5h 50m
>  Remaining Estimate: 0h
>
> {{ActiveAuditManagerS3A}} uses thread locals to map to active audit spans, 
> which (because they are wrapped) include back reference to the audit manager 
> instance and the config it was created with.
> these *do not* get cleaned up when the FS instance is closed.
> if you have a long lived process creating and destroying many FS instances, 
> then memory gets used up. l



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org

Reply via email to