Thanks, that's what I'll do in the meantime. Appreciate your help.

On Tue, Aug 27, 2024 at 10:19 AM Dave Marion <[email protected]> wrote:
> Restarting the secondary GC processes is likely the easiest thing to do. If you can't identify them, then you should be able to restart all of the GC processes. Accumulo can operate without the GC process for some period of time, but it's advised to keep it running.
>
> On 2024/08/27 12:48:21 Craig Portoghese wrote:
> > Thanks Dave! Are there any mitigations I can employ to work around this until 2.1.4 is released? I suppose on the standby servers I can schedule a cronjob to restart the GC process every few hours. I'm not familiar with how long Accumulo can operate without a GC in general, so maybe that's something I should test for my particular database size/use.
> >
> > On Mon, Aug 26, 2024 at 1:39 PM Dave Marion <[email protected]> wrote:
> >
> > > Thanks for reporting this. Based on the information you provided I was able to create https://github.com/apache/accumulo/pull/4838. It appears that the Manager, Monitor, and SimpleGarbageCollector are creating multiple instances of ServiceLock when in a loop waiting to acquire the lock (when they are the standby node). The ServiceLock constructor creates a Watcher in the ZooKeeper client, which is likely causing the problem you are having. The Manager and Monitor operate a little differently and thus do not exhibit the same OOME problem.
> > >
> > > On 2024/08/26 12:13:50 Craig Portoghese wrote:
> > > > Wasn't sure if this was bug territory or an issue with cluster configuration.
> > > >
> > > > In my dev environment, I have a 5-server AWS EMR cluster using Accumulo 2.1.2, Hadoop 3.3.6, and Zookeeper 3.5.10. The cluster is in high availability mode, so there are 3 primary nodes with Zookeeper running. On the primary nodes I run the manager, monitor, and gc processes. On the 2 core nodes (with DataNode on them) I run just tablet servers.
> > > >
> > > > The manager and monitor processes on the 2nd and 3rd servers are fine, with no complaints about not being the leader for their process. However, the 2nd and 3rd GC processes repeatedly log a DEBUG message, "Failed to acquire lock". Each complains that there is already a gc lock, and then creates an ephemeral node #0000000001, then #0000000002, etc. After about 8 hours of this complaint loop, it turns into an error, "Called determineLockOwnership() when ephemeralNodeName == null", which it spams forever, filling up the server and eventually killing it.
> > > >
> > > > This has happened in multiple environments. Is it an issue with GC's ability to hold elections? Should I be putting the standby GC processes on a different node than the one running one of the zookeepers?
> > > > Below are samples of the two log types:
> > > >
> > > > 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
> > > > 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
> > > > 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
> > > > 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
> > > > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
> > > > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > > > 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry
> > > >
> > > > 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
> > > > java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
> > > >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
> > > >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
> > > >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
> > > >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]
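For readers skimming the thread, here is a minimal, purely illustrative Java sketch of the lock-attempt pattern visible in the DEBUG sequence above. It uses the raw ZooKeeper client rather than Accumulo's ServiceLock, and the class name, LOCK_PATH, and tryLockOnce() are invented for the example: create an ephemeral sequential candidate, check whether it sorts first, otherwise set a watch on the prior node, delete the candidate, and report failure so the caller can retry.

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class StandbyLockAttempt {

  // Hypothetical path standing in for /accumulo/<instance-id>/gc/lock;
  // assumed to already exist.
  static final String LOCK_PATH = "/example/gc/lock";

  // One attempt, mirroring the DEBUG lines above: "Ephemeral node ... created",
  // "Lock held by another process", "Establishing watch on prior node",
  // "Failed to acquire lock in tryLock(), deleting all at path".
  static boolean tryLockOnce(ZooKeeper zk, Watcher watcher) throws Exception {
    String candidate = zk.create(LOCK_PATH + "/zlock#", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String name = candidate.substring(LOCK_PATH.length() + 1);

    List<String> children = zk.getChildren(LOCK_PATH, false);
    Collections.sort(children);

    if (name.equals(children.get(0))) {
      return true; // lowest sequence number: this process now holds the lock
    }

    // Register the caller's watcher on the node just ahead of ours. Handing the
    // client a brand-new watcher object on every retry is the kind of
    // accumulation Dave describes for the standby GC.
    String prior = children.get(children.indexOf(name) - 1);
    zk.exists(LOCK_PATH + "/" + prior, watcher);

    // Give up this candidate; the next attempt creates a fresh one, which is
    // why the reported sequence numbers keep climbing (#0000000001, #0000000002, ...).
    zk.delete(candidate, -1);
    return false;
  }
}

A caller would invoke tryLockOnce in a loop with a delay between attempts; the climbing candidate numbers Craig reports are those successive attempts.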

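As a companion sketch, and again only an illustration rather than the actual change in https://github.com/apache/accumulo/pull/4838, the contrast below shows the two loop shapes Dave describes: handing the ZooKeeper client a brand-new Watcher object on every retry versus creating one Watcher up front and reusing it. HELD_LOCK_NODE and both method names are hypothetical.

import java.util.concurrent.Semaphore;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatcherReuse {

  // Hypothetical node standing in for the active GC's lock node, e.g.
  // /accumulo/<instance-id>/gc/lock/zlock#<uuid>#0000000000
  static final String HELD_LOCK_NODE = "/example/gc/lock/zlock#0000000000";

  // Leaky shape: each pass registers a new Watcher object for a node that is
  // not going away, so the client-side registrations pile up for as long as
  // the other process holds the lock.
  static void leakyWait(ZooKeeper zk) throws Exception {
    while (zk.exists(HELD_LOCK_NODE, event -> { /* new Watcher every pass */ }) != null) {
      Thread.sleep(60_000); // poll again later; the earlier watchers are still registered
    }
  }

  // Reused shape: one Watcher object for the life of the standby. Watches are
  // one-shot, so at most one registration is outstanding at a time.
  static void reusedWait(ZooKeeper zk) throws Exception {
    Semaphore wake = new Semaphore(0);
    Watcher watcher = event -> wake.release(); // created once, reused
    while (zk.exists(HELD_LOCK_NODE, watcher) != null) {
      wake.acquire(); // block until the watched node changes, then re-check
    }
  }
}

The restart mitigation discussed at the top of the thread works for the same reason: restarting the standby GC discards its ZooKeeper client, and with it whatever watcher registrations have accumulated.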