Thanks, that's what I'll do in the meantime. Appreciate your help.

On Tue, Aug 27, 2024 at 10:19 AM Dave Marion <[email protected]> wrote:
> Restarting the secondary GC processes is likely the easiest thing to do. If you can't identify them, then you should be able to restart all of the GC processes. Accumulo can operate without the GC process for some period of time, but it's advised to keep it running.
>
> On 2024/08/27 12:48:21 Craig Portoghese wrote:
> > Thanks Dave! Are there any mitigations I can employ to work around this until 2.1.4 is released? I suppose on the standby servers I can schedule a cronjob to restart the GC process every few hours. I'm not familiar with how long Accumulo can operate without a GC in general, so maybe that's something I should test for my particular database size/use.
> >
> > On Mon, Aug 26, 2024 at 1:39 PM Dave Marion <[email protected]> wrote:
> >
> > > Thanks for reporting this. Based on the information you provided I was able to create https://github.com/apache/accumulo/pull/4838. It appears that the Manager, Monitor, and SimpleGarbageCollector are creating multiple instances of ServiceLock when in a loop waiting to acquire the lock (when they are the standby node). The ServiceLock constructor creates a Watcher in the ZooKeeper client, which is likely causing the problem you are having. The Manager and Monitor operate a little differently and thus do not exhibit the same OOME problem.
> > >
> > > On 2024/08/26 12:13:50 Craig Portoghese wrote:
> > > > Wasn't sure if this was bug territory or an issue with cluster configuration.
> > > >
> > > > In my dev environment, I have a 5-server AWS EMR cluster using Accumulo 2.1.2, Hadoop 3.3.6, and Zookeeper 3.5.10. The cluster is in high availability mode, so there are 3 primary nodes with Zookeeper running. On the primary nodes I run the manager, monitor, and gc processes. On the 2 core nodes (with DataNode on them) I run just tablet servers.
> > > >
> > > > The manager and monitor processes on the 2nd and 3rd servers are fine, with no complaints about not being the leader for their process. However, the 2nd and 3rd GC processes repeatedly log a DEBUG message, "Failed to acquire lock". Each complains that there is already a gc lock, and then creates an ephemeral node #0000000001, then #0000000002, etc. After about 8 hours of this complaint loop, it turns into an error, "Called determineLockOwnership() when ephemeralNodeName == null", which it spams forever, filling up the server and eventually killing it.
> > > >
> > > > This has happened in multiple environments. Is it an issue with GC's ability to hold elections? Should I be putting the standby GC processes on a different node than the one running one of the zookeepers?
> > > > Below are samples of the two log types:
> > > >
> > > > 2024-08-24T15:28:03,292 [gc.SimpleGarbageCollector] INFO : Trying to acquire ZooKeeper lock for garbage collector
> > > > 2024-08-24T15:28:03,330 [metrics.MetricsUtil] INFO : Metric producer ThriftMetrics initialize
> > > > 2024-08-24T15:28:03,335 [rpc.TServerUtils] DEBUG: Instantiating unsecure custom half-async Thrift server
> > > > 2024-08-24T15:28:03,348 [gc.SimpleGarbageCollector] DEBUG: Starting garbage collector listening on coreNode1.example.domain:9998
> > > > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Ephemeral node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057 created
> > > > 2024-08-24T15:28:59,694 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Setting watcher on /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Lock held by another process with ephemeral node: zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Establishing watch on prior node /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#85f3ef81-b877-4321-a416-a24faec6f032#0000000000
> > > > 2024-08-24T15:28:59,695 [zookeeper.ServiceLock] DEBUG: [zlock#a1a993fd-9086-4473-9545-113a865ca539#] Failed to acquire lock in tryLock(), deleting all at path: /accumulo/f73e29fc-62b3-44bf-8d6d-694a6a262a98/gc/lock/zlock#a1a993fd-9086-4473-9545-113a865ca539#0000000057
> > > > 2024-08-24T15:28:59,697 [gc.SimpleGarbageCollector] DEBUG: Failed to get GC ZooKeeper lock, will retry
> > > >
> > > > 2024-08-25T21:48:31,418 [zookeeper.ClientCnxn] ERROR: Error while calling watcher
> > > > java.lang.IllegalStateException: Called determineLockOwnership() when ephemeralNodeName == null
> > > >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock.determineLockOwnership(ServiceLock.java:274) ~[accumulo-core-2.1.2.jar:2.1.2]
> > > >     at org.apache.accumulo.core.fate.zookeeper.ServiceLock$1.process(ServiceLock.java:354) ~[accumulo-core-2.1.2.jar:2.1.2]
> > > >     at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:532) ~[zookeeper-3.5.10.jar:3.5.10--1]
> > > >     at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507) ~[zookeeper-3.5.10.jar:3.5.10--1]
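For readers skimming the thread, here is a minimal, purely illustrative Java sketch of the lock-attempt pattern visible in the DEBUG sequence above. It uses the raw ZooKeeper client rather than Accumulo's ServiceLock, and the class name, LOCK_PATH, and tryLockOnce() are invented for the example: create an ephemeral sequential candidate, check whether it sorts first, otherwise set a watch on the prior node, delete the candidate, and report failure so the caller can retry.

import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class StandbyLockAttempt {

  // Hypothetical path standing in for /accumulo/<instance-id>/gc/lock;
  // assumed to already exist.
  static final String LOCK_PATH = "/example/gc/lock";

  // One attempt, mirroring the DEBUG lines above: "Ephemeral node ... created",
  // "Lock held by another process", "Establishing watch on prior node",
  // "Failed to acquire lock in tryLock(), deleting all at path".
  static boolean tryLockOnce(ZooKeeper zk, Watcher watcher) throws Exception {
    String candidate = zk.create(LOCK_PATH + "/zlock#", new byte[0],
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    String name = candidate.substring(LOCK_PATH.length() + 1);

    List<String> children = zk.getChildren(LOCK_PATH, false);
    Collections.sort(children);

    if (name.equals(children.get(0))) {
      return true; // lowest sequence number: this process now holds the lock
    }

    // Register the caller's watcher on the node just ahead of ours. Handing the
    // client a brand-new watcher object on every retry is the kind of
    // accumulation Dave describes for the standby GC.
    String prior = children.get(children.indexOf(name) - 1);
    zk.exists(LOCK_PATH + "/" + prior, watcher);

    // Give up this candidate; the next attempt creates a fresh one, which is
    // why the reported sequence numbers keep climbing (#0000000001, #0000000002, ...).
    zk.delete(candidate, -1);
    return false;
  }
}

A caller would invoke tryLockOnce in a loop with a delay between attempts; the climbing candidate numbers Craig reports are those successive attempts.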

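As a companion sketch, and again only an illustration rather than the actual change in https://github.com/apache/accumulo/pull/4838, the contrast below shows the two loop shapes Dave describes: handing the ZooKeeper client a brand-new Watcher object on every retry versus creating one Watcher up front and reusing it. HELD_LOCK_NODE and both method names are hypothetical.

import java.util.concurrent.Semaphore;

import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class WatcherReuse {

  // Hypothetical node standing in for the active GC's lock node, e.g.
  // /accumulo/<instance-id>/gc/lock/zlock#<uuid>#0000000000
  static final String HELD_LOCK_NODE = "/example/gc/lock/zlock#0000000000";

  // Leaky shape: each pass registers a new Watcher object for a node that is
  // not going away, so the client-side registrations pile up for as long as
  // the other process holds the lock.
  static void leakyWait(ZooKeeper zk) throws Exception {
    while (zk.exists(HELD_LOCK_NODE, event -> { /* new Watcher every pass */ }) != null) {
      Thread.sleep(60_000); // poll again later; the earlier watchers are still registered
    }
  }

  // Reused shape: one Watcher object for the life of the standby. Watches are
  // one-shot, so at most one registration is outstanding at a time.
  static void reusedWait(ZooKeeper zk) throws Exception {
    Semaphore wake = new Semaphore(0);
    Watcher watcher = event -> wake.release(); // created once, reused
    while (zk.exists(HELD_LOCK_NODE, watcher) != null) {
      wake.acquire(); // block until the watched node changes, then re-check
    }
  }
}

The restart mitigation discussed at the top of the thread works for the same reason: restarting the standby GC discards its ZooKeeper client, and with it whatever watcher registrations have accumulated.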