Have you analyzed your heap usage to determine how much you actually need? If you're using native maps, those live outside the Java heap, so the main things the heap is used for are the block cache and any custom iterators you've written and deployed that are themselves memory-hungry. Every use case is unique, so it's hard to give a concrete recommendation, but I'd have expected substantially less, somewhere between 4G and 16G, or maybe a little more, depending on how much you want to dedicate to the block cache. My guess is that with 80G, you're probably not using most of it. That might be okay for an HDFS NameNode, or maybe even a ZooKeeper node, but for an Accumulo tserver it's probably overkill. Without knowing the totality of factors in play for you, though, it's really hard to say what the right number should be. I can only offer considerations for you to investigate and apply to your own scenario.
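For reference, a tserver memory budget along those lines can be expressed with the standard 1.x tserver properties. This is an illustrative sketch, not a recommendation: the property names are the usual memory knobs, but the sizes shown are assumptions you would need to validate against your own workload.

```xml
<!-- accumulo-site.xml: a hypothetical memory budget for one tserver -->
<property>
  <name>tserver.memory.maps.native.enabled</name>
  <value>true</value>  <!-- in-memory maps live off-heap in the native library -->
</property>
<property>
  <name>tserver.memory.maps.max</name>
  <value>8G</value>    <!-- off-heap when native maps are enabled -->
</property>
<property>
  <name>tserver.cache.data.size</name>
  <value>4G</value>    <!-- on-heap data block cache -->
</property>
<property>
  <name>tserver.cache.index.size</name>
  <value>2G</value>    <!-- on-heap index cache -->
</property>
```

With native maps enabled, the JVM heap (e.g. `-Xmx16g` in `ACCUMULO_TSERVER_OPTS` in accumulo-env.sh) mainly has to cover the two caches plus general working memory; the in-memory map budget sits outside the heap entirely.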
Also keep in mind that the Hadoop native libraries are used for compression, so you may need extra memory outside the heap for gzip, as well as for other services running on the machine. You probably also want to avoid provisioning beyond about 90% of a Linux machine's installed RAM, to ensure you're not starving the OS itself. Also, some modern Linux systems now run systemd-oomd by default. It can be a bit aggressive about killing processes under swap pressure, in order to preserve system responsiveness before the kernel's built-in OOM killer runs (it replaces the earlyoom service, which I never used). I ended up masking the service on my Fedora 34 Jenkins instance, but I'm not sure what the right configuration would be for a production system.

On Tue, Nov 23, 2021 at 6:00 AM Ligade, Shailesh [USA] <[email protected]> wrote:
>
> Yes, we are using the native library... I was thinking of reducing the heap to 65G...
>
> -S
>
> -----Original Message-----
> From: Christopher <[email protected]>
> Sent: Monday, November 22, 2021 7:20 PM
> To: accumulo-user <[email protected]>
> Subject: Re: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest
>
> I don't know how to tune the oom killer, but I do wonder why you would need an 80G Java heap. That seems excessive to me. Are you using the native map library?
>
> On Mon, Nov 22, 2021 at 7:06 PM Ligade, Shailesh [USA] <[email protected]> wrote:
> >
> > Thanks Christopher,
> >
> > It is actually the oom killer. So how can I prevent it? I mean, I have Xmx/Xms set to 80G on a 128G machine, so some process is hogging the memory. Under normal usage I don't see the issue, but under bulk ingest I do.
> > I am going to try reducing the heap and test, but I really don't want to starve the tserver either. I added more tservers, hoping that reducing the number of tablets per tserver might help, but it didn't.
> > Do you recommend setting oom_score_adj to, say, -100?
> >
> > Appreciate your help
> >
> > -S
> >
> > -----Original Message-----
> > From: Christopher <[email protected]>
> > Sent: Monday, November 22, 2021 12:23 PM
> > To: accumulo-user <[email protected]>
> > Subject: [External] Re: acumulo 1.10.0 tserver goes down under heavy ingest
> >
> > That log message is basically just reporting that the connection to ZK failed; it's not very helpful in determining what led to that. You'll probably have to gather additional evidence to track down the problem. Check the master and tserver logs prior to the crash, as well as the ZooKeeper logs. If you can catch the manager or a tserver in a bad state, try to capture a jstack of its process ID. Also check the system logs for messages such as the oom-killer running and killing your processes.
> >
> > On Mon, Nov 22, 2021 at 12:04 PM Ligade, Shailesh [USA] <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > I have an 8-node cluster. Under heavy load a tserver goes down; we have a systemd unit file to auto-restart it, but that causes unassigned tablets for an hour.
> > >
> > > In the log of the restarted tserver I see
> > > WARN: Saw (possibly) transient exception communicating with zookeeper
> > > and then the error KeeperErrorCode = ConnectionLoss for /accumulo/<instance>/xxx
> > > KeeperErrorCode = ConnectionLoss
> > > at KeeperException.create(KeeperException.java:102)
> > > at KeeperException.create(KeeperException.java:54)
> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2736)
> > > at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2762)
> > > at org.apache.accumulo.fate.zookeeper.ZooReader.getChildren(ZooReader.java:159)
> > > xxxxx
> > >
> > > Any suggestions?
> > >
> > > -S
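Christopher's diagnostic advice in the thread above (check the logs, capture a jstack of a wedged process, look for oom-killer activity) can be sketched as a couple of shell commands. The pgrep pattern and file names here are my assumptions, not anything from the thread; adjust them for your installation.

```shell
# Sketch of the post-crash evidence gathering described above.

# 1. Did the kernel oom-killer fire? Search the kernel log (run as root):
#      dmesg -T | grep -iE 'out of memory|oom-killer|killed process'
#      journalctl -k | grep -iE 'out of memory|oom-killer|killed process'

# 2. Thread-dump a live, possibly wedged tserver before restarting it
#    (hypothetical pgrep pattern; match it to your process command line):
#      jstack "$(pgrep -f 'accumulo.*tserver' | head -n 1)" > tserver.jstack

# The kind of kernel line step 1 is looking for (a sample, not real output):
echo "Out of memory: Killed process 12345 (java) total-vm:83886080kB" \
  | grep -iE 'killed process'
```

If the grep in step 1 matches anything around the time of the crash, you have your answer: the kernel (or systemd-oomd) killed the tserver, and the ZooKeeper ConnectionLoss in the restarted process is just a symptom.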

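Following up on the systemd-oomd and oom_score_adj points earlier in the thread: both can be handled with a systemd drop-in for the tserver's unit. This is a sketch assuming a unit named accumulo-tserver.service (a hypothetical name; substitute your own), and, as noted above, whether shielding the tserver this way is appropriate for a production system is a judgment call.

```ini
# /etc/systemd/system/accumulo-tserver.service.d/oom.conf
# Hypothetical unit name; run `systemctl daemon-reload` after adding this.
[Service]
# Bias the kernel OOM killer away from the tserver (range -1000..1000);
# systemd writes this value to /proc/<pid>/oom_score_adj for the service.
OOMScoreAdjust=-100
```

Note that OOMScoreAdjust only influences the kernel's OOM killer, not systemd-oomd, which acts on cgroup memory-pressure information. The blunt approach I took on Fedora 34 was to disable the daemon entirely with `systemctl mask --now systemd-oomd.service`.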