+1 for ExitOnOutOfMemoryError.
Dump could be an option, disabled by default.
—————————————————
Jialin Qiao
Apache IoTDB PMC

Yuan Tian <[email protected]> wrote on Tue, Nov 14, 2023 at 14:10:
>
> Hi,
>
> -XX:+HeapDumpOnOutOfMemoryError is not provided as a default JVM arg. I've
> commented it out, just to make it convenient for DBAs when they want to
> use this parameter.
>
>
> On Tue, Nov 14, 2023 at 2:00 PM ZhangJian He <[email protected]> wrote:
>
> > +1 for ExitOnOutOfMemoryError.
> > -XX:+HeapDumpOnOutOfMemoryError may produce very large files. How do we
> > clean up these old files? WDYT?
> >
> > Thanks
> > ZhangJian He
> >
> >
> > On Tue, 14 Nov 2023 at 12:04, Gaofei Cao <[email protected]> wrote:
> >
> > > +1,
> > >
> > > The `-XX:+ExitOnOutOfMemoryError` parameter avoids the loss of some
> > > key threads, which will benefit the system.
> > > If the IoTDB cluster is deployed on k8s, this parameter is even more
> > > important, because k8s can quickly schedule another pod to replace the
> > > OOM node.
> > > Besides, I think we can add the usage of `-XX:+ExitOnOutOfMemoryError`
> > > and `-XX:+HeapDumpOnOutOfMemoryError` to the user/DBA manual, which is
> > > important for finding the root cause of an OOM.
> > >
> > > Best,
> > > ----------------------
> > > Gaofei Cao
> > >
> > > Yuan Tian <[email protected]> wrote on Mon, Nov 13, 2023 at 19:52:
> > > >
> > > > Hi all,
> > > >
> > > > Recently, we found in some real user cases that when an OOM occurs in
> > > > the DataNode process (we should ensure that OOM does not happen, but
> > > > we all know that bugs will always exist), some threads (e.g. RPC
> > > > listening threads) may exit unexpectedly, which can cause strange
> > > > behavior. For example, suppose the heartbeat listening thread on the
> > > > DataNode exits unexpectedly due to OOM, and the OOM then recovers on
> > > > its own (some large queries end, or some compaction tasks end). The
> > > > thread is never restarted, so the DataNode remains in the unknown
> > > > state, because the ConfigNode can no longer contact it via heartbeat.
> > > >
> > > > Therefore, we feel that OOM is a high-risk error, and we should let the
> > > > process exit directly to avoid the loss of some key threads.
> > > >
> > > > I also ran an experiment and found that -XX:+ExitOnOutOfMemoryError
> > > > and -XX:+HeapDumpOnOutOfMemoryError do not conflict, which means we
> > > > can keep both in the JVM args; when an OOM happens, the JVM first
> > > > dumps the heap and then exits.
> > > >
> > > > I've made this change in my PR (https://github.com/apache/iotdb/pull/11531).
> > > >
> > > > What do you think?
> > > >
> > > >
> > > >
> > > >
> > > > Best,
> > > > ----------------------
> > > > Yuan Tian
> > >
> >

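The thread-loss scenario described in Yuan Tian's original message can be illustrated with a minimal Java sketch. It is not IoTDB code: the class name, the method names, and the "heartbeat-sketch" thread are hypothetical, and the OutOfMemoryError is thrown by hand rather than caused by a real allocation failure, so a flag such as -XX:+ExitOnOutOfMemoryError would presumably not react to it. The sketch only shows that one thread can die from the error while the rest of the process keeps running, and how a process might check which flags it was started with.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Hypothetical illustration only; none of these names come from IoTDB.
class OomThreadLossSketch {

    /** Returns true if the JVM argument list contains exactly the given flag. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    /**
     * Starts a stand-in "heartbeat" thread that dies from a simulated
     * OutOfMemoryError and reports whether it is still alive afterwards.
     */
    static boolean heartbeatSurvivesSimulatedOom() {
        Thread heartbeat = new Thread(() -> {
            // Simulated: thrown by hand, not by a failed allocation.
            throw new OutOfMemoryError("simulated");
        }, "heartbeat-sketch");
        // Swallow the error so the default handler does not print a stack trace.
        heartbeat.setUncaughtExceptionHandler((t, e) -> { });
        heartbeat.start();
        try {
            heartbeat.join();
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
        return heartbeat.isAlive();
    }

    public static void main(String[] args) {
        // Flags the current JVM was actually started with.
        List<String> jvmArgs =
            ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("ExitOnOutOfMemoryError set: "
            + hasFlag(jvmArgs, "-XX:+ExitOnOutOfMemoryError"));
        System.out.println("heartbeat alive after simulated OOM: "
            + heartbeatSurvivesSimulatedOom());
        // The main thread is unaffected by the other thread's death.
        System.out.println("main thread still running");
    }
}
```

Nothing restarts the dead thread on its own, which matches the "remains in unknown state" symptom described in the thread; exiting the whole process leaves recovery to whatever supervises it (for example k8s, as Gaofei Cao notes).
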