Re: Add -XX:+ExitOnOutOfMemoryError as default JVM parameter in datanode-env and confignode-env

2023-11-14 Thread Jialin Qiao
+1 for ExitOnOutOfMemoryError.
The heap dump could be an option, disabled by default.
--
Jialin Qiao
Apache IoTDB PMC



Re: Add -XX:+ExitOnOutOfMemoryError as default JVM parameter in datanode-env and confignode-env

2023-11-13 Thread Yuan Tian
Hi,

-XX:+HeapDumpOnOutOfMemoryError is not enabled in the default JVM args. I've
left it commented out, simply as a convenience for DBAs who want to use
this parameter.




Re: Add -XX:+ExitOnOutOfMemoryError as default JVM parameter in datanode-env and confignode-env

2023-11-13 Thread ZhangJian He
+1 for ExitOnOutOfMemoryError.
+HeapDumpOnOutOfMemoryError may produce very large files. How should these
old files be cleaned up? WDYT?
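
The thread does not settle how old dump files should be cleaned up. One possible approach, sketched here purely as an illustration, is a small retention job run from cron. The function name `prune_heap_dumps`, the directory, and the retention period are all assumptions, and `-maxdepth` and `-delete` are GNU/BSD `find` extensions rather than POSIX:

```shell
# Hypothetical helper: delete *.hprof files in a directory once they are
# older than the given number of days. Other files are left untouched.
prune_heap_dumps() {
  dump_dir="$1"        # directory that -XX:HeapDumpPath points into (assumed)
  retention_days="$2"  # keep dumps modified within roughly this many days

  # Nothing to do if the directory does not exist.
  [ -d "$dump_dir" ] || return 0

  # -mtime +N matches files whose age in whole days is greater than N.
  find "$dump_dir" -maxdepth 1 -type f -name '*.hprof' \
    -mtime +"$retention_days" -delete
}
```

A call such as `prune_heap_dumps /path/to/heapdumps 7` would then keep about a week of dumps; both arguments are placeholders.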

Thanks
ZhangJian He




Re: Add -XX:+ExitOnOutOfMemoryError as default JVM parameter in datanode-env and confignode-env

2023-11-13 Thread Gaofei Cao
+1,

The `-XX:+ExitOnOutOfMemoryError` parameter avoids the loss of key
threads, which benefits the system.
If the IoTDB cluster is deployed on k8s, this parameter is even more
important, because k8s can quickly schedule another pod to replace the
OOM node.
Besides, I think we should document the usage of `-XX:+ExitOnOutOfMemoryError`
and `-XX:+HeapDumpOnOutOfMemoryError` in the user/DBA manual, since the heap
dump is important for finding the root cause of an OOM.

Best,
--
Gaofei Cao



Add -XX:+ExitOnOutOfMemoryError as default JVM parameter in datanode-env and confignode-env

2023-11-13 Thread Yuan Tian
Hi all,

Recently, we found in some real user cases that when an OOM occurs in the
DataNode process (we should ensure that OOM does not happen, but we all know
that bugs will always exist), some threads (e.g. RPC listening threads) may
exit unexpectedly, which can cause strange behavior. For example, suppose the
heartbeat listening thread on the DataNode exits unexpectedly due to an OOM,
and the memory pressure then eases on its own (some large queries or
compaction tasks end). The thread is never restarted, so the DataNode remains
in the unknown state, because the ConfigNode can no longer contact it via
heartbeat.

Therefore, we feel that OOM is a high-risk error, and we should let the
process exit directly to avoid the loss of some key threads.

I also ran an experiment and found that -XX:+ExitOnOutOfMemoryError and
-XX:+HeapDumpOnOutOfMemoryError do not conflict: we can keep both in the JVM
args, and when an OOM happens the JVM first dumps the heap and then exits.
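
As a rough sketch of what keeping both options in an env script could look like: the variable name `JVM_OPTS` and the dump path below are placeholders chosen for illustration, not necessarily what the env scripts or the PR use.

```shell
# Hypothetical variable name; the real env scripts may use a different one.
JVM_OPTS="${JVM_OPTS:-}"

# Exit the JVM on the first OutOfMemoryError instead of running on
# with some threads already dead.
JVM_OPTS="$JVM_OPTS -XX:+ExitOnOutOfMemoryError"

# Also write a heap dump when the OutOfMemoryError is thrown.
# The path is a placeholder; pick a location with enough free disk space.
JVM_OPTS="$JVM_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapdump.hprof"

echo "$JVM_OPTS"
```

Commenting out the heap dump line keeps only the exit-on-OOM behavior.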

I've made this change in my PR (https://github.com/apache/iotdb/pull/11531).

What do you think?

Best,
--
Yuan Tian