Re: oomkill issue

2023-12-04 Thread Yu Chen
Hi Prashant,
Could you describe in detail the steps you used to run `jeprof`?

In my case, I did it by logging in to the TaskManager's shell and running it 
through shell commands, so I am confused to see a curl operation in the error 
log you provided.
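
For reference, here is a rough sketch of the kind of steps I used; the pod name 
and dump file names are placeholders, and rendering the SVG needs graphviz 
(`dot`) available in the container:

```
# open a shell in the TaskManager pod (pod name is a placeholder)
kubectl exec -it <taskmanager-pod> -- bash

# list the heap profiles dumped by jemalloc under the configured prof_prefix
ls /tmp/jeprof.out.*.heap

# render one profile locally; no curl / remote fetch should be involved
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.<pid>.<seq>.i<seq>.heap > profile.svg
```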

Also, it is true that RocksDB's memory control is not perfect and it is 
possible to exceed the managed memory limit; you can refer to the 
documentation [1] for more details.

[1] Write buffer manager internals - Google Docs 
<https://docs.google.com/document/d/1_4Brwy2Axzzqu7SJ4hLLl92hVCpeRlVEG-fj8KsTMUo/edit#heading=h.f5wfmsmpemd0>
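
In case it helps, these are the main options that control how the managed 
memory budget is split inside RocksDB; a sketch with illustrative values, so 
please verify the exact keys and defaults against your Flink version:

```
# enable managed memory for the RocksDB state backend (default: true)
state.backend.rocksdb.memory.managed: true
# share of the managed budget reserved for write buffers (illustrative value)
state.backend.rocksdb.memory.write-buffer-ratio: 0.5
# share of the block cache reserved for index/filter blocks (illustrative value)
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1
```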

Best,
Yu Chen

> On Dec 5, 2023, at 04:42, prashant parbhane wrote:
> 
> Hi Yu,
> 
> Thanks for your reply.
> 
> When I run the script below
> 
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
> ```
> I am getting the error below
> 
> ```
> Gathering CPU profile from http:///pprof/profile?seconds=30 for 30 seconds to
>   /root/jeprof/java.1701718686.
> Be patient...
> Failed to get profile: curl -s --fail --max-time 90 
> 'http:///pprof/profile?seconds=30' > /root/jeprof/.tmp.java.1701718686.: No 
> such file or directory
> ```
> Any input on this?
> 
> However, the OOMKill was resolved with the RocksDB configurations below:
> • "state.backend.rocksdb.memory.managed": "false"
> • "state.backend.rocksdb.block.cache-size": "10m"
> • "state.backend.rocksdb.writebuffer.size": "128m"
> • "state.backend.rocksdb.writebuffer.count": "134217728"
> • "state.backend.rocksdb.ttl.compaction.filter.enabled": "true"
> 
> 
> Thanks,
> Prashant
> 
> On Mon, Nov 27, 2023 at 7:11 PM Xuyang  wrote:
> Hi, Prashant.
> I think Yu Chen has given some professional troubleshooting ideas. Another thing 
> I want to ask is whether you use any user-defined functions that store objects? 
> You could first dump the memory and get more details to check for memory leaks.
> 
> --
> Best!
> Xuyang
> On 2023-11-28 09:12:01, "Yu Chen" wrote:
> Hi Prashant,
> 
> An OOMKill is mostly caused by the working set memory exceeding the pod limit. 
> We should first increase the JVM overhead memory appropriately with the following 
> params to see whether that solves the problem.
> ```
> taskmanager.memory.jvm-overhead.max=1536m
> taskmanager.memory.jvm-overhead.min=1536m
> ```
> 
> If the OOMKill still occurs, we should suspect that the task has an 
> off-heap memory leak.
> One of the most popular tools, jemalloc, is recommended. You have to install 
> jemalloc in the image according to the documentation [1].
> After that, you can enable jemalloc profiling by setting an environment variable 
> for the TaskManager:
> ```
> containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
> ```
> After the job runs for a while, you can log into the TaskManager to generate SVG 
> files and troubleshoot the off-heap memory distribution.
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
> ```
> 
> Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit 
> exceeded", then you should dump the heap memory to find out which objects are 
> taking up so much of the memory.
> Here is the command for you.
> ```
> jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
> ```
> 
> [1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes 
> 202308.01 documentation <https://support.sentieon.com/appnotes/jemalloc/>
> 
> Best,
> Yu Chen
> From: prashant parbhane 
> Sent: November 28, 2023, 1:42
> To: user@flink.apache.org 
> Subject: oomkill issue
> 
> Hello,
> 
> We have been facing this OOMKill issue, where task managers are getting 
> restarted with this error.
> I am seeing memory consumption increase in a linear manner; I have given as much 
> memory and CPU as possible but am still facing the same issue.
> 
> We are using RocksDB for the state backend. Is there a way to find which 
> operator is causing this issue, or which operator takes the most memory? Any 
> good practices that we can follow? We are using broadcast state.
> 
> Thanks,
> Prashant



Re: Re: oomkill issue

2023-12-04 Thread prashant parbhane
Hi Yu,

Thanks for your reply.

When I run the script below

```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
```
I am getting the error below

```
Gathering CPU profile from http:///pprof/profile?seconds=30 for 30 seconds to
  /root/jeprof/java.1701718686.
Be patient...
Failed to get profile: curl -s --fail --max-time 90 'http:///pprof/profile?seconds=30' >
/root/jeprof/.tmp.java.1701718686.: No such file or directory
```
Any input on this?

However, the OOMKill was resolved with the RocksDB configurations below:

   - "state.backend.rocksdb.memory.managed": "false"
   - "state.backend.rocksdb.block.cache-size": "10m"
   - "state.backend.rocksdb.writebuffer.size": "128m"
   - "state.backend.rocksdb.writebuffer.count": "134217728"
   - "state.backend.rocksdb.ttl.compaction.filter.enabled": "true"


Thanks,
Prashant

On Mon, Nov 27, 2023 at 7:11 PM Xuyang  wrote:

> Hi, Prashant.
> I think Yu Chen has given some professional troubleshooting ideas. Another
> thing I want to ask is whether you use any user-defined functions that store
> objects? You could first dump the memory and get more details to check for
> memory leaks.
>
>
> --
> Best!
> Xuyang
>
>
> On 2023-11-28 09:12:01, "Yu Chen" wrote:
>
> Hi Prashant,
>
> An OOMKill is mostly caused by the working set memory exceeding the pod limit.
> We should first increase the JVM overhead memory appropriately with the
> following params to see whether that solves the problem.
> ```
> taskmanager.memory.jvm-overhead.max=1536m
> taskmanager.memory.jvm-overhead.min=1536m
> ```
>
> If the OOMKill still occurs, we should suspect that the task has an
> off-heap memory leak.
> One of the most popular tools, jemalloc, is recommended. You have to
> install jemalloc in the image according to the documentation [1].
> After that, you can enable jemalloc profiling by setting an environment variable
> for the TaskManager:
> ```
> containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
> ```
> After the job runs for a while, you can log into the TaskManager to generate
> SVG files and troubleshoot the off-heap memory distribution.
> ```
> jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
> ```
>
> Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit
> exceeded", then you should dump the heap memory to find out which objects are
> taking up so much of the memory.
> Here is the command for you.
> ```
> jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
> ```
>
> [1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes
> 202308.01 documentation <https://support.sentieon.com/appnotes/jemalloc/>
>
> Best,
> Yu Chen
> --
> From: prashant parbhane 
> Sent: November 28, 2023, 1:42
> To: user@flink.apache.org 
> Subject: oomkill issue
>
> Hello,
>
> We have been facing this OOMKill issue, where task managers are getting
> restarted with this error.
> I am seeing memory consumption increase in a linear manner; I have given as much
> memory and CPU as possible but am still facing the same issue.
>
> We are using RocksDB for the state backend. Is there a way to find which
> operator is causing this issue, or which operator takes the most memory? Any
> good practices that we can follow? We are using broadcast state.
>
> Thanks,
> Prashant
>
>


Re: Re: oomkill issue

2023-11-27 Thread Xuyang
Hi, Prashant.
I think Yu Chen has given some professional troubleshooting ideas. Another thing I 
want to ask is whether you use any user-defined functions that store objects? 
You could first dump the memory and get more details to check for memory leaks.




--

Best!
Xuyang




On 2023-11-28 09:12:01, "Yu Chen" wrote:

Hi Prashant,


An OOMKill is mostly caused by the working set memory exceeding the pod limit. 
We should first increase the JVM overhead memory appropriately with the following 
params to see whether that solves the problem.
```
taskmanager.memory.jvm-overhead.max=1536m
taskmanager.memory.jvm-overhead.min=1536m
```


If the OOMKill still occurs, we should suspect that the task has an off-heap 
memory leak.
One of the most popular tools, jemalloc, is recommended. You have to install 
jemalloc in the image according to the documentation [1].
After that, you can enable jemalloc profiling by setting an environment variable 
for the TaskManager:
```
containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
```
After the job runs for a while, you can log into the TaskManager to generate SVG 
files and troubleshoot the off-heap memory distribution.
```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
```


Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit exceeded", 
then you should dump the heap memory to find out which objects are taking up so much 
of the memory.
Here is the command for you.
```
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
```


[1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes 202308.01 
documentation <https://support.sentieon.com/appnotes/jemalloc/>


Best,
Yu Chen
From: prashant parbhane 
Sent: November 28, 2023, 1:42
To: user@flink.apache.org 
Subject: oomkill issue
 
Hello, 


We have been facing this OOMKill issue, where task managers are getting 
restarted with this error.
I am seeing memory consumption increase in a linear manner; I have given as much 
memory and CPU as possible but am still facing the same issue.


We are using RocksDB for the state backend. Is there a way to find which 
operator is causing this issue, or which operator takes the most memory? Any good 
practices that we can follow? We are using broadcast state.


Thanks,
Prashant

Re: oomkill issue

2023-11-27 Thread Yu Chen
Hi Prashant,

An OOMKill is mostly caused by the working set memory exceeding the pod limit.
We should first increase the JVM overhead memory appropriately with the following 
params to see whether that solves the problem.
```
taskmanager.memory.jvm-overhead.max=1536m
taskmanager.memory.jvm-overhead.min=1536m
```
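
For context, these two options are carved out of the total TaskManager process 
memory; a rough sketch of how they might be set together (the process size is a 
placeholder and should match your container/pod limit):

```
# total memory of the TaskManager process; should match the pod limit
taskmanager.memory.process.size: 6g
# enlarge the JVM overhead (native memory outside the heap and managed memory)
taskmanager.memory.jvm-overhead.min: 1536m
taskmanager.memory.jvm-overhead.max: 1536m
```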

If the OOMKill still occurs, we should suspect that the task has an off-heap 
memory leak.
One of the most popular tools, jemalloc, is recommended. You have to install 
jemalloc in the image according to the documentation [1].
After that, you can enable jemalloc profiling by setting an environment variable 
for the TaskManager:
```
containerized.taskmanager.env.MALLOC_CONF=prof:true,lg_prof_interval:30,lg_prof_sample:16,prof_prefix:/tmp/jeprof.out
```
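
For reference, here is a rough sketch of how jemalloc could be wired into a 
Debian-based Flink image; the package name and library path are assumptions, and 
a jemalloc build with profiling enabled (--enable-prof) may be required, as 
described in the documentation [1]:

```
# inside the image build: install jemalloc (Debian/Ubuntu package name is an
# assumption; a source build with --enable-prof may be needed for heap
# profiling, see [1])
apt-get update && apt-get install -y libjemalloc2

# preload jemalloc for the TaskManager JVM via Flink's env passthrough
# (library path is the usual Debian/Ubuntu x86_64 location)
containerized.taskmanager.env.LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
```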
After the job runs for a while, you can log into the TaskManager to generate SVG 
files and troubleshoot the off-heap memory distribution.
```
jeprof --show_bytes -svg `which java` /tmp/jeprof.out.301.1009.i1009.heap > 1009.svg
```
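
If it helps, comparing two consecutive dumps often makes a leak easier to spot; 
jeprof supports a --base option for that (the file names below are just examples 
of successive dumps):

```
# show only the allocations that grew between two consecutive dumps
jeprof --show_bytes -svg --base=/tmp/jeprof.out.301.100.i100.heap \
  `which java` /tmp/jeprof.out.301.200.i200.heap > growth.svg
```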

Otherwise, if the OOMKill no longer occurs but you hit "GC overhead limit exceeded", 
then you should dump the heap memory to find out which objects are taking up so much 
of the memory.
Here is the command for you.
```
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
```
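
A small sketch of how this is typically run inside the TaskManager container; the 
grep pattern is an assumption about the main class name, so adjust it to whatever 
`jps -l` actually shows:

```
# find the TaskManager JVM pid (main class name may differ by deployment)
jps -l | grep -i taskmanager

# dump only live objects of that JVM in binary hprof format
jmap -dump:live,format=b,file=/tmp/heap.hprof <pid>
```

The resulting heap.hprof can then be opened with a heap analyzer such as Eclipse 
MAT or VisualVM.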

[1] Using jemalloc to Optimize Memory Allocation — Sentieon Appnotes 202308.01 
documentation <https://support.sentieon.com/appnotes/jemalloc/>

Best,
Yu Chen

From: prashant parbhane 
Sent: November 28, 2023, 1:42
To: user@flink.apache.org 
Subject: oomkill issue

Hello,

We have been facing this OOMKill issue, where task managers are getting 
restarted with this error.
I am seeing memory consumption increase in a linear manner; I have given as much 
memory and CPU as possible but am still facing the same issue.

We are using RocksDB for the state backend. Is there a way to find which 
operator is causing this issue, or which operator takes the most memory? Any good 
practices that we can follow? We are using broadcast state.

Thanks,
Prashant


oomkill issue

2023-11-27 Thread prashant parbhane
Hello,

We have been facing this OOMKill issue, where task managers are getting
restarted with this error.
I am seeing memory consumption increase in a linear manner; I have given as much
memory and CPU as possible but am still facing the same issue.

We are using RocksDB for the state backend. Is there a way to find which
operator is causing this issue, or which operator takes the most memory? Any
good practices that we can follow? We are using broadcast state.

Thanks,
Prashant