[ 
https://issues.apache.org/jira/browse/SAMZA-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marouane RAJI updated SAMZA-2265:
---------------------------------
    Description: 
Hi,

We recently upgraded one of our high-throughput Samza jobs from 0.13.1 to 1.0, 
and then to 1.1. Both later versions appear to have a memory leak: memory usage 
grows continuously until the containers fail and YARN restarts them.
Note that we upgraded other, smaller Samza jobs (smaller in both container 
specs and throughput) without any issues.

Job specs:
 * reads ~70k msg/sec
 * 211 input topics, including one broadcast topic (2 msg/day, used for config 
updates)
 * 1 output topic

```
job.container.count=110
yarn.container.memory.mb=4000
yarn.container.cpu.cores=8
yarn.am.container.cpu.cores=8
yarn.am.container.memory.mb=1024
task.opts=-Xmx2800M
task.checkpoint.replication.factor=2
```
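
For context, the settings above leave a fixed budget for everything outside the Java heap. The sketch below only does that arithmetic; the class and method names are hypothetical, and it assumes YARN enforces `yarn.container.memory.mb` against the whole container process, which depends on the cluster configuration.

```java
// Illustrative arithmetic only; this class is hypothetical and not part of Samza.
public class ContainerMemoryHeadroom {

    /** Memory (in MB) left in the container for everything outside the Java heap. */
    static long nonHeapHeadroomMb(long containerMemoryMb, long maxHeapMb) {
        if (maxHeapMb > containerMemoryMb) {
            throw new IllegalArgumentException("max heap exceeds container memory");
        }
        return containerMemoryMb - maxHeapMb;
    }

    public static void main(String[] args) {
        long containerMb = 4000; // yarn.container.memory.mb
        long heapMb = 2800;      // -Xmx2800M from task.opts
        System.out.println("Non-heap headroom (MB): "
                + nonHeapHeadroomMb(containerMb, heapMb));
    }
}
```

With these values the heap may grow to 2800 MB, leaving 1200 MB for metaspace, thread stacks, direct buffers and other native memory.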

Below is the memory consumption of one container in both versions:

!image-2019-07-01-09-47-11-241.png!

 

Heap dump comparison:

!image-2019-07-01-09-48-45-876.png!

 

The difference between the two versions keeps growing slowly; the main cause is 
the increase in byte[].

In versions 1.0 and 1.1, the main reference holding these bytes seems to be 
KafkaCheckpointManager:
 !image-2019-07-01-09-50-04-693.png!
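
A slow heap increase like the one described above can also be tracked from inside the JVM between heap dumps. The following is a minimal sketch using the standard `java.lang.management` API; the `HeapGrowthSampler` class, its method names and the one-second interval are assumptions for illustration and are not part of Samza.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

// Hypothetical helper for illustration; not part of Samza.
public class HeapGrowthSampler {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Average growth in bytes per second between two samples. */
    static double growthBytesPerSecond(long startBytes, long endBytes, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return (endBytes - startBytes) * 1000.0 / elapsedMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long intervalMillis = 1000; // assumed sampling interval
        long previous = usedHeapBytes();
        for (int i = 0; i < 5; i++) {
            Thread.sleep(intervalMillis);
            long current = usedHeapBytes();
            System.out.printf("used=%d bytes, growth=%.0f bytes/s%n",
                    current, growthBytesPerSecond(previous, current, intervalMillis));
            previous = current;
        }
    }
}
```

Individual samples are noisy because of garbage collection, so only the long-term trend is meaningful for spotting a leak.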

 

We found this PR, which should be included in 1.1: 
[https://github.com/apache/samza/pull/993]. We are not sure whether it is 
related to this issue.

Thanks. 

 

 

  was:
Hi,

We recently upgraded one of our high-throughput Samza jobs from 0.13.1 to 1.0, 
and then to 1.1. Both later versions appear to have a memory leak: memory usage 
grows continuously until the containers fail and YARN restarts them.
Note that we upgraded other, smaller Samza jobs (smaller in both container 
specs and throughput) without any issues.

Job specs:
 * reads ~70k msg/sec
 * 211 input topics, including one broadcast topic (2 msg/day, used for config 
updates)
 * 1 output topic

```
job.container.count=110
yarn.container.memory.mb=4000
yarn.container.cpu.cores=8
yarn.am.container.cpu.cores=8
yarn.am.container.memory.mb=1024
task.opts=-Xmx2800M
task.checkpoint.replication.factor=2
```

Below is the memory consumption of one container in both versions:

!image-2019-07-01-09-47-11-241.png!

 

Heap dump comparison:

!image-2019-07-01-09-48-45-876.png!

 

The difference between the two versions keeps growing slowly; the main cause is 
the increase in byte[].

In versions 1.0 and 1.1, the main reference holding these bytes seems to be 
KafkaCheckpointManager:
 !image-2019-07-01-09-50-04-693.png!

 

Could this PR solve this issue [https://github.com/apache/samza/pull/993], 
since it would release the KafkaConsumer used for checkpointing?

Thanks. 

 

 


> Memory leak potentially due to Kafka Checkpoint Management
> ----------------------------------------------------------
>
>                 Key: SAMZA-2265
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2265
>             Project: Samza
>          Issue Type: Bug
>    Affects Versions: 1.0, 1.1
>         Environment:  
>  
>            Reporter: Marouane RAJI
>            Priority: Major
>         Attachments: image-2019-07-01-09-47-11-241.png, 
> image-2019-07-01-09-48-45-876.png, image-2019-07-01-09-50-04-693.png
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
