[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-03 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851730#comment-15851730
 ] 

Ismael Juma commented on KAFKA-4725:


Nice catch. A contribution via a PR would indeed be welcome.

> Kafka broker fails due to OOM when producer exceeds throttling quota for
> extended periods of time
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-4725
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4725
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, producer
>    Affects Versions: 0.10.1.1
>         Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
>            Reporter: Jeff Chao
>            Priority: Critical
>              Labels: reliability
>             Fix For: 0.10.3.0, 0.10.2.1
>
>         Attachments: oom-references.png
>
>
> Steps to Reproduce:
> 1. Create a non-compacted topic with 1 partition
> 2. Set a produce quota of 512 KB/s
> 3. Send messages at 20 MB/s
> 4. Observe heap memory growth as time progresses
>
> Investigation:
> While running performance tests with a user configured with a produce quota,
> we found that the lead broker serving the requests would exhaust heap memory
> if the producer sustained an inbound request throughput greater than the
> produce quota.
> Upon further investigation, we took a heap dump from that broker process and
> discovered that the ThrottledResponse object has an indirect reference to the
> byte[] holding the messages associated with the ProduceRequest.
> We're happy to contribute a patch, but in the meantime we wanted to raise
> the issue first and get feedback from the community.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-03 Thread Tim Carey-Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851880#comment-15851880
 ] 

Tim Carey-Smith commented on KAFKA-4725:


Hi there, 

Jeff and I have prototyped a fix for this bug. We repeated our stress tests 
against a new build and have not yet been able to reproduce the leak. 

The branch is hosted on GitHub at 
https://github.com/apache/kafka/compare/0.10.1.1...heroku:fix-throttled-response-leak

Before we open a PR, which base branch should we target?

Thanks, 
Tim



[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-03 Thread Ismael Juma (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851885#comment-15851885
 ] 

Ismael Juma commented on KAFKA-4725:


Great. The target should be trunk; we cherry-pick to other branches during the
merge.



[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-03 Thread Jeff Chao (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851898#comment-15851898
 ] 

Jeff Chao commented on KAFKA-4725:
--

Ok, we'll base it off trunk and open up a PR. Thanks.



[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-03 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852189#comment-15852189
 ] 

ASF GitHub Bot commented on KAFKA-4725:
---

GitHub user halorgium opened a pull request:

https://github.com/apache/kafka/pull/2496

KAFKA-4725: Stop leaking messages in produce request body when requests are 
delayed

This change is in response to 
[KAFKA-4725](https://issues.apache.org/jira/browse/KAFKA-4725). 

When a produce request is received, if the user/client is exceeding their 
produce quota, the response will be delayed until the quota is refilled 
appropriately. 

Unfortunately, the request body is still referenced in the callback, which 
in turn leaks the messages contained within the request. 

This change allows the `KafkaApis` method to take ownership of the request 
body from the `RequestChannel.Request` object. 

I am not sure whether this breaks other invariants which are assumed within 
other parts of Kafka. 
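
The retention pattern described above can be illustrated with a minimal Java sketch. This is not the Kafka code or the actual patch: `Request`, `takeBody`, `handleLeaky`, `handleOwned`, and `append` are hypothetical names chosen for illustration, and the real `RequestChannel.Request` and `KafkaApis` classes differ. The sketch only shows why a delayed callback that captures a request keeps its payload reachable, and how handing ownership of the body to the handler avoids that.

```java
import java.util.function.IntSupplier;

public class ThrottledResponseSketch {

    // Hypothetical stand-in for a network request that owns a large payload.
    static final class Request {
        final int correlationId;
        private byte[] body;

        Request(int correlationId, byte[] body) {
            this.correlationId = correlationId;
            this.body = body;
        }

        // Hands the payload to the caller and drops the request's own reference.
        byte[] takeBody() {
            byte[] taken = body;
            body = null;
            return taken;
        }

        boolean holdsBody() {
            return body != null;
        }
    }

    // Stand-in for appending the messages to the log; returns the bytes written.
    static int append(byte[] payload) {
        return payload.length;
    }

    // Leaky variant: the returned callback captures the request, and the
    // request still references its payload, so the payload stays reachable
    // for as long as the callback is queued.
    static IntSupplier handleLeaky(Request request) {
        append(request.body);
        return () -> request.correlationId;
    }

    // Ownership variant: the handler takes the payload out of the request
    // first. The callback still captures the request, but the request no
    // longer references the payload once this method returns.
    static IntSupplier handleOwned(Request request) {
        byte[] body = request.takeBody();
        append(body);
        return () -> request.correlationId;
    }

    public static void main(String[] args) {
        Request leaky = new Request(1, new byte[1024]);
        IntSupplier leakyCallback = handleLeaky(leaky);

        Request owned = new Request(2, new byte[1024]);
        IntSupplier ownedCallback = handleOwned(owned);

        // While both callbacks wait for their delay, only the leaky request
        // still pins its payload.
        System.out.println("leaky request still holds body: " + leaky.holdsBody());
        System.out.println("owned request still holds body: " + owned.holdsBody());
        System.out.println("correlation ids: " + leakyCallback.getAsInt()
                + ", " + ownedCallback.getAsInt());
    }
}
```

In both variants the callback can still build a response from the request metadata; the difference is only whether the message bytes remain reachable while the response is delayed.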

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heroku/kafka fix-throttled-response-leak

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/kafka/pull/2496.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2496


commit ddb0541b156db546fbf6e065670fb25d6e4baba2
Author: Tim Carey-Smith 
Date:   2017-02-01T23:18:43Z

Stop leaking produce request in throttled requests

Further isolate the request from the callbacks

Remove pointless changes

Move body ownership logic into RequestChannel






[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-07 Thread Tim Carey-Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856965#comment-15856965
 ] 

Tim Carey-Smith commented on KAFKA-4725:


We have run stress tests on builds that include this patch, based on the 
0.10.1.1 tag, the 0.10.2 branch, and trunk. 
We were unable to reproduce the memory leak and feel comfortable with this 
change. 

> Kafka broker fails due to OOM when producer exceeds throttling quota for
> extended periods of time
> ------------------------------------------------------------------------
>
>                 Key: KAFKA-4725
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4725
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, producer
>    Affects Versions: 0.10.1.1
>         Environment: Ubuntu Trusty (14.04.5), Oracle JDK 8
>            Reporter: Jeff Chao
>            Priority: Critical
>              Labels: reliability
>             Fix For: 0.10.2.0, 0.10.3.0
>
>         Attachments: oom-references.png
>
>
> Steps to Reproduce:
> 1. Create a non-compacted topic with 1 partition
> 2. Set a produce quota of 512 KB/s
> 3. Send messages at 20 MB/s
> 4. Observe heap memory growth as time progresses
>
> Investigation:
> While running performance tests with a user configured with a produce quota,
> we found that the lead broker serving the requests would exhaust heap memory
> if the producer sustained an inbound request throughput greater than the
> produce quota.
> Upon further investigation, we took a heap dump from that broker process and
> discovered that the ThrottledResponse object has an indirect reference to the
> byte[] holding the messages associated with the ProduceRequest.
> We're happy to contribute a patch, but in the meantime we wanted to raise
> the issue first and get feedback from the community.





[jira] [Commented] (KAFKA-4725) Kafka broker fails due to OOM when producer exceeds throttling quota for extended periods of time

2017-02-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-4725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856990#comment-15856990
 ] 

ASF GitHub Bot commented on KAFKA-4725:
---

GitHub user asfgit closed the pull request at:

https://github.com/apache/kafka/pull/2496

