[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-06-12 Thread Jun Rao (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-2147:
---
   Resolution: Implemented
Fix Version/s: 0.8.3
   Status: Resolved  (was: Patch Available)

[~eapache], thanks for reporting back. Since there is a workaround for this 
issue in 0.8.2 and the issue is already fixed in trunk, I will resolve this 
issue for now.

 Unbalanced replication can cause extreme purgatory growth
 ----------------------------------------------------------

 Key: KAFKA-2147
 URL: https://issues.apache.org/jira/browse/KAFKA-2147
 Project: Kafka
  Issue Type: Bug
  Components: purgatory, replication
Affects Versions: 0.8.2.1
Reporter: Evan Huus
Assignee: Jun Rao
 Fix For: 0.8.3

 Attachments: KAFKA-2147.patch, KAFKA-2147_2015-05-14_09:41:56.patch, 
 KAFKA-2147_2015-05-15_16:14:44.patch, 
 craig-kafka-purgatory-queue-size-issue.png, purgatory.log, purgatory.log.gz, 
 watch-lists.log


 Apologies in advance, this is going to be a bit of a complex description, 
 mainly because we've seen this issue in several different ways and we're 
 still tying them together in terms of root cause and analysis.
 It is worth noting now that we have all our producers set up to send 
 RequiredAcks==-1, and that this includes all our MirrorMakers.
 I understand the purgatory is being rewritten (again) for 0.8.3. Hopefully 
 that will incidentally fix this issue, or at least render it moot.
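 For concreteness, RequiredAcks==-1 is the sarama (Go client) producer setting 
 equivalent to acks=all: the producer waits for acknowledgement from the full 
 set of in-sync replicas before considering a send successful. A minimal 
 sketch of such a producer follows; the broker address and topic name are 
 placeholders, and the import path assumes a recent sarama release rather than 
 exactly what we run:
 {noformat}
 package main
 
 import (
 	"log"
 
 	"github.com/IBM/sarama" // older deployments used github.com/Shopify/sarama
 )
 
 func main() {
 	cfg := sarama.NewConfig()
 	// RequiredAcks == -1 (WaitForAll): wait for all in-sync replicas to ack.
 	cfg.Producer.RequiredAcks = sarama.WaitForAll
 	cfg.Producer.Return.Successes = true // required by SyncProducer
 
 	// "broker1:9092" and "some-topic" are placeholders.
 	producer, err := sarama.NewSyncProducer([]string{"broker1:9092"}, cfg)
 	if err != nil {
 		log.Fatal(err)
 	}
 	defer producer.Close()
 
 	if _, _, err := producer.SendMessage(&sarama.ProducerMessage{
 		Topic: "some-topic",
 		Value: sarama.StringEncoder("hello"),
 	}); err != nil {
 		log.Fatal(err)
 	}
 }
 {noformat}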
 h4. Symptoms
 Fetch request purgatory on a broker or brokers grows rapidly and steadily at 
 a rate of roughly 1-5K requests per second. Heap memory used also grows to 
 keep pace. When 4-5 million requests have accumulated in purgatory, the 
 purgatory is drained, causing a substantial latency spike. The node will tend 
 to drop leadership, replicate, and recover.
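 (For anyone watching for this: the purgatory sizes are exposed as broker JMX 
 gauges; on 0.8.x the MBeans are approximately the following, though the exact 
 names may vary by version.)
 {noformat}
 kafka.server:type=FetchRequestPurgatory,name=PurgatorySize
 kafka.server:type=ProducerRequestPurgatory,name=PurgatorySize
 {noformat}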
 h5. Case 1 - MirrorMaker
 We first noticed this case when enabling MirrorMaker. We had one primary 
 cluster already, with many producers and consumers. We created a second, 
 identical cluster and enabled replication from the original to the new 
 cluster on some topics using MirrorMaker. This caused all six nodes in the 
 new cluster to exhibit the symptom in lockstep - their purgatories would all 
 grow together, and get drained within about 20 seconds of each other. The 
 cluster-wide latency spikes at this time caused several problems for us.
 Turning MM on and off turned the problem on and off very precisely. When we 
 stopped MM, the purgatories would all drop to normal levels immediately, and 
 would start climbing again when we restarted it.
 Note that these are the *fetch* purgatories on the brokers that MM was 
 *producing* to, which indicates fairly strongly that this is a replication 
 issue, not an MM issue.
 This particular cluster and MM setup was abandoned for other reasons before 
 we could make much progress debugging.
 h5. Case 2 - Broker 6
 The second time we saw this issue was on the newest broker (broker 6) in the 
 original cluster. For a long time we were running with five nodes, and 
 eventually added a sixth to handle the increased load. At first, we moved 
 only a handful of higher-volume partitions to this broker. Later, we created 
 a group of new topics (totalling around 100 partitions) for testing purposes 
 that were spread automatically across all six nodes. These topics saw 
 occasional traffic, but were generally unused. At this point broker 6 had 
 leadership for about an equal number of high-volume and unused partitions, 
 about 15-20 of each.
 Around this time (we don't have detailed enough data to prove real 
 correlation unfortunately), the issue started appearing on this broker as 
 well, but not on any of the other brokers in the cluster.
 h4. Debugging
 The first thing we tried was to reduce 
 {{fetch.purgatory.purge.interval.requests}} from the default of 1000 to a 
 much lower value of 200. This had no noticeable effect at all.
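 (For context, that is a broker-side setting in server.properties; the change 
 we tried was simply:)
 {noformat}
 # Purge the fetch request purgatory after this many requests.
 # Broker default is 1000; we lowered it to 200 with no noticeable effect.
 fetch.purgatory.purge.interval.requests=200
 {noformat}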
 We then enabled debug logging on broker06 and started looking through that. I 
 can attach complete log samples if necessary, but the thing that stood out 
 for us was a substantial number of the following lines:
 {noformat}
 [2015-04-23 20:05:15,196] DEBUG [KafkaApi-6] Putting fetch request with 
 correlation id 49939 from client ReplicaFetcherThread-0-6 into purgatory 
 (kafka.server.KafkaApis)
 {noformat}
 The volume of these lines seemed to match (approximately) the fetch purgatory 
 growth on that broker.
 At this point we developed a hypothesis (detailed below) which guided our 
 subsequent debugging tests:
 - Setting up a daemon to produce regular random data to all of the topics led 
 by kafka06 (specifically the ones which would otherwise receive no data) 
 substantially alleviated the problem (a rough sketch of such a keepalive 
 daemon follows this description).
 - Doing an additional rebalance of the cluster in order to move a number of 
 other topics with regular data to kafka06 appears to have solved the problem 
 completely.
 h4. Hypothesis
 Current versions (0.8.2.1 and earlier) have issues with the replica fetcher 
 not backing off correctly (KAFKA-1461, KAFKA-2082 and others). I believe that 
 in a very specific situation, the replica fetcher thread of one broker can 
 spam another broker with requests that fill up its purgatory and do not get 
 properly flushed.
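 A minimal sketch of the keepalive daemon mentioned in the Debugging section 
 above, again using sarama. The topic names, broker address, payload size and 
 one-second interval are illustrative placeholders, not exactly what we ran:
 {noformat}
 package main
 
 import (
 	"log"
 	"math/rand"
 	"time"
 
 	"github.com/IBM/sarama" // older deployments used github.com/Shopify/sarama
 )
 
 // Topics that would otherwise receive no traffic (placeholders).
 var idleTopics = []string{"test-topic-1", "test-topic-2"}
 
 func main() {
 	cfg := sarama.NewConfig()
 	cfg.Producer.RequiredAcks = sarama.WaitForAll
 	cfg.Producer.Return.Successes = true
 
 	producer, err := sarama.NewSyncProducer([]string{"broker1:9092"}, cfg)
 	if err != nil {
 		log.Fatal(err)
 	}
 	defer producer.Close()
 
 	// Periodically send a small random payload to every otherwise-idle topic
 	// so that the replica fetchers for those topics regularly see new data.
 	for range time.Tick(1 * time.Second) {
 		payload := make([]byte, 16)
 		rand.Read(payload)
 		for _, topic := range idleTopics {
 			if _, _, err := producer.SendMessage(&sarama.ProducerMessage{
 				Topic: topic,
 				Value: sarama.ByteEncoder(payload),
 			}); err != nil {
 				log.Printf("keepalive send to %s failed: %v", topic, err)
 			}
 		}
 	}
 }
 {noformat}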

[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-20 Thread Craig Allsop (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Craig Allsop updated KAFKA-2147:

Attachment: craig-kafka-purgatory-queue-size-issue.png


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-15 Thread Jun Rao (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-2147:
---
Attachment: KAFKA-2147_2015-05-15_16:14:44.patch


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-15 Thread Evan Huus (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evan Huus updated KAFKA-2147:
-
Attachment: purgatory.log

Purgatory log from the attached patch.

Based on my understanding, the check is working (it sees the right number of 
elements); it just isn't waking up often enough.


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-14 Thread Jun Rao (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-2147:
---
Attachment: KAFKA-2147_2015-05-14_09:41:56.patch


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Evan Huus (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Evan Huus updated KAFKA-2147:
-
Attachment: watch-lists.log

watch list purging logs


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-2147:
---
Assignee: Jun Rao  (was: Joel Koshy)
  Status: Patch Available  (was: Open)


[jira] [Updated] (KAFKA-2147) Unbalanced replication can cause extreme purgatory growth

2015-05-12 Thread Jun Rao (JIRA)

 [ https://issues.apache.org/jira/browse/KAFKA-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jun Rao updated KAFKA-2147:
---
Attachment: KAFKA-2147.patch
