[jira] [Commented] (KAFKA-6408) Kafka MirrorMaker doesn't replicate messages when .* regex is used

2018-01-01 Thread Waleed Fateem (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307393#comment-16307393
 ] 

Waleed Fateem commented on KAFKA-6408:
--

The documentation is actually wrong and I created another JIRA to get that 
corrected:
https://issues.apache.org/jira/browse/KAFKA-6301

If you use '*' you'll see in the MirrorMaker's log that it starts up with a 
blank whitelist parameter as opposed to using '.*' which shows up correctly. 
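
For illustration, '*' on its own is not even a valid Java regular expression, 
which is consistent with MirrorMaker failing to apply it (a hedged sketch of 
the regex behavior only, not MirrorMaker's actual code path):

{code}
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class WhitelistCheck {
    public static void main(String[] args) {
        // '.*' is a valid regex and matches every topic name
        System.out.println(Pattern.matches(".*", "topic1")); // prints: true

        // '*' alone is rejected: the quantifier has nothing to repeat
        try {
            Pattern.compile("*");
        } catch (PatternSyntaxException e) {
            System.out.println("'*' rejected: " + e.getDescription());
        }
    }
}
{code}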

> Kafka MirrorMaker doesn't replicate messages when .* regex is used
> --
>
> Key: KAFKA-6408
> URL: https://issues.apache.org/jira/browse/KAFKA-6408
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.11.0.0
>Reporter: Waleed Fateem
>Priority: Minor
>
> When using the regular expression .* for the whitelist parameter in Kafka 
> MirrorMaker in order to mirror all topics, MirrorMaker doesn't replicate 
> any messages. I was then able to see messages flowing and being replicated 
> between the two Kafka clusters again once I changed the whitelist 
> configuration to another regular expression, such as 'topic1 | topic2 | 
> topic3'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (KAFKA-6408) Kafka MirrorMaker doesn't replicate messages when .* regex is used

2018-01-01 Thread Waleed Fateem (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307393#comment-16307393
 ] 

Waleed Fateem edited comment on KAFKA-6408 at 1/1/18 10:06 AM:
---

The documentation is actually wrong and I created another JIRA to get that 
corrected:
https://issues.apache.org/jira/browse/KAFKA-6301

If you use * you'll see in the MirrorMaker's log that it starts up with a blank 
whitelist parameter as opposed to using .* which shows up correctly. 


was (Author: waleedfateem):
The documentation is actually wrong and I created another JIRA to get that 
corrected:
https://issues.apache.org/jira/browse/KAFKA-6301

If you use '*' you'll see in the MirrorMaker's log that it starts up with a 
blank whitelist parameter as opposed to using '.*' which shows up correctly. 

> Kafka MirrorMaker doesn't replicate messages when .* regex is used
> --
>
> Key: KAFKA-6408
> URL: https://issues.apache.org/jira/browse/KAFKA-6408
> Project: Kafka
>  Issue Type: Bug
>  Components: tools
>Affects Versions: 0.11.0.0
>Reporter: Waleed Fateem
>Priority: Minor
>
> When using the regular expression .* for the whitelist parameter in Kafka 
> MirrorMaker in order to mirror all topics, MirrorMaker doesn't replicate 
> any messages. I was then able to see messages flowing and being replicated 
> between the two Kafka clusters again once I changed the whitelist 
> configuration to another regular expression, such as 'topic1 | topic2 | 
> topic3'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6307) mBeanName should be removed before returning from JmxReporter#removeAttribute()

2018-01-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307410#comment-16307410
 ] 

ASF GitHub Bot commented on KAFKA-6307:
---

ijuma closed pull request #4307: KAFKA-6307 mBeanName should be removed before 
returning from JmxReporter#removeAttribute()
URL: https://github.com/apache/kafka/pull/4307
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/clients/src/main/java/org/apache/kafka/common/metrics/JmxReporter.java b/clients/src/main/java/org/apache/kafka/common/metrics/JmxReporter.java
index 0c49224657c..063fb3b9338 100644
--- a/clients/src/main/java/org/apache/kafka/common/metrics/JmxReporter.java
+++ b/clients/src/main/java/org/apache/kafka/common/metrics/JmxReporter.java
@@ -75,6 +75,9 @@ public void init(List<KafkaMetric> metrics) {
         }
     }
 
+    boolean containsMbean(String mbeanName) {
+        return mbeans.containsKey(mbeanName);
+    }
     @Override
     public void metricChange(KafkaMetric metric) {
         synchronized (LOCK) {
@@ -86,19 +89,21 @@ public void metricChange(KafkaMetric metric) {
     @Override
     public void metricRemoval(KafkaMetric metric) {
         synchronized (LOCK) {
-            KafkaMbean mbean = removeAttribute(metric);
+            MetricName metricName = metric.metricName();
+            String mBeanName = getMBeanName(prefix, metricName);
+            KafkaMbean mbean = removeAttribute(metric, mBeanName);
             if (mbean != null) {
-                if (mbean.metrics.isEmpty())
+                if (mbean.metrics.isEmpty()) {
                     unregister(mbean);
-                else
+                    mbeans.remove(mBeanName);
+                } else
                     reregister(mbean);
             }
         }
     }
 
-    private KafkaMbean removeAttribute(KafkaMetric metric) {
+    private KafkaMbean removeAttribute(KafkaMetric metric, String mBeanName) {
         MetricName metricName = metric.metricName();
-        String mBeanName = getMBeanName(prefix, metricName);
         KafkaMbean mbean = this.mbeans.get(mBeanName);
         if (mbean != null)
             mbean.removeAttribute(metricName.name());
diff --git a/clients/src/test/java/org/apache/kafka/common/metrics/JmxReporterTest.java b/clients/src/test/java/org/apache/kafka/common/metrics/JmxReporterTest.java
index 98e49f3abf5..28179f319a4 100644
--- a/clients/src/test/java/org/apache/kafka/common/metrics/JmxReporterTest.java
+++ b/clients/src/test/java/org/apache/kafka/common/metrics/JmxReporterTest.java
@@ -16,6 +16,7 @@
  */
 package org.apache.kafka.common.metrics;
 
+import org.apache.kafka.common.MetricName;
 import org.apache.kafka.common.metrics.stats.Avg;
 import org.apache.kafka.common.metrics.stats.Total;
 import org.junit.Test;
@@ -35,7 +36,8 @@ public void testJmxRegistration() throws Exception {
         Metrics metrics = new Metrics();
         MBeanServer server = ManagementFactory.getPlatformMBeanServer();
         try {
-            metrics.addReporter(new JmxReporter());
+            JmxReporter reporter = new JmxReporter();
+            metrics.addReporter(reporter);
 
             assertFalse(server.isRegistered(new ObjectName(":type=grp1")));
 
@@ -48,13 +50,19 @@ public void testJmxRegistration() throws Exception {
             assertTrue(server.isRegistered(new ObjectName(":type=grp2")));
             assertEquals(0.0, server.getAttribute(new ObjectName(":type=grp2"), "pack.bean2.total"));
 
-            metrics.removeMetric(metrics.metricName("pack.bean1.avg", "grp1"));
+            MetricName metricName = metrics.metricName("pack.bean1.avg", "grp1");
+            String mBeanName = JmxReporter.getMBeanName("", metricName);
+            assertTrue(reporter.containsMbean(mBeanName));
+            metrics.removeMetric(metricName);
+            assertFalse(reporter.containsMbean(mBeanName));
 
             assertFalse(server.isRegistered(new ObjectName(":type=grp1")));
             assertTrue(server.isRegistered(new ObjectName(":type=grp2")));
             assertEquals(0.0, server.getAttribute(new ObjectName(":type=grp2"), "pack.bean2.total"));
 
-            metrics.removeMetric(metrics.metricName("pack.bean2.total", "grp2"));
+            metricName = metrics.metricName("pack.bean2.total", "grp2");
+            metrics.removeMetric(metricName);
+            assertFalse(reporter.containsMbean(mBeanName));
 
             assertFalse(server.isRegistered(new ObjectName(":type=grp1")));
             assertFalse(server.isRegistered(new ObjectName(":type=grp2")));


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[jira] [Resolved] (KAFKA-6307) mBeanName should be removed before returning from JmxReporter#removeAttribute()

2018-01-01 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma resolved KAFKA-6307.

   Resolution: Fixed
Fix Version/s: 1.1.0

> mBeanName should be removed before returning from 
> JmxReporter#removeAttribute()
> ---
>
> Key: KAFKA-6307
> URL: https://issues.apache.org/jira/browse/KAFKA-6307
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: siva santhalingam
> Fix For: 1.1.0
>
>
> JmxReporter$KafkaMbean showed up near the top in the first histo output from 
> KAFKA-6199.
> In JmxReporter#removeAttribute():
> {code}
> KafkaMbean mbean = this.mbeans.get(mBeanName);
> if (mbean != null)
>     mbean.removeAttribute(metricName.name());
> return mbean;
> {code}
> mbeans.remove(mBeanName) should be called before returning.
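
The merged fix (see the PR diff in the comment above) removes the map entry once 
its last attribute is gone. A self-contained model of that pattern (illustrative 
only; MbeanRegistry is a hypothetical stand-in, not the Kafka source):

{code}
import java.util.HashMap;
import java.util.Map;

class MbeanRegistry {
    // mbean name -> (attribute name -> value)
    private final Map<String, Map<String, Object>> mbeans = new HashMap<>();

    void removeAttribute(String mBeanName, String attribute) {
        Map<String, Object> attrs = mbeans.get(mBeanName);
        if (attrs == null)
            return;
        attrs.remove(attribute);
        // The fix: drop the now-empty entry instead of leaving it in the map,
        // so the object becomes unreachable and can be garbage collected.
        if (attrs.isEmpty())
            mbeans.remove(mBeanName);
    }
}
{code}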



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6307) mBeanName should be removed before returning from JmxReporter#removeAttribute()

2018-01-01 Thread Ismael Juma (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ismael Juma updated KAFKA-6307:
---
Fix Version/s: 1.0.1

> mBeanName should be removed before returning from 
> JmxReporter#removeAttribute()
> ---
>
> Key: KAFKA-6307
> URL: https://issues.apache.org/jira/browse/KAFKA-6307
> Project: Kafka
>  Issue Type: Bug
>Reporter: Ted Yu
>Assignee: siva santhalingam
> Fix For: 1.1.0, 1.0.1
>
>
> JmxReporter$KafkaMbean showed up near the top in the first histo output from 
> KAFKA-6199.
> In JmxReporter#removeAttribute():
> {code}
> KafkaMbean mbean = this.mbeans.get(mBeanName);
> if (mbean != null)
>     mbean.removeAttribute(metricName.name());
> return mbean;
> {code}
> mbeans.remove(mBeanName) should be called before returning.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6363) Use MockAdminClient for any unit tests that depend on AdminClient

2018-01-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307433#comment-16307433
 ] 

ASF GitHub Bot commented on KAFKA-6363:
---

h314to opened a new pull request #4371: KAFKA-6363: Use MockAdminClient for any 
unit tests that depend on Adm…
URL: https://github.com/apache/kafka/pull/4371
 
 
   …inClient
   
   * Implement MockAdminClient.deleteTopics
   * Use MockAdminClient instead of MockKafkaAdminClientEnv in 
StreamsResetterTest
   * Rename MockKafkaAdminClientEnv to AdminClientUnitTestEnv
   * Use MockAdminClient instead of MockKafkaAdminClientEnv in TopicAdminTest
   * Rename KafkaAdminClient to AdminClientUnitTestEnv in 
KafkaAdminClientTest.java
   * Migrate StreamThreadTest to MockAdminClient
   
   *More detailed description of your change,
   if necessary. The PR title and PR message become
   the squashed commit message, so use a separate
   comment to ping reviewers.*
   
   *Summary of testing strategy (including rationale)
   for the feature or bug fix. Unit and/or integration
   tests are expected for any behaviour change and
   system tests should be considered for larger changes.*
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Use MockAdminClient for any unit tests that depend on AdminClient
> -
>
> Key: KAFKA-6363
> URL: https://issues.apache.org/jira/browse/KAFKA-6363
> Project: Kafka
>  Issue Type: Bug
>Reporter: Guozhang Wang
>  Labels: newbie
>
> Today we have a few unit tests other than KafkaAdminClientTest that rely on 
> MockKafkaAdminClientEnv.
> About MockKafkaAdminClientEnv and MockAdminClient, my thoughts:
> 1. MockKafkaAdminClientEnv is actually using a MockClient for the inner 
> KafkaClient; it should only be used for the unit test of KafkaAdminClient 
> itself.
> 2. For any other unit tests on classes that depend on AdminClient, we should 
> be using the MockAdminClient that mocks the whole AdminClient.
> So I suggest: 1) in TopicAdminTest, use MockAdminClient instead; 2) in 
> KafkaAdminClientTest, use MockClient and add a new static constructor that 
> takes a KafkaClient; 3) remove MockKafkaAdminClientEnv.
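
A minimal sketch of the pattern the issue advocates (TopicCleaner is a 
hypothetical example class, not Kafka code; deleteTopics(Collection<String>) is 
public AdminClient API): code under test depends on the AdminClient abstraction, 
so a unit test can inject MockAdminClient instead of standing up a broker.

{code}
import java.util.Collections;
import org.apache.kafka.clients.admin.AdminClient;

class TopicCleaner {
    private final AdminClient admin;

    TopicCleaner(AdminClient admin) {
        this.admin = admin; // tests can pass a MockAdminClient here
    }

    void delete(String topic) {
        // No network I/O in tests: a mock that implements deleteTopics can
        // record and verify this call.
        admin.deleteTopics(Collections.singleton(topic));
    }
}
{code}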



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6413) ReassignPartitionsCommand#parsePartitionReassignmentData() should give better error message when JSON is malformed

2018-01-01 Thread Ted Yu (JIRA)
Ted Yu created KAFKA-6413:
-

 Summary: 
ReassignPartitionsCommand#parsePartitionReassignmentData() should give better 
error message when JSON is malformed
 Key: KAFKA-6413
 URL: https://issues.apache.org/jira/browse/KAFKA-6413
 Project: Kafka
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Minor


In this thread: 
http://search-hadoop.com/m/Kafka/uyzND1J9Hizcxo0X?subj=Partition+reassignment+data+file+is+empty
Allen gave an example JSON string with an extra comma for which the 
partitionsToBeReassigned returned by 
ReassignPartitionsCommand#parsePartitionReassignmentData() was empty.

I tried the following example, where a right bracket is removed:
{code}
val (partitionsToBeReassigned, replicaAssignment) =
  ReassignPartitionsCommand.parsePartitionReassignmentData(
    "{\"version\":1,\"partitions\":[{\"topic\":\"metrics\",\"partition\":0,\"replicas\":[1,2]},{\"topic\":\"metrics\",\"partition\":1,\"replicas\":[2,3]},}");
{code}
The returned partitionsToBeReassigned is empty.

The parser should give a better error message for malformed JSON strings.
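
One way to make the failure loud rather than silent (a hedged sketch using 
Jackson, which Kafka already depends on; ReassignmentJsonCheck is illustrative, 
not the actual parser):

{code}
import java.io.IOException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

class ReassignmentJsonCheck {
    static JsonNode parseOrExplain(String json) {
        try {
            // readTree() throws on malformed input such as a dangling comma,
            // instead of silently yielding an empty result.
            return new ObjectMapper().readTree(json);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                "Malformed partition reassignment JSON: " + e.getMessage(), e);
        }
    }
}
{code}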



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6413) ReassignPartitionsCommand#parsePartitionReassignmentData() should give better error message when JSON is malformed

2018-01-01 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated KAFKA-6413:
--
Description: 
In this thread: 
http://search-hadoop.com/m/Kafka/uyzND1J9Hizcxo0X?subj=Partition+reassignment+data+file+is+empty
Allen gave an example JSON string with an extra comma for which the 
partitionsToBeReassigned returned by 
ReassignPartitionsCommand#parsePartitionReassignmentData() was empty.

I tried the following example, where a right bracket is removed:
{code}
val (partitionsToBeReassigned, replicaAssignment) =
  ReassignPartitionsCommand.parsePartitionReassignmentData(
    "{\"version\":1,\"partitions\":[{\"topic\":\"metrics\",\"partition\":0,\"replicas\":[1,2]},{\"topic\":\"metrics\",\"partition\":1,\"replicas\":[2,3]},}");
{code}
The returned partitionsToBeReassigned is empty (and no exception was thrown).

The parser should give a better error message for malformed JSON strings.

  was:
In this thread: 
http://search-hadoop.com/m/Kafka/uyzND1J9Hizcxo0X?subj=Partition+reassignment+data+file+is+empty
 , Allen gave an example JSON string with extra comma where 
partitionsToBeReassigned returned by 
ReassignPartitionsCommand#parsePartitionReassignmentData() was empty.

I tried the following example where a right bracket is removed:
{code}
val (partitionsToBeReassigned, replicaAssignment) = 
ReassignPartitionsCommand.parsePartitionReassignmentData(

"{\"version\":1,\"partitions\":[{\"topic\":\"metrics\",\"partition\":0,\"replicas\":[1,2]},{\"topic\":\"metrics\",\"partition\":1,\"replicas\":[2,3]},}");
{code}
The returned partitionsToBeReassigned is empty.

The parser should give better error message for malformed JSON string.


> ReassignPartitionsCommand#parsePartitionReassignmentData() should give better 
> error message when JSON is malformed
> --
>
> Key: KAFKA-6413
> URL: https://issues.apache.org/jira/browse/KAFKA-6413
> Project: Kafka
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Minor
>
> In this thread: 
> http://search-hadoop.com/m/Kafka/uyzND1J9Hizcxo0X?subj=Partition+reassignment+data+file+is+empty
> Allen gave an example JSON string with an extra comma for which the 
> partitionsToBeReassigned returned by 
> ReassignPartitionsCommand#parsePartitionReassignmentData() was empty.
> I tried the following example, where a right bracket is removed:
> {code}
> val (partitionsToBeReassigned, replicaAssignment) =
>   ReassignPartitionsCommand.parsePartitionReassignmentData(
>     "{\"version\":1,\"partitions\":[{\"topic\":\"metrics\",\"partition\":0,\"replicas\":[1,2]},{\"topic\":\"metrics\",\"partition\":1,\"replicas\":[2,3]},}");
> {code}
> The returned partitionsToBeReassigned is empty (and no exception was thrown).
> The parser should give a better error message for malformed JSON strings.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6412) Improve synchronization in CachingKeyValueStore methods

2018-01-01 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307468#comment-16307468
 ] 

ASF GitHub Bot commented on KAFKA-6412:
---

tedyu opened a new pull request #4372: KAFKA-6412 Improve synchronization in 
CachingKeyValueStore methods
URL: https://github.com/apache/kafka/pull/4372
 
 
   Currently, CachingKeyValueStore methods are synchronized at the method level.
   
   It seems we can use a read lock for getters and a write lock for put / delete 
methods.
   
   For getInternal(), if the underlying thread is the stream thread, getInternal() 
may trigger eviction. This can be handled by having the stream thread obtain the 
write lock at the beginning of the method.
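
   A sketch of the locking scheme described above (illustrative; LockedStore and 
its methods are hypothetical stand-ins, not the actual CachingKeyValueStore 
code):

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockedStore<K, V> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<K, V> map = new HashMap<>();

    // Reads that may evict (e.g. when called from the stream thread) must take
    // the write lock up front, because eviction mutates shared state.
    V get(K key, boolean mayEvict) {
        Lock l = mayEvict ? lock.writeLock() : lock.readLock();
        l.lock();
        try {
            return map.get(key);
        } finally {
            l.unlock();
        }
    }

    void put(K key, V value) {
        lock.writeLock().lock();
        try {
            map.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }
}
{code}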
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Improve synchronization in CachingKeyValueStore methods
> ---
>
> Key: KAFKA-6412
> URL: https://issues.apache.org/jira/browse/KAFKA-6412
> Project: Kafka
>  Issue Type: Improvement
>  Components: streams
>Reporter: Ted Yu
> Attachments: k-6412.v1.txt
>
>
> Currently, CachingKeyValueStore methods are synchronized at the method level.
> It seems we can use a read lock for getters and a write lock for put / delete 
> methods.
> For getInternal(), if the underlying thread is the stream thread, getInternal() 
> may trigger eviction. This can be handled by having the stream thread obtain 
> the write lock at the beginning of the method.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6414) Inverse replication for replicas that are far behind

2018-01-01 Thread Ivan Babrou (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307608#comment-16307608
 ] 

Ivan Babrou commented on KAFKA-6414:


I think it's up to the people who are going to execute this to hash out the 
details; I'm just a man with an idea.

> Inverse replication for replicas that are far behind
> 
>
> Key: KAFKA-6414
> URL: https://issues.apache.org/jira/browse/KAFKA-6414
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Reporter: Ivan Babrou
>
> Let's suppose the following starting point:
> * 1 topic
> * 1 partition
> * 1 reader
> * 24h retention period
> * leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x 
> reader + 1x slack = total outbound)
> In this scenario, when a replica fails and needs to be brought back from 
> scratch, you can catch up at 2x the inbound bandwidth (1x regular replication 
> + 1x slack used).
> 2x catch-up speed means the replica will reach the point where the leader is 
> now in 24h / 2x = 12h. However, in those 12h the oldest 12h of the topic will 
> fall off the retention cliff and be deleted. There's absolutely no use for 
> this data; it will never be read from the replica in any scenario. And this 
> is not even counting the fact that we still need to replicate 12h more of 
> data that accumulated since the time we started.
> My suggestion is to refill sufficiently out-of-sync replicas backwards from 
> the tip: newest segments first, oldest segments last. Then we can stop when 
> we hit the retention cliff and replicate far less data. The lower the ratio 
> of catch-up bandwidth to inbound bandwidth, the higher the returns would be. 
> This will also set a hard cap on catch-up time: it will be no higher than the 
> retention period if catch-up speed is >1x (if it's less, you're forever out 
> of ISR anyway).
> What exactly "sufficiently out of sync" means in terms of lag is a topic for 
> a debate. The default segment size is 1GiB, I'd say that being >1 full 
> segments behind probably warrants this.
> As of now, the solution for slow recovery appears to be to reduce retention 
> to speed up recovery, which doesn't seem very friendly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (KAFKA-6414) Inverse replication for replicas that are far behind

2018-01-01 Thread Ivan Babrou (JIRA)
Ivan Babrou created KAFKA-6414:
--

 Summary: Inverse replication for replicas that are far behind
 Key: KAFKA-6414
 URL: https://issues.apache.org/jira/browse/KAFKA-6414
 Project: Kafka
  Issue Type: Bug
  Components: replication
Reporter: Ivan Babrou


Let's suppose the following starting point:

* 1 topic
* 1 partition
* 1 reader
* 24h retention period
* leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x 
reader + 1x slack = total outbound)

In this scenario, when a replica fails and needs to be brought back from 
scratch, you can catch up at 2x the inbound bandwidth (1x regular replication + 
1x slack used).

2x catch-up speed means the replica will reach the point where the leader is now 
in 24h / 2x = 12h. However, in those 12h the oldest 12h of the topic will fall 
off the retention cliff and be deleted. There's absolutely no use for this data; 
it will never be read from the replica in any scenario. And this is not even 
counting the fact that we still need to replicate 12h more of data that 
accumulated since the time we started.

My suggestion is to refill sufficiently out-of-sync replicas backwards from the 
tip: newest segments first, oldest segments last. Then we can stop when we hit 
the retention cliff and replicate far less data. The lower the ratio of catch-up 
bandwidth to inbound bandwidth, the higher the returns would be. This will also 
set a hard cap on catch-up time: it will be no higher than the retention period 
if catch-up speed is >1x (if it's less, you're forever out of ISR anyway).

What exactly "sufficiently out of sync" means in terms of lag is a topic for a 
debate. The default segment size is 1GiB, I'd say that being >1 full segments 
behind probably warrants this.

As of now, the solution for slow recovery appears to be to reduce retention to 
speed up recovery, which doesn't seem very friendly.
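
For completeness, the catch-up arithmetic above generalizes as follows (a small 
derivation added for clarity; B is the backlog and r is the catch-up speed as a 
multiple of the inbound rate):

\[
t_{\mathrm{reach}} = \frac{B}{r} = \frac{24\,\mathrm{h}}{2} = 12\,\mathrm{h},
\qquad
t_{\mathrm{catchup}} = \frac{B}{r-1} = \frac{24\,\mathrm{h}}{2-1} = 24\,\mathrm{h}
\]

So at 2x, by the time the replica is fully caught up, the entire backlog it 
copied first has already aged out of the 24h retention window, which is exactly 
the data a newest-first refill would skip.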



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (KAFKA-6414) Inverse replication for replicas that are far behind

2018-01-01 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307606#comment-16307606
 ] 

Ted Yu commented on KAFKA-6414:
---

Interesting.

Since the proposal changes the semantics of replica failover handling, does 
this need a KIP?

> Inverse replication for replicas that are far behind
> 
>
> Key: KAFKA-6414
> URL: https://issues.apache.org/jira/browse/KAFKA-6414
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Reporter: Ivan Babrou
>
> Let's suppose the following starting point:
> * 1 topic
> * 1 partition
> * 1 reader
> * 24h retention period
> * leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x 
> reader + 1x slack = total outbound)
> In this scenario, when a replica fails and needs to be brought back from 
> scratch, you can catch up at 2x the inbound bandwidth (1x regular replication 
> + 1x slack used).
> 2x catch-up speed means the replica will reach the point where the leader is 
> now in 24h / 2x = 12h. However, in those 12h the oldest 12h of the topic will 
> fall off the retention cliff and be deleted. There's absolutely no use for 
> this data; it will never be read from the replica in any scenario. And this 
> is not even counting the fact that we still need to replicate 12h more of 
> data that accumulated since the time we started.
> My suggestion is to refill sufficiently out-of-sync replicas backwards from 
> the tip: newest segments first, oldest segments last. Then we can stop when 
> we hit the retention cliff and replicate far less data. The lower the ratio 
> of catch-up bandwidth to inbound bandwidth, the higher the returns would be. 
> This will also set a hard cap on catch-up time: it will be no higher than the 
> retention period if catch-up speed is >1x (if it's less, you're forever out 
> of ISR anyway).
> What exactly "sufficiently out of sync" means in terms of lag is a topic for 
> a debate. The default segment size is 1GiB, I'd say that being >1 full 
> segments behind probably warrants this.
> As of now, the solution for slow recovery appears to be to reduce retention 
> to speed up recovery, which doesn't seem very friendly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (KAFKA-6414) Inverse replication for replicas that are far behind

2018-01-01 Thread Ivan Babrou (JIRA)

 [ 
https://issues.apache.org/jira/browse/KAFKA-6414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Babrou updated KAFKA-6414:
---
Description: 
Let's suppose the following starting point:

* 1 topic
* 1 partition
* 1 reader
* 24h retention period
* leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x 
reader + 1x slack = total outbound)

In this scenario, when a replica fails and needs to be brought back from 
scratch, you can catch up at 2x the inbound bandwidth (1x regular replication + 
1x slack used).

2x catch-up speed means the replica will reach the point where the leader is now 
in 24h / 2x = 12h. However, in those 12h the oldest 12h of the topic will fall 
off the retention cliff and be deleted. There's absolutely no use for this data; 
it will never be read from the replica in any scenario. And this is not even 
counting the fact that we still need to replicate 12h more of data that 
accumulated since the time we started.

My suggestion is to refill sufficiently out-of-sync replicas backwards from the 
tip: newest segments first, oldest segments last. Then we can stop when we hit 
the retention cliff and replicate far less data. The lower the ratio of catch-up 
bandwidth to inbound bandwidth, the higher the returns would be. This will also 
set a hard cap on catch-up time: it will be no higher than the retention period 
if catch-up speed is >1x (if it's less, you're forever out of ISR anyway).

What exactly "sufficiently out of sync" means in terms of lag is a topic for a 
debate. The default segment size is 1GiB, I'd say that being >1 full segments 
behind probably warrants this.

As of now, the solution for slow recovery appears to be to reduce retention to 
speed up recovery, which doesn't seem very friendly.

  was:
Let's suppose the following starting point:

* 1 topic
* 1 partition
* 1 reader
* 24h retention period
* leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x 
reader + 1x slack = total outbound)

In this scenario, when replica fails and needs to be brought back from scratch, 
you can catch up at 2x inbound bandwidth (1x regular replication + 1x slack 
used).

2x catch-up speed means replica will be at the point where leader is now in 24h 
/ 2x = 12h. However, in 12h the oldest 12h of the topic will fall out of 
retention cliff and will be deleted. There's absolutely to use for this data, 
it will never be read from the replica in any scenario. And this not even 
including the fact that we still need to replicate 12h more of data that 
accumulated since the time we started.

My suggestion is to refill sufficiently out of sync replicas backwards from the 
tip: newest segments first, oldest segments last. Then we can stop when we hit 
retention cliff and replicate far less data. The lower the ratio of catch-up 
bandwidth to inbound bandwidth, the higher the returns would be. This will also 
set a hard cap on retention time: it will be no higher than retention period if 
catch-up speed if >1x (if it's less, you're forever out of ISR anyway).

What exactly "sufficiently out of sync" means in terms of lag is a topic for a 
debate. The default segment size is 1GiB, I'd say that being >1 full segments 
behind probably warrants this.

As of now, the solution for slow recovery appears to be to reduce retention to 
speed up recovery, which doesn't seem very friendly.


> Inverse replication for replicas that are far behind
> 
>
> Key: KAFKA-6414
> URL: https://issues.apache.org/jira/browse/KAFKA-6414
> Project: Kafka
>  Issue Type: Bug
>  Components: replication
>Reporter: Ivan Babrou
>
> Let's suppose the following starting point:
> * 1 topic
> * 1 partition
> * 1 reader
> * 24h retention period
> * leader outbound bandwidth is 3x of inbound bandwidth (1x replication + 1x 
> reader + 1x slack = total outbound)
> In this scenario, when a replica fails and needs to be brought back from 
> scratch, you can catch up at 2x the inbound bandwidth (1x regular replication 
> + 1x slack used).
> 2x catch-up speed means the replica will reach the point where the leader is 
> now in 24h / 2x = 12h. However, in those 12h the oldest 12h of the topic will 
> fall off the retention cliff and be deleted. There's absolutely no use for 
> this data; it will never be read from the replica in any scenario. And this 
> is not even counting the fact that we still need to replicate 12h more of 
> data that accumulated since the time we started.
> My suggestion is to refill sufficiently out-of-sync replicas backwards from 
> the tip: newest segments first, oldest segments last. Then we can stop when 
> we hit the retention cliff and replicate far less data. The lower the ratio 
> of catch-up bandwidth to inbound bandwidth, the higher the returns would be. 
> This will also set a hard cap on catch-up time: it will be no higher than