[jira] [Comment Edited] (KAFKA-1546) Automate replica lag tuning

2015-02-19 Thread Aditya Auradkar (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328427#comment-14328427
 ] 

Aditya Auradkar edited comment on KAFKA-1546 at 2/20/15 1:30 AM:
-

I agree we should model this in terms of time and not in terms of messages. 
While I think it is a bit more natural to model replication lag as "will take 
more than N ms to catch up", I also agree it is tricky to implement correctly.

One possible way to model it is to associate an offset with a commit timestamp 
at the source. For example, assume that a message with offset O is produced on 
the leader for partition X at timestamp T1. If the time now is T2 and a 
replica's log end offset is O (i.e. it has consumed up to O), then the lag 
is (T2 - T1). Is there an easy way to obtain the source timestamp given an 
offset?
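
A minimal sketch of that model, assuming a hypothetical commitTimestampOf 
lookup on the leader (no such offset-to-timestamp index exists today, which is 
exactly the open question above):
{code}
// Hypothetical sketch: time-based lag for a follower, assuming a
// leader-side lookup from an offset to the timestamp at which that
// offset was appended. commitTimestampOf is an assumed helper.
def replicaLagMs(replicaLogEndOffset: Long,
                 leaderLogEndOffset: Long,
                 commitTimestampOf: Long => Long): Long =
  if (replicaLogEndOffset >= leaderLogEndOffset)
    0L // fully caught up, no lag
  else
    System.currentTimeMillis() - commitTimestampOf(replicaLogEndOffset)
{code}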

If this isn't feasible, then I do think that the heuristic proposed in Neha's 
comment is a good one, and I will submit a patch for it.

Also, there are currently two checks for replica lag (used to decide ISR 
membership); see the sketch after this list.
a. keepInSyncMessages - This tracks replica lag as a function of the number of 
messages the replica is trailing behind. I believe we will remove this entirely 
regardless of the approach we choose.
b. keepInSyncTimeMs - This tracks the amount of time between fetch requests. I 
think we can remove this as well.
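
For reference, a rough sketch of how those two checks combine today when the 
leader decides a follower is out of sync (a paraphrase of the 0.8-era logic; 
the ReplicaState stand-in and the field names are approximate, not verbatim 
from the codebase):
{code}
// Minimal stand-in for the leader's per-replica bookkeeping;
// field names are approximate.
case class ReplicaState(logEndOffset: Long, logEndOffsetUpdateTimeMs: Long)

// A follower is dropped from the ISR if it is "stuck" (no log-end-offset
// progress for keepInSyncTimeMs) or "slow" (trailing the leader by more
// than keepInSyncMessages).
def isOutOfSync(r: ReplicaState, leaderLogEndOffset: Long,
                keepInSyncTimeMs: Long, keepInSyncMessages: Long): Boolean = {
  val stuck = System.currentTimeMillis() - r.logEndOffsetUpdateTimeMs > keepInSyncTimeMs
  val slow  = leaderLogEndOffset - r.logEndOffset > keepInSyncMessages
  stuck || slow
}
{code}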




was (Author: aauradkar):
I agree we should model this in terms of time and not in terms of messages. 
While I think it is a bit more natural to model replication lag as "will take 
more than N ms to catch up", I also agree it is tricky to implement correctly.

One possible way to model it is to associate an offset with a commit timestamp 
at the source. For example, assume that a message with offset O is produced on 
the leader for partition X at timestamp T1. If the time now is T2 and a 
replica's log end offset is O (i.e. it has consumed up to O), then the lag 
is (T2 - T1). Is there an easy way to obtain the source timestamp given an 
offset?

If this isn't feasible, then I do think that the originally proposed heuristic 
is a good one, and I will submit a patch for it.

Also, there are currently two checks for replica lag (used to decide ISR 
membership).
a. keepInSyncMessages - This tracks replica lag as a function of the number of 
messages the replica is trailing behind. I believe we will remove this entirely 
regardless of the approach we choose.
b. keepInSyncTimeMs - This tracks the amount of time between fetch requests. I 
think we can remove this as well.



 Automate replica lag tuning
 ---

 Key: KAFKA-1546
 URL: https://issues.apache.org/jira/browse/KAFKA-1546
 Project: Kafka
  Issue Type: Improvement
  Components: replication
Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
Reporter: Neha Narkhede
Assignee: Aditya Auradkar
  Labels: newbie++

 Currently, there is no good way to tune the replica lag configs to 
 automatically account for high and low volume topics on the same cluster. 
 For a low-volume topic it takes a very long time to detect a lagging
 replica, and for a high-volume topic it produces false positives.
 One approach to making this easier would be to have the configuration
 be something like replica.lag.max.ms and translate this into a number
 of messages dynamically based on the throughput of the partition.
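
 A hedged sketch of that translation (replica.lag.max.ms is the name proposed
 above; the per-partition rate input is hypothetical, not an existing API):
{code}
// Hypothetical: derive the message-count threshold from observed
// throughput so that replica.lag.max.ms is the only knob the operator
// sets. messagesInPerSec would come from a per-partition rate meter;
// no such API is assumed to exist as-is.
def effectiveLagMessages(replicaLagMaxMs: Long, messagesInPerSec: Double): Long =
  math.max(1L, (messagesInPerSec * replicaLagMaxMs / 1000.0).toLong)
{code}
 For example, a partition doing 10,000 msg/s with replica.lag.max.ms=10000
 would tolerate up to 100,000 trailing messages, while a near-idle partition
 would fall back to the 1-message floor.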





[jira] [Comment Edited] (KAFKA-1546) Automate replica lag tuning

2014-07-21 Thread Jay Kreps (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068575#comment-14068575
 ] 

Jay Kreps edited comment on KAFKA-1546 at 7/21/14 2:41 PM:
---

I think I was being a little vague. What I was trying to say is this: when each 
fetch is serviced, we check
{code}
  if (!fetchedData.readToEndOfLog) {
    // start the clock only when the lag begins; resetting it on every
    // lagging fetch would keep the timer near zero forever
    if (this.lagBegin < 0)
      this.lagBegin = System.currentTimeMillis()
  } else
    this.lagBegin = -1
{code}
Then the liveness criterion is
{code}
 partitionLagging = this.lagBegin > 0 &&
   System.currentTimeMillis() - this.lagBegin > REPLICA_LAG_TIME_MS
{code}
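
Putting the two snippets together, a self-contained sketch of this heuristic 
(the surrounding class and method names are invented for illustration; only 
readToEndOfLog and the REPLICA_LAG_TIME_MS threshold come from the pseudocode 
above):
{code}
// Illustrative only: tracks, per follower, when it last fell behind.
class FollowerLagTracker(replicaLagTimeMs: Long) {
  private var lagBegin: Long = -1L // -1 means "currently caught up"

  // Call after servicing each fetch from this follower.
  def onFetch(readToEndOfLog: Boolean): Unit =
    if (!readToEndOfLog) {
      if (lagBegin < 0) lagBegin = System.currentTimeMillis() // lag starts now
    } else
      lagBegin = -1L // caught up again; reset the clock

  // Liveness criterion: lagging only if behind for longer than the threshold.
  def partitionLagging: Boolean =
    lagBegin > 0 && System.currentTimeMillis() - lagBegin > replicaLagTimeMs
}
{code}
Under this scheme a high-volume follower that momentarily falls behind but 
catches up within replicaLagTimeMs never trips the check, which addresses the 
false positives called out in the issue description.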


was (Author: jkreps):
I think I was being a little vague. What I was trying to say is this: when each 
fetch is serviced, we check
{code}
  if (fetchedData.size >= maxSize) {
    // start the clock only when the lag begins; resetting it on every
    // lagging fetch would keep the timer near zero forever
    if (this.lagBegin < 0)
      this.lagBegin = System.currentTimeMillis()
  } else
    this.lagBegin = -1
{code}
Then the liveness criterion is
{code}
 partitionLagging = this.lagBegin > 0 &&
   System.currentTimeMillis() - this.lagBegin > REPLICA_LAG_TIME_MS
{code}
