[ 
https://issues.apache.org/jira/browse/KAFKA-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192931#comment-15192931
 ] 

jiang tao commented on KAFKA-2165:
----------------------------------

@Jun Rao Recently our production cluster suddenly showed high network 
traffic, an increase of 50M/s. After several days of analysis, I found that it 
was replication traffic.

The cause: the follower replica's offset was larger than the leader's, so the 
follower's offset was reset to the very beginning. The follower then sent 
another fetch request with the new offset, which caused it to drop out of the 
ISR. Because the partition is 100G, the follower took almost 1 hour to catch 
up. During the replication, the broker network load was very high, almost 
saturated.

> ReplicaFetcherThread: data loss on unknown exception
> ----------------------------------------------------
>
>                 Key: KAFKA-2165
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2165
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8.2.1
>            Reporter: Alexey Ozeritskiy
>         Attachments: KAFKA-2165.patch
>
>
> Sometimes in our cluster a replica drops out of the ISR. The broker then 
> re-downloads the partition from the beginning. We got the following messages 
> in the logs:
> {code}
> # The leader:
> [2015-03-25 11:11:07,796] ERROR [Replica Manager on Broker 21]: Error when 
> processing fetch request for partition [topic,11] offset 54369274 from 
> follower with correlation id 2634499. Possible cause: Request for offset 
> 54369274 but we only have log segments in the range 49322124 to 54369273. 
> (kafka.server.ReplicaManager)
> {code}
> {code}
> # The follower:
> [2015-03-25 11:11:08,816] WARN [ReplicaFetcherThread-0-21], Replica 31 for 
> partition [topic,11] reset its fetch offset from 49322124 to current leader 
> 21's start offset 49322124 (kafka.server.ReplicaFetcherThread)
> [2015-03-25 11:11:08,816] ERROR [ReplicaFetcherThread-0-21], Current offset 
> 54369274 for partition [topic,11] out of range; reset offset to 49322124 
> (kafka.server.ReplicaFetcherThread)
> {code}
> This occurs because we update fetchOffset 
> [here|https://github.com/apache/kafka/blob/0.8.2/core/src/main/scala/kafka/server/AbstractFetcherThread.scala#L124]
>  and only then try to process the message. 
> If any exception other than OffsetOutOfRangeCode occurs, fetchOffset and 
> replica.logEndOffset become unsynchronized.
> On the next fetch iteration we can get 
> fetchOffset>replica.logEndOffset==leaderEndOffset and OffsetOutOfRangeCode.
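
The ordering problem described in the issue can be illustrated with a small toy model. This is only a sketch in Python, not Kafka code: the class ToyFollower, its methods, the helper failing_append, and the small offsets (100 to 103) are all hypothetical names and values chosen for illustration. The out-of-range handling is simplified to "restart from the leader's start offset", which is what the follower log above shows. The second step assumes the case named in the issue, where the leader's end offset equals the follower's real log end offset.

```python
class ToyFollower:
    """Toy model of a follower replica; this is not Kafka code."""

    def __init__(self, log_end_offset):
        self.log_end_offset = log_end_offset  # offsets actually stored locally
        self.fetch_offset = log_end_offset    # offset the next fetch will request

    def handle_response(self, offsets, append):
        # Ordering described in the report: fetch_offset advances first ...
        if offsets:
            self.fetch_offset = offsets[-1] + 1
        # ... and only then is the data processed, which may raise.
        append(offsets)
        self.log_end_offset = self.fetch_offset

    def next_fetch(self, leader_start, leader_end):
        # Simplified out-of-range handling: discard and restart from leader_start.
        if not (leader_start <= self.fetch_offset <= leader_end):
            self.fetch_offset = leader_start
            self.log_end_offset = leader_start
            return "out_of_range"
        return "ok"


def failing_append(offsets):
    # Stand-in for any exception raised while processing fetched data.
    raise RuntimeError("simulated failure while processing fetched data")


follower = ToyFollower(log_end_offset=100)
try:
    follower.handle_response([100, 101, 102], failing_append)
except RuntimeError:
    pass  # swallowed here only so the demo can continue

desynced = (follower.fetch_offset, follower.log_end_offset)  # (103, 100)

# Assumed case from the report: the leader's log ends at the follower's real end.
result = follower.next_fetch(leader_start=0, leader_end=100)
```

In this model, advancing fetch_offset only after append succeeds would keep the two offsets equal when an exception is raised, so the next fetch would not go out of range.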



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
