Re: Copy availability when broker goes down?

2013-03-04 Thread Chris Curtin
I'll grab HEAD in a few minutes and see if the changes help.

Issues submitted:

https://issues.apache.org/jira/browse/KAFKA-783

https://issues.apache.org/jira/browse/KAFKA-782

Thanks,

Chris


On Mon, Mar 4, 2013 at 1:15 PM, Jun Rao  wrote:

> Chris,
>
> As Neha said, the 1st copy of a partition is the preferred replica and we
> try to spread them evenly across the brokers. When a broker is restarted,
> we don't automatically move the leader back to the preferred replica
> though. You will have to run a command line
> tool PreferredReplicaLeaderElectionCommand to balance the leaders again.
>
> Also, I recommend that you try the latest code in 0.8. A bunch of issues
> have been fixed since Jan. You will have to wipe out all your ZK and Kafka
> data first though.
>
> Thanks,
>
> Jun
>
> On Mon, Mar 4, 2013 at 8:32 AM, Chris Curtin wrote:
>
> > Hi,
> >
> > (Hmm, take 2. Apache's spam filter doesn't like the word that describes the
> > copy of the data, 'R-E-P-L-I-C-A', so it blocked my first message from
> > sending. Using 'copy' below to mean that concept.)
> >
> > I’m running 0.8.0 with HEAD from end of January (not the merge you guys
> > did last night).
> >
> > I’m testing how the producer responds to loss of brokers, what errors are
> > produced, etc., and noticed some strange things as I shut down servers in
> > my cluster.
> >
> > Setup:
> > 4 node cluster
> > 1 topic, 3 copies in the set
> > 10 partitions numbered 0-9
> >
> > State of the cluster is determined using TopicMetadataRequest.
> >
> > When I start with a full cluster (2nd column is the partition id, next is
> > leader, then the copy set and ISR):
> >
> > Java: 0:vrd03.atlnp1 R:[  vrd03.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd01.atlnp1]
> > Java: 1:vrd04.atlnp1 R:[  vrd04.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[
> > vrd04.atlnp1 vrd01.atlnp1 vrd02.atlnp1]
> > Java: 2:vrd03.atlnp1 R:[  vrd01.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[
> > vrd03.atlnp1 vrd01.atlnp1 vrd02.atlnp1]
> > Java: 3:vrd03.atlnp1 R:[  vrd02.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
> > Java: 4:vrd03.atlnp1 R:[  vrd03.atlnp1 vrd01.atlnp1 vrd02.atlnp1] I:[
> > vrd03.atlnp1 vrd01.atlnp1 vrd02.atlnp1]
> > Java: 5:vrd03.atlnp1 R:[  vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
> > Java: 6:vrd03.atlnp1 R:[  vrd01.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd01.atlnp1]
> > Java: 7:vrd04.atlnp1 R:[  vrd02.atlnp1 vrd04.atlnp1 vrd01.atlnp1] I:[
> > vrd04.atlnp1 vrd01.atlnp1 vrd02.atlnp1]
> > Java: 8:vrd03.atlnp1 R:[  vrd03.atlnp1 vrd02.atlnp1 vrd04.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
> > Java: 9:vrd03.atlnp1 R:[  vrd04.atlnp1 vrd03.atlnp1 vrd01.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd01.atlnp1]
> >
> > When I stop vrd01, which isn’t the leader for any partition:
> >
> > Java: 0:vrd03.atlnp1 R:[ ] I:[]
> > Java: 1:vrd04.atlnp1 R:[ ] I:[]
> > Java: 2:vrd03.atlnp1 R:[ ] I:[]
> > Java: 3:vrd03.atlnp1 R:[  vrd02.atlnp1 vrd03.atlnp1 vrd04.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
> > Java: 4:vrd03.atlnp1 R:[ ] I:[]
> > Java: 5:vrd03.atlnp1 R:[  vrd04.atlnp1 vrd02.atlnp1 vrd03.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
> > Java: 6:vrd03.atlnp1 R:[ ] I:[]
> > Java: 7:vrd04.atlnp1 R:[ ] I:[]
> > Java: 8:vrd03.atlnp1 R:[  vrd03.atlnp1 vrd02.atlnp1 vrd04.atlnp1] I:[
> > vrd03.atlnp1 vrd04.atlnp1 vrd02.atlnp1]
> > Java: 9:vrd03.atlnp1 R:[ ] I:[]
> >
> > Does this mean that none of the partitions that used to have a copy on
> > vrd01 are updating ANY of the copies?
> >
> > I ran another test, again starting with a full cluster where all
> > partitions had a full set of copies. When I stopped the broker that was
> > the leader for 9 of the 10 partitions, the new leaders were all elected on
> > one machine instead of across the set of 3. Should the leaders have been
> > better spread out? Also, the copies weren’t fully populated either.
> >
> > Last test: started with a full cluster, showing all copies available.
> > Stopped a broker that was not a leader for any partition. Noticed that the
> > partitions where the stopped machine was in the copy set didn’t show any
> > copies, like above. Let the cluster sit for 30 minutes and didn’t see any
> > new copies being brought online. How should the cluster handle a machine
> > that is down for an extended period of time?
> >
> > I don’t have a new machine I could add to the cluster, but what happens
> > when I do? Will it not be used until a new topic is added, or how does it
> > become a valid option for a copy, or eventually the leader?
> >
> > Thanks,
> >
> > Chris
> >
>
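
For reference, output like the dump above can be produced with the 0.8 Java
API roughly as follows. This is only a sketch: the broker host, port, client
id, and topic name are placeholders, and the formatting is approximate.

    import java.util.Collections;
    import java.util.List;
    import kafka.cluster.Broker;
    import kafka.javaapi.PartitionMetadata;
    import kafka.javaapi.TopicMetadata;
    import kafka.javaapi.TopicMetadataRequest;
    import kafka.javaapi.TopicMetadataResponse;
    import kafka.javaapi.consumer.SimpleConsumer;

    public class ClusterStateDump {
        public static void main(String[] args) {
            // Placeholder broker and topic; any live broker in the cluster will do.
            SimpleConsumer consumer =
                new SimpleConsumer("vrd01.atlnp1", 9092, 100000, 64 * 1024, "state-dump");
            try {
                TopicMetadataRequest request =
                    new TopicMetadataRequest(Collections.singletonList("my-topic"));
                TopicMetadataResponse response = consumer.send(request);
                for (TopicMetadata topic : response.topicsMetadata()) {
                    for (PartitionMetadata pm : topic.partitionsMetadata()) {
                        // The leader can be null while no leader is available.
                        String leader = (pm.leader() == null) ? "none" : pm.leader().host();
                        System.out.println("Java: " + pm.partitionId() + ":" + leader
                            + " R:" + hosts(pm.replicas()) + " I:" + hosts(pm.isr()));
                    }
                }
            } finally {
                consumer.close();
            }
        }

        // Render a broker list as "[ host1 host2 ...]" to mirror the dump above.
        private static String hosts(List<Broker> brokers) {
            StringBuilder sb = new StringBuilder("[");
            for (Broker b : brokers) {
                sb.append(' ').append(b.host());
            }
            return sb.append(']').toString();
        }
    }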


Re: Copy availability when broker goes down?

2013-03-04 Thread Jun Rao
Chris,

As Neha said, the 1st copy of a partition is the preferred replica and we
try to spread them evenly across the brokers. When a broker is restarted,
we don't automatically move the leader back to the preferred replica
though. You will have to run a command line
tool PreferredReplicaLeaderElectionCommand to balance the leaders again.
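
Whether a given partition currently needs that rebalancing can be read
straight off the same topic metadata: compare the current leader with the
first broker in the replica list. A rough sketch, assuming PartitionMetadata
objects obtained via a TopicMetadataRequest as in the dump earlier in the
thread:

    import kafka.cluster.Broker;
    import kafka.javaapi.PartitionMetadata;

    // Returns true when the partition's leader is not its preferred replica
    // (the first broker in the replica list), i.e. when running
    // PreferredReplicaLeaderElectionCommand would move the leader back.
    static boolean leaderIsNotPreferred(PartitionMetadata pm) {
        if (pm.leader() == null || pm.replicas().isEmpty()) {
            return true; // no leader, or no replica information at all
        }
        Broker preferred = pm.replicas().get(0);
        return pm.leader().id() != preferred.id();
    }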

Also, I recommend that you try the latest code in 0.8. A bunch of issues
have been fixed since Jan. You will have to wipe out all your ZK and Kafka
data first though.

Thanks,

Jun


Re: Copy availability when broker goes down?

2013-03-04 Thread Neha Narkhede
Chris,

Thanks for reporting the issues and running those tests.

1. For problem 1, if this is the output of the topic metadata request after
shutting down a broker that leads no partitions, then that is a bug. Can you
please file a bug and describe a reproducible test case there?
2. For problem 2, we always try to make the preferred replica (the 1st replica
in the list of all replicas for a partition) the leader, if it is available.
We intend to spread the preferred replicas for all partitions of a topic
evenly across the brokers. If this is not happening, we need to look into it.
Can you please file a bug and describe your test case there?
3. A machine that is down, whether for a short time or a long time, is taken
out of the ISR. When it starts back up again, it has to bootstrap from the
current leader (see the sketch after this list).
4. If you have a new machine that you want to add to the cluster, you might
want to reassign some partition replicas to the new broker. We have a tool
(that has not been thoroughly tested yet) that allows you to do that.
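
For 3, one way to watch this during such tests is to flag, from the same
topic metadata, any assigned replica that has dropped out of the ISR. A rough
sketch, illustrative only, using the same PartitionMetadata type as in the
sketches earlier in the thread:

    import java.util.ArrayList;
    import java.util.List;
    import kafka.cluster.Broker;
    import kafka.javaapi.PartitionMetadata;

    // Returns the assigned replicas that are currently missing from the ISR,
    // e.g. because their broker is down or still catching up to the leader.
    static List<Broker> replicasOutOfSync(PartitionMetadata pm) {
        List<Broker> lagging = new ArrayList<Broker>();
        for (Broker replica : pm.replicas()) {
            boolean inSync = false;
            for (Broker isrMember : pm.isr()) {
                if (isrMember.id() == replica.id()) {
                    inSync = true;
                    break;
                }
            }
            if (!inSync) {
                lagging.add(replica);
            }
        }
        return lagging;
    }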

Thanks,
Neha

