1. Yes, this will be much easier. Okay, let's add it. 2, Okay. This will differ a little bit from the way currently kafka-topics.sh handles alter-topic command, but I think it's a reasonable restriction.
I'll update KIP acordingly to our weekly call. Thanks, Andrii Biletskyi On Mon, Apr 20, 2015 at 10:56 PM, Jun Rao <j...@confluent.io> wrote: > 1. Yes, lag is probably only going to be useful for the admin client. > However, so is isr. It seems to me that we should get lag and isr from the > same request. I was thinking that we can just extend TMR by changing > replicas from an array of int to an array of (int, lag) pairs. Is that too > complicated? > > 3. I was thinking that we just don't allow the cli to change more than one > thing at a time. So, you will get an error if you want to change both > partitions and configs. > > Thanks, > > Jun > > On Sun, Apr 19, 2015 at 8:22 AM, Andrii Biletskyi < > andrii.bilets...@stealth.ly> wrote: > > > Jun, > > > > 1. Yes, seems we can add lag info to the TMR. But before that I wonder > > whether there are other reasons we need this info except for reassign > > partition command? As we discussed earlier the problem with poor > > monitoring capabilities for reassign-partitions (as currently we only > > inform > > users Completed/In Progress per partition) may require separate solution. > > We were thinking about separate Wire protocol request. And I actually > like > > your > > idea about adding some sort of BrokerMetadataRequest for these purposes. > > I actually think we can cover some other items (like rack-awareness) but > > for > > me it deserves a separate KIP really. > > Also, adding Replica->Lag map per partition will make > TopicMetadataResponse > > very sophisticated: > > Map[TopicName, Map[PartitionId, Map[ReplicaId, Lag]]. > > Maybe we need to leave it for a moment and propose new request rather > than > > making a new step towards one "monster" request. > > > > 2. Yes, error per topic. > > The only question is whether we should execute at least the very first > > alter topic > > command from the "duplicated" topic set or return error for all ... I > think > > the more > > predictable and reasonable option for clients would be returning errors > for > > all > > duplicated topics. > > > > 3. Hm, yes. Actually we also have "change topic config" there. But it is > > not > > related to such "replication" commands as increase replicas or change > > replica > > assignment. > > This will make CLI implementation a bit strange: if user specifies > increase > > partitions and change topic config in one line - taking into account 2. > we > > will have > > to create two separate alter topic requests, which were designed as batch > > requests :), > > but probably we can live with it. > > Okay, I will think about a separate error code to cover such cases. > > > > 4. We will need InvalidArgumentTopic (e.g. contains prohibited chars), > > IAPartitions, IAReplicas, IAReplicaAssignment, IATopicConfiguration. > > A server side implementation will be a little bit messy (like dozens "if > > this then this > > error code") but maybe we should think about clients at the first place > > here. > > > > Thanks, > > Andrii Biletskyi > > > > On Fri, Apr 17, 2015 at 1:46 AM, Jun Rao <j...@confluent.io> wrote: > > > > > 1. For the lags, we can add a new field "lags" per partition. It will > > > return for each replica that's not in isr, the replica id and the lag > in > > > messages. Also, if TMR is sent to a non-leader, the response can just > > > include an empty array for isr and lags. > > > > > > 2. So, we will just return a topic level error for the duplicated > topics, > > > right? > > > > > > 3. Yes, it's true that today, one can specify both partitions and > > > replicaAssignment in the TopicCommand. However, partitions is actually > > > ignored. So, it will be clearer if we don't allow users to do this. > > > > > > 4. How many specific error codes like InvalidPartitions and > > InvalidReplicas > > > are needed? If it's not that many, giving out more specific error will > be > > > useful for non-java clients. > > > > > > Thanks, > > > > > > Jun > > > > > > > > > On Wed, Apr 15, 2015 at 10:23 AM, Andrii Biletskyi < > > > andrii.bilets...@stealth.ly> wrote: > > > > > > > Guys, > > > > > > > > Thanks for the discussion! > > > > > > > > Summary: > > > > > > > > 1. Q: How KAFKA-1367 (isr is inconsistent in brokers' metadata cache) > > can > > > > affect implementation? > > > > A: We can fix this issue for the leading broker - ReplicaManager > > (or > > > > Partition) > > > > component should have accurate isr list, then with leading > > broker > > > > having correct > > > > info, to do a describe-topic we will need to define leading > > > brokers > > > > for partitions > > > > and ask those for a correct isr list. > > > > Also, we should consider adding lag information to TMR for > each > > > > follower for > > > > partition reassignment, as Jun suggested above. > > > > > > > > 2. Q: What if user adds different alter commands for the same topic > in > > > > scope > > > > of one batch request? > > > > A: Because of the async nature of AlterTopicRequest it will be > very > > > > hard then > > > > to "assemble" the expected (in terms of checking whether > > request > > > is > > > > complete) > > > > result if we let users do this. Also it will be very > confusing. > > > It > > > > was proposed not to > > > > let users do this (probably add new Error for such cases). > > > > > > > > 3. Q: AlterTopicRequest semantics: now when we merged AlterTopic and > > > > ReassingPartitons in which order AlterTopic fields should be > > > > resolved? > > > > A: This item is not clear. There was a proposal to let user > change > > > only > > > > one thing at a time, e.g. specify either new Replicas, or > > > > ReplicaAssignment. > > > > This can be a simple solution, but it's a very strict rule. > > E.g. > > > > currently with > > > > TopicCommand user can increase nr of partitions and define > > > replica > > > > assignment > > > > for newly added partitions. Taking into account item 2. this > > will > > > > be even harder > > > > for user to achieve this. > > > > > > > > 4. Q: Do we need such accurate errors returned from the server: > > > > InvalidArgumentPartitions, > > > > InvalidArgumentReplicas etc. > > > > A: I started implementation to add proposed error codes and now I > > > think > > > > probably > > > > InvalidArgumentError should be sufficient. We can do simple > > > > validations on > > > > the client side (e.g. AdminClient can ensure nr of partitions > > > > argument is positive), > > > > others - which can be covered only on server (probably > invalid > > > > topic config, > > > > replica assignment includes dead broker etc) - will be done > on > > > > server, and in case > > > > of invalid argument we will return InvalidArgumentError > without > > > > specifying the > > > > concrete field. > > > > > > > > It'd be great if we could cover these remaining issues, looks like > they > > > are > > > > minor, > > > > at least related to specific messages, not the overall protocol. - I > > > think > > > > with that I can > > > > update confluence page and update patch to reflect all discussed > items. > > > > This patch > > > > will probably include Wire protocol messages and server-side code to > > > handle > > > > new > > > > requests. AdminClient and cli-tool implementation can be the next > step. > > > > > > > > Thanks, > > > > Andrii Biletskyi > > > > > > > > On Wed, Apr 15, 2015 at 7:26 PM, Jun Rao <j...@confluent.io> wrote: > > > > > > > > > Andrii, > > > > > > > > > > 500. I think what you suggested also sounds reasonable. Since ISR > is > > > only > > > > > maintained accurately at the leader, TMR can return ISR if the > broker > > > is > > > > > the leader of a partition. Otherwise, we can return an empty ISR. > For > > > > > partition reassignment, it would be useful to know the lag of each > > > > > follower. Again, the leader knows this info. We can probably > include > > > that > > > > > info in TMR as well. > > > > > > > > > > 300. I think it's probably reasonable to restrict AlterTopicRequest > > to > > > > > change only one thing at a time, i.e., either partitions, replicas, > > > > replica > > > > > assignment or config. > > > > > > > > > > Thanks, > > > > > > > > > > Jun > > > > > > > > > > On Mon, Apr 13, 2015 at 10:56 AM, Andrii Biletskyi < > > > > > andrii.bilets...@stealth.ly> wrote: > > > > > > > > > > > Jun, > > > > > > > > > > > > 404. Great, thanks! > > > > > > > > > > > > 500. If I understand correctly KAFKA-1367 says ISR part of TMR > may > > > > > > be inconsistent. If so, then I believe all admin commands but > > > > > describeTopic > > > > > > are not affected. Let me emphasize that it's about AdminClient > > > > > operations, > > > > > > not about Wire Protocol requests. What I mean: > > > > > > To verify AdminClient.createTopic we will need (consistent) > > 'topics' > > > > set > > > > > > from TMR (we don't need isr) > > > > > > To verify alterTopic - again, probably 'topics' and 'assigned > > > > replicas' + > > > > > > configs > > > > > > To verify deleteTopic - only 'topics' > > > > > > To verify preferredReplica - 'leader', 'assigned replicas' > > > > > > To verify reassignPartitions - 'assigned replicas' ? (I'm not > sure > > > > about > > > > > > this one) > > > > > > If everything above is correct, then AdminClient.describeTopic is > > the > > > > > only > > > > > > command under risk. We can actually workaround it - find out the > > > leader > > > > > > broker > > > > > > and ask TMR that leading broker to get up-to-date isr list. > > > > > > Bottom line: looks like 1367 is a separate issue, and is not a > > > blocker > > > > > for > > > > > > this > > > > > > KIP. I'm a bit concerned about adding new requests as a must-have > > > part > > > > > > of this KIP when we don't know what we want to include to those > > > > requests. > > > > > > > > > > > > Also, I'd like to write down the new AlterTopicRequest semantics > > (if > > > we > > > > > > decide > > > > > > to include replicas there and merge it with > > > ReassignPartitionsRequest) > > > > > > 300. AlterTopicRequest => [TopicName Partitions Replicas > > > > > ReplicaAssignment > > > > > > [AddedConfigEntry] [DeletedConfig]] > > > > > > The fields are resolved in this sequence: > > > > > > 1. Either partition or replicas is defined: > > > > > > ---1.1. ReplicaAssignment is not defined - generate automatic > > replica > > > > > > assignment > > > > > > for newly added partitions or for replicas parameter > > > > increased > > > > > > ---1.2. ReplicaAssignment is defined - increase topic partitions > if > > > > > > 'partitions' defined, > > > > > > reassign partitions according to ReplicaAssignment > > > > > > 2. Neither partition nor replicas is defined: > > > > > > ---2.1. ReplicaAssignment is defined - it's a reassign replicas > > > request > > > > > > ---2.2. ReplicaAssignment is not defined - just change topic > > configs > > > > > > 3. Config fields are handled always and independently from > > > > > > partitions+replicas/replicaAssingment > > > > > > A bit sophisticated, but should cover all cases. Another option - > > we > > > > can > > > > > > say you can define either partitions+replicas or > replicaAssignment. > > > > > > > > > > > > 300.5. There is also a new question related to AlterTopicRequest > - > > > > should > > > > > > we > > > > > > allow users multiple alter-topic instructions for one topic in > one > > > > batch? > > > > > > I think if we go this way, user will expect we optimize and group > > > > > requests > > > > > > for one topic, but it will add a lot of burden, especially taken > > into > > > > > > account > > > > > > async semantics of the AlterTopicRequest. I'd rather return some > > > error > > > > > > code, > > > > > > or ignore all but first. Thoughts? > > > > > > > > > > > > Thanks, > > > > > > Andrii Biletskyi > > > > > > > > > > > > On Mon, Apr 13, 2015 at 6:34 AM, Jun Rao <j...@confluent.io> > wrote: > > > > > > > > > > > > > Andrii, > > > > > > > > > > > > > > 404. Jay and I chatted a bit. We agreed to leave > > createTopicRequest > > > > as > > > > > > > async for now. > > > > > > > > > > > > > > There is another thing. > > > > > > > > > > > > > > 500. Currently, we have this issue (KAFKA-1367) that the ISR in > > the > > > > > > > metadata cache can be out of sync. The reason is that ISR is > > really > > > > > > > maintained at the leader. We can potentially add a new > > > > > BrokerMetaRequest, > > > > > > > which will return useful stats specific to a broker. Such stats > > can > > > > > > include > > > > > > > (1) for each partition whose leader is on this broker, the ISR > > and > > > > the > > > > > > lag > > > > > > > (in messages) for each of the followers, (2) space used per > > > > partition, > > > > > > (3) > > > > > > > remaining space per log dir (not sure how easy it is to get > this > > > > info). > > > > > > If > > > > > > > we have this new request, we can probably remove the ISR part > > from > > > > TMR > > > > > > v1. > > > > > > > Currently, the producer/consumer client don't really care about > > > ISR. > > > > > The > > > > > > > admin client will then issue BrokerMetaRequest to find out ISR > > and > > > > > other > > > > > > > stats. > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > Jun > > > > > > > > > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 12:10 PM, Andrii Biletskyi < > > > > > > > andrii.bilets...@stealth.ly> wrote: > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > A summary of our discussion: > > > > > > > > > > > > > > > > 201. Q: Cluster updates in backward compatible way. > > > > > > > > A: Add topicConfigs map property and change > > constructor, > > > > this > > > > > > > > shouldn't break Consumer/Producer since TMR is used in > > > > NetworkClient, > > > > > > > > not directly by Consumer/Producer. > > > > > > > > > > > > > > > > 300. Q: Can we merge AlterTopic and ReassignPartitions > > requests? > > > > > > > > A: It looks like in terms of Wire Protocol partition > > > > > > reassignment > > > > > > > > can > > > > > > > > be just an application of AlterTopicRequest. On the > AdminClient > > > > side > > > > > we > > > > > > > can > > > > > > > > split this into two separate methods, if needed. > > > > > > > > > > > > > > > > Some additional items that were added today: > > > > > > > > 400. Q: Do we need ListTopicsRequest, we can use TMR for this > > > > > purpose. > > > > > > > > A: The answer depends on whether we can leverage > > > ListTopics > > > > > in > > > > > > > > consumer/producer, because the only benefit of the ListTopics > > is > > > > > > > > performance > > > > > > > > optimization, otherwise it doesn't worth it. > > > > > > > > > > > > > > > > 401. Q: AdminClient.topicExists - do we need it? > > > > > > > > A: AdminClient.listTopics should be sufficient. > > > > > > > > > > > > > > > > 402. Review AdminClient API and use separate objects instead > of > > > > > > > collections > > > > > > > > for methods arguments / return results (e.g. preferredReplica > > > > accepts > > > > > > > > Map<String, List<Int>> > > > > > > > > might be better to add separate java object) > > > > > > > > > > > > > > > > 403. Error number in KIP-4 (100x). Currently there are no > > > dedicated > > > > > > > ranges > > > > > > > > for errors, we will probably continue doing it this way. > > > > > > > > > > > > > > > > 404. There were some concerns again about the asynchronous > > > > semantics > > > > > > > > of the admin requests. Jun and Jay to agree separately how we > > > want > > > > > > > > to handle it. > > > > > > > > > > > > > > > > Please add / correct me if I missed something. > > > > > > > > > > > > > > > > Thanks, > > > > > > > > Andrii Biletskyi > > > > > > > > > > > > > > > > On Tue, Apr 7, 2015 at 4:11 PM, Andrii Biletskyi < > > > > > > > > andrii.bilets...@stealth.ly> wrote: > > > > > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > > > > > I wasn't able to send email to our thread (it says we > > exceeded > > > > > > message > > > > > > > > > size limit :)). > > > > > > > > > So I'm starting the new one. > > > > > > > > > > > > > > > > > > > > > > > > > > > Jun, > > > > > > > > > > > > > > > > > > Thanks again for the review. Answering your comments: > > > > > > > > > > > > > > > > > > 201. I'm not sure I understand how can we evolve Cluster in > > > > > backward > > > > > > > > > compatible way. In my understanding topic configs are not > > > > returned > > > > > > > > > currently - > > > > > > > > > in TMR_V0. Thus we need to add new property in Cluster - > smth > > > > like > > > > > > > > > private final Map<String, List<ConfigEntry>> topicConfigs; > > > > > > > > > Which affects Cluster constructor, which is used in > > > > > > > MetadataResponse.java > > > > > > > > > - not sure whether we can change Cluster this way so it's > > > > backward > > > > > > > > > compatible, > > > > > > > > > I suppose - no. > > > > > > > > > Let me know if I'm missing something... > > > > > > > > > > > > > > > > > > 300. Hm, so you propose to give up ReassignPartition as a > > > > separate > > > > > > > > command? > > > > > > > > > That's interesting, let's discuss it today in detail. > > > > > > > > > Two small points here: 1) afaik currently > replica-assignment > > > > > argument > > > > > > > in > > > > > > > > > alter-topic > > > > > > > > > (from TopicCommand) doesn't reassign partitions, it lets > > users > > > > > > specify > > > > > > > > > replica > > > > > > > > > assignment for newly added partition > (AddPartitionsListener) > > 2) > > > > > > > > > ReassignPartitions > > > > > > > > > command involves a little bit more than just changing > replica > > > > > > > assignment > > > > > > > > > in zk. > > > > > > > > > People are struggling with partition reassignment so I > think > > > it's > > > > > > good > > > > > > > to > > > > > > > > > have explicit > > > > > > > > > request for it so we can handle it independently, also as > > > > mentioned > > > > > > > > > earlier we'll > > > > > > > > > probably add in future some better status check procedure > for > > > > this > > > > > > > > > long-running > > > > > > > > > request. > > > > > > > > > > > > > > > > > > 301. Good point. We also agreed to use clientId as an > > > identifier > > > > > for > > > > > > > the > > > > > > > > > requester - > > > > > > > > > whether it's a producer client or admin. I think we can go > > with > > > > > -1/-1 > > > > > > > > > approach. > > > > > > > > > > > > > > > > > > 302. Again, as said above replica-assignment in alter-topic > > > > doesn't > > > > > > > > change > > > > > > > > > replica assignment of the existing partitions. But we can > > think > > > > > about > > > > > > > it > > > > > > > > > in general - > > > > > > > > > how can we change topic replication factor? The easy way - > we > > > > don't > > > > > > > need > > > > > > > > > it, > > > > > > > > > we can use reassign partitions. Not sure whether we want to > > add > > > > > > special > > > > > > > > > logic > > > > > > > > > to treat this case... > > > > > > > > > > > > > > > > > > 303.1. Okay, sure, I'll generalize topicExists(). > > > > > > > > > 303.2. I think, yes, we need separate verify methods as a > > > status > > > > > > check > > > > > > > > > procedure, > > > > > > > > > because respective requests are long running, and CLI user > > > > > > potentially > > > > > > > > > will > > > > > > > > > asynchronously call reassign-partitions, do other stuff > (e.g. > > > > > create > > > > > > > > > topics) periodically > > > > > > > > > checking status of the partition reassignment. Anyway we'll > > > have > > > > to > > > > > > > > > implement this logic > > > > > > > > > because it's a criterion of the completed Future of the > > > reassign > > > > > > > > > partitions async > > > > > > > > > call, we'll to have make those methods just public. > > > > > > > > > 303.3. If preferredReplica returns Future<Map<String, > > Errors>> > > > > than > > > > > > > what > > > > > > > > > is an error > > > > > > > > > in terms of preferred replica leader election? As I > > understand > > > we > > > > > can > > > > > > > > only > > > > > > > > > check > > > > > > > > > whether it has succeeded (leader == AR.head) or not _yet_. > > > > > > > > > > > > > > > > > > 304.1. Sure, let's add timeout to reassign/preferred > replica. > > > > > > > > > 304.2. This can be finalized after we discuss 300. > > > > > > > > > > > > > > > > > > 305. Misprints - thanks, fixed. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > Andrii Biletskyi > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >