Given that we are very close to the release, if we are changing the Metadata cache + topic deletion logic, I'd like a good number of system tests to appear with the patch.
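For example, one end-to-end check could trigger a delete the same way the topics script does (by creating a child znode under /admin/delete_topics) and then watch the metadata. A rough Java sketch, assuming a local broker and ZooKeeper and an existing topic named test-topic; this is not an actual test from the patch:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class DeleteTopicMetadataCheck {
        public static void main(String[] args) throws Exception {
            // Mark the topic for deletion the same way the topics script does:
            // by creating a child znode under /admin/delete_topics.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            zk.create("/admin/delete_topics/test-topic", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Today the topic simply vanishes from metadata once deletion
                // completes; with the proposed format this loop would instead
                // assert on the marked-for-deletion flag.
                while (consumer.listTopics().containsKey("test-topic")) {
                    Thread.sleep(100);
                }
            }
            zk.close();
        }
    }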
On Thu, Apr 7, 2016 at 10:53 AM, Gwen Shapira <g...@confluent.io> wrote:

> This will change some logic though, right?
>
> IIRC, right now produce/fetch requests to marked-for-deletion topics fail
> because the topics are simply not around. You get a generic "doesn't
> exist" error. If we keep these topics and add a flag, we'll need to find
> all the places with this implicit logic and correct for it.
>
> And since our tests for topic deletion are clearly inadequate... I'm a
> bit scared :)
>
> Gwen
>
> On Thu, Apr 7, 2016 at 10:34 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Hmm, I think since in the original protocol the metadata response did
>> not carry information for "marked for deletion" topics, we wanted to
>> remove such topics from the response by cleaning the metadata cache as
>> soon as a topic is marked for deletion.
>>
>> With the new format, I think it is OK to delay the metadata cleaning.
>>
>> Guozhang
>>
>> On Thu, Apr 7, 2016 at 8:35 AM, Grant Henke <ghe...@cloudera.com> wrote:
>>
>> > I am testing the marked for deletion flag in the metadata and ran into
>> > some challenges.
>> >
>> > It turns out that as soon as a topic is marked for deletion it may be
>> > purged from the metadata cache. This means that Metadata responses
>> > can't/don't return the topic, even though the topic may still exist if
>> > it is not yet ready to be completely deleted or is in the process of
>> > being deleted.
>> >
>> > This poses a challenge because a user has no way to tell whether a
>> > topic still exists and is marked for deletion, other than to try to
>> > recreate it and see a failure. I could change the logic to no longer
>> > purge a topic from the cache until it is completely deleted, but I am
>> > not sure whether that would negatively impact the clients in some way.
>> > Does anyone have enough background to say?
>> >
>> > I will dig into this a bit more today, but wanted to throw this out
>> > there for some early feedback.
>> >
>> > Thank you,
>> > Grant
>> >
>> > On Tue, Apr 5, 2016 at 8:02 PM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > 5. You will return no error and 4,5,6 as replicas. The response also
>> > > includes a list of live brokers, so the client can figure out that 5
>> > > is not live directly, without relying on the error code.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Tue, Apr 5, 2016 at 5:05 PM, Grant Henke <ghe...@cloudera.com>
>> > > wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > See my responses below:
>> > > >
>> > > > > 2. The issues that I was thinking of are the following. (a) Say
>> > > > > the controller has topic deletion disabled and a topic deletion
>> > > > > request is submitted to ZK. In this case, the controller will
>> > > > > ignore the request; however, the broker may pick up the topic
>> > > > > deletion marker in a transient window. (b) Suppose that a topic
>> > > > > is deleted and then recreated immediately. It is possible for a
>> > > > > broker to see the newly created topic and then the previous
>> > > > > topic deletion marker in a transient window. Thinking about this
>> > > > > a bit more, both seem to be transient, so it may not be a big
>> > > > > concern, and I am ok with this as long as the interim solution
>> > > > > is not too complicated. Another thing to think through: if a
>> > > > > topic is marked for deletion, do we still return the
>> > > > > partition-level metadata?
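(To make Jun's point 5 above concrete: with no partition-level error code, a client can still work out each replica's liveness from the live-broker list that every metadata response already carries. A rough, self-contained Java sketch with illustrative values:)

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ReplicaLiveness {
        public static void main(String[] args) {
            // From the broker list in the metadata response: brokers 4 and 6 are live.
            Set<Integer> liveBrokerIds = new HashSet<>(Arrays.asList(4, 6));
            // From the partition metadata: the full assigned replica list.
            List<Integer> assignedReplicas = Arrays.asList(4, 5, 6);

            for (int replica : assignedReplicas) {
                // Replica 5 is absent from the broker list, so the client can
                // tell it is down without any REPLICA_NOT_AVAILABLE error code.
                System.out.println("replica " + replica + " live: "
                        + liveBrokerIds.contains(replica));
            }
        }
    }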
>> > > > I am not changing anything about the metadata content, only adding
>> > > > a boolean based on the marked for deletion flag in zookeeper. This
>> > > > maintains the same approach that the topics script uses today. I do
>> > > > think delete improvements should be considered/reviewed. The goal
>> > > > here is to allow the broker to report the value that it sees, which
>> > > > is the value in zookeeper.
>> > > >
>> > > > > 5. The issue is the following. If you have a partition with 3
>> > > > > replicas 4,5,6, the leader is on replica 4, and replica 5 is
>> > > > > down, then currently the broker will send a REPLICA_NOT_AVAILABLE
>> > > > > error code and only replicas 4,6 in the assigned replicas. It's
>> > > > > more intuitive to send no error code and 4,5,6 in the assigned
>> > > > > replicas in this case.
>> > > >
>> > > > Should the list with no error code just be 4,6, since 5 is not
>> > > > available?
>> > > >
>> > > > On Mon, Apr 4, 2016 at 1:34 PM, Jun Rao <j...@confluent.io> wrote:
>> > > >
>> > > > > Grant,
>> > > > >
>> > > > > 2. The issues that I was thinking of are the following. (a) Say
>> > > > > the controller has topic deletion disabled and a topic deletion
>> > > > > request is submitted to ZK. In this case, the controller will
>> > > > > ignore the request; however, the broker may pick up the topic
>> > > > > deletion marker in a transient window. (b) Suppose that a topic
>> > > > > is deleted and then recreated immediately. It is possible for a
>> > > > > broker to see the newly created topic and then the previous
>> > > > > topic deletion marker in a transient window. Thinking about this
>> > > > > a bit more, both seem to be transient, so it may not be a big
>> > > > > concern, and I am ok with this as long as the interim solution
>> > > > > is not too complicated. Another thing to think through: if a
>> > > > > topic is marked for deletion, do we still return the
>> > > > > partition-level metadata?
>> > > > >
>> > > > > 3. Your explanation on controller id seems reasonable to me.
>> > > > >
>> > > > > 5. The issue is the following. If you have a partition with 3
>> > > > > replicas 4,5,6, the leader is on replica 4, and replica 5 is
>> > > > > down, then currently the broker will send a REPLICA_NOT_AVAILABLE
>> > > > > error code and only replicas 4,6 in the assigned replicas. It's
>> > > > > more intuitive to send no error code and 4,5,6 in the assigned
>> > > > > replicas in this case.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Mon, Apr 4, 2016 at 8:33 AM, Grant Henke <ghe...@cloudera.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Jun,
>> > > > > >
>> > > > > > Please see my responses below:
>> > > > > >
>> > > > > > > Hmm, I am not sure about the listener approach. It ignores
>> > > > > > > configs like enable.topic.deletion and also opens the door
>> > > > > > > for potential ordering issues, since now there are two
>> > > > > > > separate paths for propagating the metadata to the brokers.
>> > > > > >
>> > > > > > This mechanism is very similar to how deletes are tracked on
>> > > > > > the controller itself. It is also the same way ACLs are tracked
>> > > > > > on brokers in the default implementation.
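(For concreteness, the kind of listener being described might look roughly like this against the plain ZooKeeper client. This is a sketch only, with error handling and watcher re-registration on session events omitted; it is not the actual patch:)

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DeleteTopicsCache implements Watcher {
        private final ZooKeeper zk;
        private volatile Set<String> markedForDeletion = Collections.emptySet();

        public DeleteTopicsCache(ZooKeeper zk) throws Exception {
            this.zk = zk;
            refresh();
        }

        // getChildren re-registers this watcher, so the cache hears about
        // the next change to /admin/delete_topics as well.
        private void refresh() throws KeeperException, InterruptedException {
            markedForDeletion =
                    new HashSet<>(zk.getChildren("/admin/delete_topics", this));
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeChildrenChanged) {
                try {
                    refresh();
                } catch (Exception e) {
                    // a real implementation would log and retry here
                }
            }
        }

        public boolean isMarkedForDeletion(String topic) {
            return markedForDeletion.contains(topic);
        }
    }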
>> > > > > > I am not sure I understand what ordering issue there could be.
>> > > > > > This is used to report which topics are marked for deletion,
>> > > > > > which today has no dependency on enable.topic.deletion. I agree
>> > > > > > that the delete mechanism in Kafka has a lot of room for
>> > > > > > improvement, but the goal of this change is just to enable
>> > > > > > reporting it to the user, not to fix/improve existing issues.
>> > > > > > If you have an alternate approach that does not require major
>> > > > > > changes to the controller code, I would be open to
>> > > > > > investigating it.
>> > > > > >
>> > > > > > > Could we just leave out markedForDeletion for now? In the
>> > > > > > > common case, if a topic is deleted, it will only be in
>> > > > > > > markedForDeletion state for a few milliseconds anyway.
>> > > > > >
>> > > > > > I don't think we should leave it out. The point of these
>> > > > > > changes is to prevent a user from needing to talk directly to
>> > > > > > zookeeper. We need a way for a user to see whether a topic has
>> > > > > > been marked for deletion. Given the issues with the current
>> > > > > > delete implementation, it's fairly common for a topic to remain
>> > > > > > marked for deletion for quite some time.
>> > > > > >
>> > > > > > > Yes, for those usages, it just seems a bit weird for the
>> > > > > > > client to issue a MetadataRequest to get the controller info
>> > > > > > > since it doesn't need any topic metadata.
>> > > > > >
>> > > > > > Why does this seem weird? The MetadataRequest is the request
>> > > > > > used to discover the cluster and metadata about that cluster,
>> > > > > > regardless of the topics you are interested in, if any. In
>> > > > > > fact, a big motivation for the change to allow requesting "no
>> > > > > > topics" is that the existing producer and consumer often want
>> > > > > > to learn about the cluster without asking for topic metadata,
>> > > > > > and today that means they request all topics.
>> > > > > >
>> > > > > > > 5. The issue is that for a client, when handling a metadata
>> > > > > > > response, the natural logic is: if there is any error in the
>> > > > > > > response, go to the error-handling path (e.g., back off and
>> > > > > > > refresh metadata); otherwise, get the leader info and
>> > > > > > > initiate a request to the leader if the leader is available.
>> > > > > > > If you look at the current logic in
>> > > > > > > MetadataCache.getPartitionMetadata(), if an assigned replica
>> > > > > > > is not alive, we will send a REPLICA_NOT_AVAILABLE error code
>> > > > > > > in the response. If the client follows the above logic, it
>> > > > > > > will keep doing the error handling even though there is
>> > > > > > > nothing wrong with the leader. A better behavior is to simply
>> > > > > > > return the list of replica ids with no error code in this
>> > > > > > > case.
>> > > > > >
>> > > > > > To be sure I understand this correctly: instead of returning
>> > > > > > the complete list of replicas, including the ones that errored
>> > > > > > as unavailable, you are suggesting to drop the unavailable
>> > > > > > ones, return just the replicas with no errors, and return no
>> > > > > > error code on the partition. Is that correct?
>> > > > > >
>> > > > > > Under what scenario does the MetadataCache have a replica that
>> > > > > > is not available?
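(Jun's "natural logic" above, sketched as self-contained, illustrative Java — not the actual client code; the types are minimal stand-ins for the response fields:)

    public class MetadataHandlingSketch {
        enum Error { NONE, REPLICA_NOT_AVAILABLE }

        // Minimal stand-in for partition metadata in a response.
        static class PartitionMetadata {
            Error error = Error.REPLICA_NOT_AVAILABLE;
            Integer leader = 4; // the leader is actually fine
        }

        public static void main(String[] args) {
            PartitionMetadata pm = new PartitionMetadata();
            if (pm.error != Error.NONE) {
                // Taken today for REPLICA_NOT_AVAILABLE: the client backs off
                // and refreshes metadata even though the leader is healthy.
                System.out.println("error path: back off and refresh metadata");
            } else if (pm.leader != null) {
                System.out.println("send request to leader " + pm.leader);
            }
        }
    }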
>> > > > > > Thanks,
>> > > > > > Grant
>> > > > > >
>> > > > > > On Mon, Apr 4, 2016 at 12:25 AM, Jun Rao <j...@confluent.io>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Grant,
>> > > > > > >
>> > > > > > > 2. Hmm, I am not sure about the listener approach. It ignores
>> > > > > > > configs like enable.topic.deletion and also opens the door
>> > > > > > > for potential ordering issues, since now there are two
>> > > > > > > separate paths for propagating the metadata to the brokers.
>> > > > > > > Could we just leave out markedForDeletion for now? In the
>> > > > > > > common case, if a topic is deleted, it will only be in
>> > > > > > > markedForDeletion state for a few milliseconds anyway.
>> > > > > > >
>> > > > > > > 3. Yes, for those usages, it just seems a bit weird for the
>> > > > > > > client to issue a MetadataRequest to get the controller info
>> > > > > > > since it doesn't need any topic metadata.
>> > > > > > >
>> > > > > > > 5. The issue is that for a client, when handling a metadata
>> > > > > > > response, the natural logic is: if there is any error in the
>> > > > > > > response, go to the error-handling path (e.g., back off and
>> > > > > > > refresh metadata); otherwise, get the leader info and
>> > > > > > > initiate a request to the leader if the leader is available.
>> > > > > > > If you look at the current logic in
>> > > > > > > MetadataCache.getPartitionMetadata(), if an assigned replica
>> > > > > > > is not alive, we will send a REPLICA_NOT_AVAILABLE error code
>> > > > > > > in the response. If the client follows the above logic, it
>> > > > > > > will keep doing the error handling even though there is
>> > > > > > > nothing wrong with the leader. A better behavior is to simply
>> > > > > > > return the list of replica ids with no error code in this
>> > > > > > > case.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > > On Sun, Apr 3, 2016 at 6:29 PM, Grant Henke
>> > > > > > > <ghe...@cloudera.com> wrote:
>> > > > > > >
>> > > > > > > > Responding to a few of the other comments:
>> > > > > > > >
>> > > > > > > > > it seems that you propagated the topic deletion marker by
>> > > > > > > > > having the replicaManager read from ZK directly. It seems
>> > > > > > > > > that it would be simpler/more consistent if the
>> > > > > > > > > controller propagated that information directly through
>> > > > > > > > > UpdateMetadataRequest.
>> > > > > > > >
>> > > > > > > > I was told that I should not try to modify controller logic
>> > > > > > > > with KIP-4 changes. It was indicated that a larger
>> > > > > > > > controller rewrite and testing effort was planned, and
>> > > > > > > > those changes should be considered then.
>> > > > > > > > Since marking a topic for deletion doesn't flow through
>> > > > > > > > the controller, and therefore the UpdateMetadataRequest, it
>> > > > > > > > would take quite a bit of change. We would need to trigger
>> > > > > > > > a new UpdateMetadataRequest every time a new topic is
>> > > > > > > > marked for deletion.
>> > > > > > > >
>> > > > > > > > Instead I added a listener to maintain a cache of the topic
>> > > > > > > > deletion znodes in the ReplicaManager, where the existing
>> > > > > > > > UpdateMetadataRequests are handled. This makes it easy to
>> > > > > > > > swap out later, once the data is part of that request, and
>> > > > > > > > has minimal impact in the meantime.
>> > > > > > > >
>> > > > > > > > > Could you add a description of how the controller id will
>> > > > > > > > > be used in the client?
>> > > > > > > >
>> > > > > > > > I will add it to the wiki. Today metrics are the only way
>> > > > > > > > to access this piece of data. It is useful information
>> > > > > > > > about the cluster for many reasons. Having programmatic
>> > > > > > > > access to identify the controller is helpful for
>> > > > > > > > automation. For example, it can be used during
>> > > > > > > > rolling-restart logic to shut down the controller last, to
>> > > > > > > > prevent multiple failovers. Beyond automation, it can be
>> > > > > > > > leveraged in KIP-4 to route admin requests to the
>> > > > > > > > controller broker.
>> > > > > > > >
>> > > > > > > > > We had a weird semantic in version 0 of MetadataRequest.
>> > > > > > > > > If a replica is not live, but the leader is live, we
>> > > > > > > > > return a ReplicaNotAvailableException error in the
>> > > > > > > > > partition metadata. This makes it a bit confusing for the
>> > > > > > > > > client to parse, since it has to first check whether the
>> > > > > > > > > leader is available before checking the error code. We
>> > > > > > > > > were thinking of changing that behavior the next time we
>> > > > > > > > > bump the version of MetadataRequest. Now that time has
>> > > > > > > > > come; could you include that in the proposal?
>> > > > > > > >
>> > > > > > > > I am not sure I completely follow the issue and the
>> > > > > > > > requested change. Could you point me to the discussion?
>> > > > > > > >
>> > > > > > > > Thank you,
>> > > > > > > > Grant
>> > > > > > > >
>> > > > > > > > On Sun, Apr 3, 2016 at 8:09 PM, Grant Henke
>> > > > > > > > <ghe...@cloudera.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Jun and Ismael,
>> > > > > > > > >
>> > > > > > > > > Initially I had 2 booleans used to indicate whether a
>> > > > > > > > > topic was internal and whether a topic was marked for
>> > > > > > > > > deletion. To save space on large deployments, Ismael
>> > > > > > > > > suggested I break out the internal topics and deleted
>> > > > > > > > > topics into their own lists. The idea was that instead of
>> > > > > > > > > 2 bytes added per topic, in the general case the lists
>> > > > > > > > > would be empty.
>> > > > > > > > > Even in those lists I still only return topics that were
>> > > > > > > > > requested. In fact, on the client side they are just used
>> > > > > > > > > to translate back to booleans. I do prefer the booleans
>> > > > > > > > > from an expressiveness standpoint, but I am not strongly
>> > > > > > > > > opinionated on the structure.
>> > > > > > > > >
>> > > > > > > > > Thank you,
>> > > > > > > > > Grant
>> > > > > > > > >
>> > > > > > > > > On Sun, Apr 3, 2016 at 8:00 PM, Ismael Juma
>> > > > > > > > > <ism...@juma.me.uk> wrote:
>> > > > > > > > >
>> > > > > > > > >> Hi Jun,
>> > > > > > > > >>
>> > > > > > > > >> A couple of comments inline.
>> > > > > > > > >>
>> > > > > > > > >> On Sun, Apr 3, 2016 at 5:55 PM, Jun Rao
>> > > > > > > > >> <j...@confluent.io> wrote:
>> > > > > > > > >>
>> > > > > > > > >> > 1. It seems a bit weird to return just a list of
>> > > > > > > > >> > internal topics w/o the corresponding metadata. It
>> > > > > > > > >> > also seems a bit weird to return the internal topics
>> > > > > > > > >> > even if the client doesn't ask for them.
>> > > > > > > > >>
>> > > > > > > > >> Good point.
>> > > > > > > > >>
>> > > > > > > > >> > Would it be better to just add a flag in
>> > > > > > > > >> > topic_metadata to indicate whether it's an internal
>> > > > > > > > >> > topic or not, and only include the internal topics
>> > > > > > > > >> > when they are asked for (or when all topics are
>> > > > > > > > >> > requested)?
>> > > > > > > > >>
>> > > > > > > > >> The disadvantage of this is that we are adding one byte
>> > > > > > > > >> per topic even though we have a very small number of
>> > > > > > > > >> internal topics (currently a single internal topic). It
>> > > > > > > > >> seems a bit wasteful, particularly so when using regex
>> > > > > > > > >> subscriptions (since we have to retrieve all topics in
>> > > > > > > > >> that case).
>> > > > > > > > >>
>> > > > > > > > >> > 2. A similar comment on topics_marked_for_deletion.
>> > > > > > > > >> > Would it be better to only return them when asked
>> > > > > > > > >> > for, and just return a new TopicDeleted error code in
>> > > > > > > > >> > topic_metadata?
>> > > > > > > > >>
>> > > > > > > > >> I agree that this seems better.
>> > > > > > > > >>
>> > > > > > > > >> Ismael
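(A closing illustration of the client-side translation Grant describes — compact lists on the wire, booleans at the API. The names are illustrative, not the final protocol:)

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TopicFlagsSketch {
        public static void main(String[] args) {
            // Stand-ins for the two lists carried in the metadata response;
            // in the common case both are empty, costing almost nothing.
            Set<String> internalTopics =
                    new HashSet<>(Arrays.asList("__consumer_offsets"));
            Set<String> topicsMarkedForDeletion =
                    new HashSet<>(Arrays.asList("old-topic"));

            // On the client side the lists translate back into per-topic booleans.
            String topic = "old-topic";
            boolean internal = internalTopics.contains(topic);
            boolean markedForDeletion = topicsMarkedForDeletion.contains(topic);
            System.out.println(topic + " internal=" + internal
                    + " markedForDeletion=" + markedForDeletion);
        }
    }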