Hmm, I think it is because in the original protocol the metadata response has
no way to convey "marked for deletion" for a topic, so we stop returning that
topic in the response by cleaning it out of the metadata cache as soon as it
is marked for deletion.

With the new format, I think it is OK to delay the metadata cleaning.
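
To make "delaying the cleaning" concrete, something like the sketch below is
what I have in mind (a rough sketch only, with made-up names, and in Java just
for illustration; the broker's actual MetadataCache is Scala): keep the cache
entry and only flag it, then purge once the delete really completes.

import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: keep a topic that is marked for deletion in the cache
// instead of purging it immediately, so metadata responses can report the flag.
class DelayedCleanupMetadataCache {

    static final class TopicEntry {
        final List<Integer> partitionLeaders; // simplified partition state
        volatile boolean markedForDeletion = false;
        TopicEntry(List<Integer> partitionLeaders) { this.partitionLeaders = partitionLeaders; }
    }

    private final Map<String, TopicEntry> cache = new ConcurrentHashMap<>();

    void addTopic(String topic, List<Integer> partitionLeaders) {
        cache.put(topic, new TopicEntry(partitionLeaders));
    }

    // Called when the delete marker is observed: flag, do not purge.
    void markForDeletion(String topic) {
        TopicEntry e = cache.get(topic);
        if (e != null) e.markedForDeletion = true;
    }

    // Called only once the topic is completely gone from the cluster.
    void completeDeletion(String topic) {
        cache.remove(topic);
    }

    // Metadata responses can now include the topic plus its deletion flag.
    Optional<Boolean> isMarkedForDeletion(String topic) {
        return Optional.ofNullable(cache.get(topic)).map(e -> e.markedForDeletion);
    }
}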

Guozhang

On Thu, Apr 7, 2016 at 8:35 AM, Grant Henke <ghe...@cloudera.com> wrote:

> I am testing the marked for deletion flag in the metadata and ran into some
> challenges.
>
> It turns out that as soon as a topic is marked for deletion it may be
> purged from the metadata cache. This means that Metadata responses
> can't/don't return the topic, even though the topic may still exist if it's
> not ready to be completely deleted or is in the process of being deleted.
>
> This poses a challenge because a user would have no way to tell if a topic
> still exists, and is marked for deletion, other than to try to recreate it
> and see a failure. I could change the logic to no longer purge a topic
> from the cache until it's completely deleted, but I am not sure if that
> would negatively impact the clients in some way. Does anyone have enough
> background to say?
>
> I will dig into this a bit more today, but wanted to throw this out there
> for some early feedback.
>
> Thank you,
> Grant
>
>
> On Tue, Apr 5, 2016 at 8:02 PM, Jun Rao <j...@confluent.io> wrote:
>
> > 5. You will return no error and 4,5,6 as replicas. The response also
> > includes a list of live brokers. So the client can figure out 5 is not
> > live directly w/o relying on the error code.
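> >
> > Roughly, the client-side check could look like this (a hypothetical
> > helper, working only with ids that are already in the metadata response):
> >
> > import java.util.List;
> > import java.util.Set;
> > import java.util.TreeSet;
> >
> > class ReplicaLiveness {
> >     // replicaIds: assigned replicas from the partition metadata, e.g. 4,5,6
> >     // liveBrokerIds: ids of the brokers listed in the same response, e.g. 4,6
> >     static Set<Integer> offlineReplicas(List<Integer> replicaIds,
> >                                         Set<Integer> liveBrokerIds) {
> >         Set<Integer> offline = new TreeSet<>(replicaIds);
> >         offline.removeAll(liveBrokerIds); // => {5}, with no error code needed
> >         return offline;
> >     }
> > }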
> >
> > Thanks,
> >
> > Jun
> >
> > On Tue, Apr 5, 2016 at 5:05 PM, Grant Henke <ghe...@cloudera.com> wrote:
> >
> > > Hi Jun,
> > >
> > > See my responses below:
> > >
> > > > 2. The issues that I was thinking are the following. (a) Say the
> > > > controller has topic deletion disabled and a topic deletion request
> > > > is submitted to ZK. In this case, the controller will ignore this
> > > > request. However, the broker may pick up the topic deletion marker in
> > > > a transient window. (b) Suppose that a topic is deleted and then
> > > > recreated immediately. It is possible for a broker to see the newly
> > > > created topic and then the previous topic deletion marker in a
> > > > transient window. Thinking about this a bit more. Both seem to be
> > > > transient. So, it may not be a big concern. So, I am ok with this as
> > > > long as the interim solution is not too complicated. Another thing to
> > > > think through. If a topic is marked for deletion, do we still return
> > > > the partition level metadata?
> > >
> > >
> > > I am not changing anything about the metadata content, only adding a
> > > boolean based on the marked for deletion flag in zookeeper. This is
> > > maintaining the same method that the topics script does today. I do
> > > think delete improvements should be considered/reviewed. The goal here
> > > is to allow the broker to report the value that it sees, which is the
> > > value in zookeeper.
> > >
> > > > 5. The issue is the following. If you have a partition with 3
> > > > replicas 4,5,6, leader is on replica 4 and replica 5 is down.
> > > > Currently, the broker will send a REPLICA_NOT_AVAILABLE error code and
> > > > only replicas 4,6 in the assigned replicas. It's more intuitive to
> > > > send no error code and 4,5,6 in the assigned replicas in this case.
> > >
> > >
> > > Should the list with no error code just be 4,6 since 5 is not available?
> > >
> > >
> > > On Mon, Apr 4, 2016 at 1:34 PM, Jun Rao <j...@confluent.io> wrote:
> > >
> > > > Grant,
> > > >
> > > > 2. The issues that I was thinking are the following. (a) Say the
> > > > controller has topic deletion disabled and a topic deletion request
> > > > is submitted to ZK. In this case, the controller will ignore this
> > > > request. However, the broker may pick up the topic deletion marker in
> > > > a transient window. (b) Suppose that a topic is deleted and then
> > > > recreated immediately. It is possible for a broker to see the newly
> > > > created topic and then the previous topic deletion marker in a
> > > > transient window. Thinking about this a bit more. Both seem to be
> > > > transient. So, it may not be a big concern. So, I am ok with this as
> > > > long as the interim solution is not too complicated. Another thing to
> > > > think through. If a topic is marked for deletion, do we still return
> > > > the partition level metadata?
> > > >
> > > > 3. Your explanation on controller id seems reasonable to me.
> > > >
> > > > 5. The issue is the following. If you have a partition with 3
> > > > replicas 4,5,6, leader is on replica 4 and replica 5 is down.
> > > > Currently, the broker will send a REPLICA_NOT_AVAILABLE error code and
> > > > only replicas 4,6 in the assigned replicas. It's more intuitive to
> > > > send no error code and 4,5,6 in the assigned replicas in this case.
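> > > >
> > > > Roughly, the intended behavior would be something like this (a sketch
> > > > with made-up names, in Java just for illustration; the actual
> > > > MetadataCache.getPartitionMetadata() is Scala):
> > > >
> > > > import java.util.List;
> > > > import java.util.Set;
> > > >
> > > > class PartitionMetadataSketch {
> > > >     // 0 = no error, 5 = LeaderNotAvailable in the protocol
> > > >     static final short NONE = 0;
> > > >     static final short LEADER_NOT_AVAILABLE = 5;
> > > >
> > > >     // Return an error only when the leader itself is unavailable.
> > > >     static short errorFor(Integer leader, Set<Integer> liveBrokers) {
> > > >         return (leader != null && liveBrokers.contains(leader))
> > > >                 ? NONE
> > > >                 : LEADER_NOT_AVAILABLE;
> > > >     }
> > > >
> > > >     // Always return the full assigned replica list (e.g. 4,5,6), even
> > > >     // if some replicas (e.g. 5) are not alive.
> > > >     static List<Integer> replicasFor(List<Integer> assignedReplicas) {
> > > >         return assignedReplicas; // no filtering of non-live replicas
> > > >     }
> > > > }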
> > > >
> > > > Thanks,
> > > >
> > > > Jun
> > > >
> > > > On Mon, Apr 4, 2016 at 8:33 AM, Grant Henke <ghe...@cloudera.com> wrote:
> > > >
> > > > > Hi Jun,
> > > > >
> > > > > Please see my responses below:
> > > > >
> > > > > > Hmm, I am not sure about the listener approach. It ignores configs
> > > > > > like enable.topic.deletion and also opens the door for potential
> > > > > > ordering issues since now there are two separate paths for
> > > > > > propagating the metadata to the brokers.
> > > > >
> > > > >
> > > > > This mechanism is very similar to how deletes are tracked on the
> > > > > controller itself. It is also the same way ACLs are tracked on
> > > > > brokers in the default implementation. I am not sure I understand
> > > > > what ordering issue there could be. This is used to report what
> > > > > topics are marked for deletion, which today has no dependency on
> > > > > enable.topic.deletion. I agree that the delete mechanism in Kafka has
> > > > > a lot of room for improvement, but the goal in this change is just to
> > > > > enable reporting it to the user, not to fix/improve existing issues.
> > > > > If you have an alternate approach that does not require major changes
> > > > > to the controller code, I would be open to investigating it.
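> > > > >
> > > > > To make the mechanism concrete, the listener is conceptually just a
> > > > > watch on the delete-topics path that keeps a local set up to date.
> > > > > A rough sketch with hypothetical names, using the raw ZooKeeper
> > > > > client (the actual patch goes through Kafka's ZK utilities):
> > > > >
> > > > > import java.util.Set;
> > > > > import java.util.concurrent.ConcurrentHashMap;
> > > > > import org.apache.zookeeper.KeeperException;
> > > > > import org.apache.zookeeper.ZooKeeper;
> > > > >
> > > > > class DeleteTopicsListener {
> > > > >     private static final String DELETE_TOPICS_PATH = "/admin/delete_topics";
> > > > >     private final ZooKeeper zk;
> > > > >     private final Set<String> topicsMarkedForDeletion =
> > > > >             ConcurrentHashMap.newKeySet();
> > > > >
> > > > >     DeleteTopicsListener(ZooKeeper zk) { this.zk = zk; }
> > > > >
> > > > >     // Re-register the watch whenever the children change, mirroring
> > > > >     // how the controller (and the default ACL store) watch their paths.
> > > > >     void refresh() throws KeeperException, InterruptedException {
> > > > >         topicsMarkedForDeletion.clear();
> > > > >         topicsMarkedForDeletion.addAll(
> > > > >                 zk.getChildren(DELETE_TOPICS_PATH, event -> {
> > > > >                     try { refresh(); } catch (Exception ignored) { }
> > > > >                 }));
> > > > >     }
> > > > >
> > > > >     boolean isMarkedForDeletion(String topic) {
> > > > >         return topicsMarkedForDeletion.contains(topic);
> > > > >     }
> > > > > }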
> > > > >
> > > > > > Could we just leave out markedForDeletion for now? In the common
> > > > > > case, if a topic is deleted, it will only be in markedForDeletion
> > > > > > state for a few milliseconds anyway.
> > > > >
> > > > >
> > > > > I don't think we should leave it out. The point of these changes is
> > > > > to prevent a user from needing to talk directly to zookeeper. We need
> > > > > a way for a user to see if a topic has been marked for deletion.
> > > > > Given the issues with the current delete implementation, it's fairly
> > > > > common for a topic to remain marked for deletion for quite some time.
> > > > >
> > > > > > Yes, for those usages, it just seems a bit weird for the client to
> > > > > > issue a MetadataRequest to get the controller info since it doesn't
> > > > > > need any topic metadata.
> > > > >
> > > > >
> > > > > Why does this seem weird? The MetadataRequest is the request used to
> > > > > discover the cluster and metadata about that cluster, regardless of
> > > > > the topics you are interested in, if any. In fact, a big motivation
> > > > > for the change to allow requesting "no topics" is because the
> > > > > existing producer and consumer often want to learn about the cluster
> > > > > without asking for topic metadata, and today that means that they
> > > > > request all topics.
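> > > > >
> > > > > As a sketch of the proposed semantics (hypothetical helper names,
> > > > > assuming an empty topic list means "no topics" and a null list means
> > > > > "all topics"):
> > > > >
> > > > > import java.util.Collections;
> > > > > import java.util.List;
> > > > >
> > > > > class MetadataRequestSketch {
> > > > >     // null => all topics, empty list => no topics (cluster info only)
> > > > >     final List<String> topics;
> > > > >
> > > > >     private MetadataRequestSketch(List<String> topics) { this.topics = topics; }
> > > > >
> > > > >     // Cluster discovery only: brokers + controller id, no topic metadata.
> > > > >     static MetadataRequestSketch clusterOnly() {
> > > > >         return new MetadataRequestSketch(Collections.emptyList());
> > > > >     }
> > > > >
> > > > >     // What producers/consumers effectively do today just to learn the cluster.
> > > > >     static MetadataRequestSketch allTopics() {
> > > > >         return new MetadataRequestSketch(null);
> > > > >     }
> > > > > }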
> > > > >
> > > > > > 5. The issue is that for a client, when handling a metadata
> > > > > > response, the natural logic is if there is any error in the
> > > > > > response, go to the error handling path (e.g., back off and refresh
> > > > > > metadata). Otherwise, get the leader info and initiate a request to
> > > > > > the leader if leader is available. If you look at the current logic
> > > > > > in MetadataCache.getPartitionMetadata(), if an assigned replica is
> > > > > > not alive, we will send a REPLICA_NOT_AVAILABLE error code in the
> > > > > > response. If the client follows the above logic, it will keep doing
> > > > > > the error handling even though there is nothing wrong with the
> > > > > > leader. A better behavior is to simply return the list of replica
> > > > > > ids with no error code in this case.
> > > > >
> > > > >
> > > > > To be sure I understand this correctly: instead of returning the
> > > > > complete list of replicas, including the ones that errored as
> > > > > unavailable, you are suggesting to drop the unavailable ones and
> > > > > return just the replicas with no errors, and return no error code on
> > > > > the partition. Is that correct?
> > > > >
> > > > > Under what scenario does the MetadataCache have a replica that is
> > > > > not available?
> > > > >
> > > > > Thanks,
> > > > > Grant
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Mon, Apr 4, 2016 at 12:25 AM, Jun Rao <j...@confluent.io> wrote:
> > > > >
> > > > > > Grant,
> > > > > >
> > > > > > 2. Hmm, I am not sure about the listener approach. It ignores
> > > > > > configs like enable.topic.deletion and also opens the door for
> > > > > > potential ordering issues since now there are two separate paths
> > > > > > for propagating the metadata to the brokers. Could we just leave
> > > > > > out markedForDeletion for now? In the common case, if a topic is
> > > > > > deleted, it will only be in markedForDeletion state for a few
> > > > > > milliseconds anyway.
> > > > > >
> > > > > > 3. Yes, for those usages, it just seems a bit weird for the client
> > > > > > to issue a MetadataRequest to get the controller info since it
> > > > > > doesn't need any topic metadata.
> > > > > >
> > > > > > 5. The issue is that for a client, when handling a metadata
> > > > > > response, the natural logic is if there is any error in the
> > > > > > response, go to the error handling path (e.g., back off and refresh
> > > > > > metadata). Otherwise, get the leader info and initiate a request to
> > > > > > the leader if leader is available. If you look at the current logic
> > > > > > in MetadataCache.getPartitionMetadata(), if an assigned replica is
> > > > > > not alive, we will send a REPLICA_NOT_AVAILABLE error code in the
> > > > > > response. If the client follows the above logic, it will keep doing
> > > > > > the error handling even though there is nothing wrong with the
> > > > > > leader. A better behavior is to simply return the list of replica
> > > > > > ids with no error code in this case.
> > > > > >
> > > > > > Thanks,
> > > > > >
> > > > > > Jun
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Sun, Apr 3, 2016 at 6:29 PM, Grant Henke <ghe...@cloudera.com> wrote:
> > > > > >
> > > > > > > Responding to a few of the other comments:
> > > > > > >
> > > > > > > > it seems that you propagated the topic deletion marker by
> > > > > > > > having the replicaManager read from ZK directly. It seems that
> > > > > > > > it would be simpler/consistent if the controller propagates
> > > > > > > > that information directly through UpdateMetadataRequest.
> > > > > > >
> > > > > > >
> > > > > > > I was told that I should not try to modify controller logic with
> > > > > > > KIP-4 changes. It was indicated that a larger controller rewrite
> > > > > > > and testing was planned and those changes should be considered
> > > > > > > then. Since marking a topic for deletion doesn't flow through the
> > > > > > > controller, and therefore the UpdateMetadataRequest, it would
> > > > > > > take quite a bit of change. We would need to trigger a new
> > > > > > > UpdateMetadataRequest every time a new topic is marked for
> > > > > > > deletion.
> > > > > > >
> > > > > > > Instead I added a listener to maintain a cache of the topic
> > > > > > > deletion znodes in the ReplicaManager, where the existing
> > > > > > > UpdateMetadataRequests are handled. This would make it easy to
> > > > > > > swap out later, once the data is a part of that request, and have
> > > > > > > minimal impact in the meantime.
> > > > > > >
> > > > > > >
> > > > > > > > Could you add a description on how controller id will be used
> > > > > > > > in the client?
> > > > > > >
> > > > > > >
> > > > > > > I will add it to the wiki. Today metrics are the only way to
> > > > > > > access this piece of data. It is useful information about the
> > > > > > > cluster for many reasons. Having programmatic access to identify
> > > > > > > the controller is helpful for automation. For example, it can be
> > > > > > > used during rolling restart logic to shut down the controller
> > > > > > > last to prevent multiple failovers. Beyond automation, it can be
> > > > > > > leveraged in KIP-4 to route admin requests to the controller
> > > > > > > broker.
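> > > > > > >
> > > > > > > (A minimal sketch of the rolling-restart use, with hypothetical
> > > > > > > names; brokerIds and controllerId would come from the metadata
> > > > > > > response:)
> > > > > > >
> > > > > > > import java.util.ArrayList;
> > > > > > > import java.util.List;
> > > > > > >
> > > > > > > class RollingRestartOrder {
> > > > > > >     // Restart every other broker first; the controller goes last
> > > > > > >     // so the cluster fails over at most once during the restart.
> > > > > > >     static List<Integer> order(List<Integer> brokerIds, int controllerId) {
> > > > > > >         List<Integer> order = new ArrayList<>();
> > > > > > >         for (int id : brokerIds)
> > > > > > >             if (id != controllerId) order.add(id);
> > > > > > >         if (brokerIds.contains(controllerId)) order.add(controllerId);
> > > > > > >         return order;
> > > > > > >     }
> > > > > > > }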
> > > > > > >
> > > > > > > > We had a weird semantic in version 0 of MetadataRequest. If a
> > > > > > > > replica is not live, but the leader is live, we return an error
> > > > > > > > ReplicaNotAvailableException in the partition metadata. This
> > > > > > > > makes it a bit confusing for the client to parse since it has
> > > > > > > > to first check whether leader is available or not before error
> > > > > > > > code checking. We were thinking of changing that behavior the
> > > > > > > > next time we bump up the version of MetadataRequest. Now that
> > > > > > > > time has come, could you include that in the proposal?
> > > > > > >
> > > > > > >
> > > > > > > I am not sure I completely follow the issue and requested
> > > > > > > change. Could you point me to the discussion?
> > > > > > >
> > > > > > > Thank you,
> > > > > > > Grant
> > > > > > >
> > > > > > > On Sun, Apr 3, 2016 at 8:09 PM, Grant Henke <ghe...@cloudera.com> wrote:
> > > > > > >
> > > > > > > > Hi Jun and Ismael,
> > > > > > > >
> > > > > > > > Initially I had 2 booleans used to indicate if a topic was
> > > > > > > > internal and if a topic was marked for deletion. To save space
> > > > > > > > on large deployments, Ismael suggested I break out the internal
> > > > > > > > topics and deleted topics into their own lists. The idea was
> > > > > > > > that instead of 2 bytes added per topic, in the general case
> > > > > > > > the lists would be empty. Even in those lists I still only
> > > > > > > > return topics that were requested. In fact on the client side
> > > > > > > > they are just utilized to translate back to booleans. I do
> > > > > > > > prefer the booleans from an expressiveness standpoint but was
> > > > > > > > not strongly opinionated on the structure.
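> > > > > > > >
> > > > > > > > (For comparison, a rough sketch of the two layouts discussed;
> > > > > > > > hypothetical types only:)
> > > > > > > >
> > > > > > > > import java.util.Collections;
> > > > > > > > import java.util.Set;
> > > > > > > >
> > > > > > > > // Option A: two extra booleans carried on every topic in the response.
> > > > > > > > class PerTopicFlags {
> > > > > > > >     boolean isInternal;
> > > > > > > >     boolean markedForDeletion;
> > > > > > > > }
> > > > > > > >
> > > > > > > > // Option B: top-level lists that are empty in the common case; the
> > > > > > > > // client translates them back into the booleans of option A.
> > > > > > > > class TopLevelLists {
> > > > > > > >     Set<String> internalTopics = Collections.emptySet();
> > > > > > > >     Set<String> topicsMarkedForDeletion = Collections.emptySet();
> > > > > > > >
> > > > > > > >     boolean isInternal(String topic) {
> > > > > > > >         return internalTopics.contains(topic);
> > > > > > > >     }
> > > > > > > >     boolean isMarkedForDeletion(String topic) {
> > > > > > > >         return topicsMarkedForDeletion.contains(topic);
> > > > > > > >     }
> > > > > > > > }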
> > > > > > > >
> > > > > > > > Thank you,
> > > > > > > > Grant
> > > > > > > >
> > > > > > > > On Sun, Apr 3, 2016 at 8:00 PM, Ismael Juma <ism...@juma.me.uk> wrote:
> > > > > > > >
> > > > > > > >> Hi Jun,
> > > > > > > >>
> > > > > > > >> A couple of comments inline.
> > > > > > > >>
> > > > > > > >> On Sun, Apr 3, 2016 at 5:55 PM, Jun Rao <j...@confluent.io> wrote:
> > > > > > > >>
> > > > > > > >> > 1. It seems a bit weird to return just a list of internal
> > > > > > > >> > topics w/o the corresponding metadata. It also seems a bit
> > > > > > > >> > weird to return the internal topics even if the client
> > > > > > > >> > doesn't ask for it.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> Good point.
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> > Would it be better to just add a flag in topic_metadata to
> > > > > > > >> > indicate whether it's an internal topic or not, and only
> > > > > > > >> > include the internal topics when they are asked (or all
> > > > > > > >> > topics are requested) for?
> > > > > > > >> > are requested) for?
> > > > > > > >> >
> > > > > > > >>
> > > > > > > >> The disadvantage of this is that we are adding one byte per
> > > topic
> > > > > even
> > > > > > > >> though we have a very small number of internal topics
> > > (currently a
> > > > > > > single
> > > > > > > >> internal topic). It seems a bit wasteful and particularly so
> > > when
> > > > > > using
> > > > > > > >> regex subscriptions (since we have to retrieve all topics in
> > > that
> > > > > > case).
> > > > > > > >>
> > > > > > > >> > 2. A similar comment on topics_marked_for_deletion. Would
> > > > > > > >> > it be better to only return them when asked for and just
> > > > > > > >> > return a new TopicDeleted error code in topic_metadata?
> > > > > > > >>
> > > > > > > >>
> > > > > > > >> I agree that this seems better.
> > > > > > > >>
> > > > > > > >> Ismael
> > > > > > > >>
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Grant Henke
> > > > > > > > Software Engineer | Cloudera
> > > > > > > > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Grant Henke
> > > > > > > Software Engineer | Cloudera
> > > > > > > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Grant Henke
> > > > > Software Engineer | Cloudera
> > > > > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Grant Henke
> > > Software Engineer | Cloudera
> > > gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
> > >
> >
>
>
>
> --
> Grant Henke
> Software Engineer | Cloudera
> gr...@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke
>



-- 
-- Guozhang
