Given that we are very close to the release, if we are changing the Metadata cache + topic deletion logic, I'd like a good number of system tests to appear with the patch.
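For example, one end-to-end check could trigger a delete the same way the topics script does (by creating a child znode under /admin/delete_topics) and then watch the metadata. A rough Java sketch, assuming a local broker and ZooKeeper and an existing topic named test-topic; this is not an actual test from the patch:

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class DeleteTopicMetadataCheck {
        public static void main(String[] args) throws Exception {
            // Mark the topic for deletion the same way the topics script does:
            // by creating a child znode under /admin/delete_topics.
            ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });
            zk.create("/admin/delete_topics/test-topic", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                // Today the topic simply vanishes from metadata once deletion
                // completes; with the proposed format this loop would instead
                // assert on the marked-for-deletion flag.
                while (consumer.listTopics().containsKey("test-topic")) {
                    Thread.sleep(100);
                }
            }
            zk.close();
        }
    }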
On Thu, Apr 7, 2016 at 10:53 AM, Gwen Shapira <g...@confluent.io> wrote:

> This will change some logic though, right?
>
> IIRC, right now produce/fetch requests to marked-for-deletion topics fail
> because the topics are simply not around. You get a generic "doesn't
> exist" error. If we keep these topics and add a flag, we'll need to find
> all the places with this implicit logic and correct for it.
>
> And since our tests for topic deletion are clearly inadequate... I'm a
> bit scared :)
>
> Gwen
>
> On Thu, Apr 7, 2016 at 10:34 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>
>> Hmm, I think since in the original protocol the metadata response did
>> not carry information for "marked for deletion" topics, we wanted to
>> remove such topics from the response by cleaning the metadata cache as
>> soon as a topic is marked for deletion.
>>
>> With the new format, I think it is OK to delay the metadata cleaning.
>>
>> Guozhang
>>
>> On Thu, Apr 7, 2016 at 8:35 AM, Grant Henke <ghe...@cloudera.com> wrote:
>>
>> > I am testing the marked for deletion flag in the metadata and ran into
>> > some challenges.
>> >
>> > It turns out that as soon as a topic is marked for deletion it may be
>> > purged from the metadata cache. This means that Metadata responses
>> > can't/don't return the topic, even though the topic may still exist if
>> > it is not yet ready to be completely deleted or is in the process of
>> > being deleted.
>> >
>> > This poses a challenge because a user has no way to tell whether a
>> > topic still exists and is marked for deletion, other than to try to
>> > recreate it and see a failure. I could change the logic to no longer
>> > purge a topic from the cache until it is completely deleted, but I am
>> > not sure whether that would negatively impact the clients in some way.
>> > Does anyone have enough background to say?
>> >
>> > I will dig into this a bit more today, but wanted to throw this out
>> > there for some early feedback.
>> >
>> > Thank you,
>> > Grant
>> >
>> > On Tue, Apr 5, 2016 at 8:02 PM, Jun Rao <j...@confluent.io> wrote:
>> >
>> > > 5. You will return no error and 4,5,6 as replicas. The response also
>> > > includes a list of live brokers, so the client can figure out that 5
>> > > is not live directly, without relying on the error code.
>> > >
>> > > Thanks,
>> > >
>> > > Jun
>> > >
>> > > On Tue, Apr 5, 2016 at 5:05 PM, Grant Henke <ghe...@cloudera.com>
>> > > wrote:
>> > >
>> > > > Hi Jun,
>> > > >
>> > > > See my responses below:
>> > > >
>> > > > > 2. The issues that I was thinking of are the following. (a) Say
>> > > > > the controller has topic deletion disabled and a topic deletion
>> > > > > request is submitted to ZK. In this case, the controller will
>> > > > > ignore the request; however, the broker may pick up the topic
>> > > > > deletion marker in a transient window. (b) Suppose that a topic
>> > > > > is deleted and then recreated immediately. It is possible for a
>> > > > > broker to see the newly created topic and then the previous
>> > > > > topic deletion marker in a transient window. Thinking about this
>> > > > > a bit more, both seem to be transient, so it may not be a big
>> > > > > concern, and I am ok with this as long as the interim solution
>> > > > > is not too complicated. Another thing to think through: if a
>> > > > > topic is marked for deletion, do we still return the
>> > > > > partition-level metadata?
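(To make Jun's point 5 above concrete: with no partition-level error code, a client can still work out each replica's liveness from the live-broker list that every metadata response already carries. A rough, self-contained Java sketch with illustrative values:)

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class ReplicaLiveness {
        public static void main(String[] args) {
            // From the broker list in the metadata response: brokers 4 and 6 are live.
            Set<Integer> liveBrokerIds = new HashSet<>(Arrays.asList(4, 6));
            // From the partition metadata: the full assigned replica list.
            List<Integer> assignedReplicas = Arrays.asList(4, 5, 6);

            for (int replica : assignedReplicas) {
                // Replica 5 is absent from the broker list, so the client can
                // tell it is down without any REPLICA_NOT_AVAILABLE error code.
                System.out.println("replica " + replica + " live: "
                        + liveBrokerIds.contains(replica));
            }
        }
    }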
>> > > > I am not changing anything about the metadata content, only adding
>> > > > a boolean based on the marked for deletion flag in zookeeper. This
>> > > > maintains the same approach that the topics script uses today. I do
>> > > > think delete improvements should be considered/reviewed. The goal
>> > > > here is to allow the broker to report the value that it sees, which
>> > > > is the value in zookeeper.
>> > > >
>> > > > > 5. The issue is the following. If you have a partition with 3
>> > > > > replicas 4,5,6, the leader is on replica 4, and replica 5 is
>> > > > > down, then currently the broker will send a REPLICA_NOT_AVAILABLE
>> > > > > error code and only replicas 4,6 in the assigned replicas. It's
>> > > > > more intuitive to send no error code and 4,5,6 in the assigned
>> > > > > replicas in this case.
>> > > >
>> > > > Should the list with no error code just be 4,6, since 5 is not
>> > > > available?
>> > > >
>> > > > On Mon, Apr 4, 2016 at 1:34 PM, Jun Rao <j...@confluent.io> wrote:
>> > > >
>> > > > > Grant,
>> > > > >
>> > > > > 2. The issues that I was thinking of are the following. (a) Say
>> > > > > the controller has topic deletion disabled and a topic deletion
>> > > > > request is submitted to ZK. In this case, the controller will
>> > > > > ignore the request; however, the broker may pick up the topic
>> > > > > deletion marker in a transient window. (b) Suppose that a topic
>> > > > > is deleted and then recreated immediately. It is possible for a
>> > > > > broker to see the newly created topic and then the previous
>> > > > > topic deletion marker in a transient window. Thinking about this
>> > > > > a bit more, both seem to be transient, so it may not be a big
>> > > > > concern, and I am ok with this as long as the interim solution
>> > > > > is not too complicated. Another thing to think through: if a
>> > > > > topic is marked for deletion, do we still return the
>> > > > > partition-level metadata?
>> > > > >
>> > > > > 3. Your explanation on controller id seems reasonable to me.
>> > > > >
>> > > > > 5. The issue is the following. If you have a partition with 3
>> > > > > replicas 4,5,6, the leader is on replica 4, and replica 5 is
>> > > > > down, then currently the broker will send a REPLICA_NOT_AVAILABLE
>> > > > > error code and only replicas 4,6 in the assigned replicas. It's
>> > > > > more intuitive to send no error code and 4,5,6 in the assigned
>> > > > > replicas in this case.
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > Jun
>> > > > >
>> > > > > On Mon, Apr 4, 2016 at 8:33 AM, Grant Henke <ghe...@cloudera.com>
>> > > > > wrote:
>> > > > >
>> > > > > > Hi Jun,
>> > > > > >
>> > > > > > Please see my responses below:
>> > > > > >
>> > > > > > > Hmm, I am not sure about the listener approach. It ignores
>> > > > > > > configs like enable.topic.deletion and also opens the door
>> > > > > > > for potential ordering issues, since now there are two
>> > > > > > > separate paths for propagating the metadata to the brokers.
>> > > > > >
>> > > > > > This mechanism is very similar to how deletes are tracked on
>> > > > > > the controller itself. It is also the same way ACLs are tracked
>> > > > > > on brokers in the default implementation.
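(For concreteness, the kind of listener being described might look roughly like this against the plain ZooKeeper client. This is a sketch only, with error handling and watcher re-registration on session events omitted; it is not the actual patch:)

    import java.util.Collections;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DeleteTopicsCache implements Watcher {
        private final ZooKeeper zk;
        private volatile Set<String> markedForDeletion = Collections.emptySet();

        public DeleteTopicsCache(ZooKeeper zk) throws Exception {
            this.zk = zk;
            refresh();
        }

        // getChildren re-registers this watcher, so the cache hears about
        // the next change to /admin/delete_topics as well.
        private void refresh() throws KeeperException, InterruptedException {
            markedForDeletion =
                    new HashSet<>(zk.getChildren("/admin/delete_topics", this));
        }

        @Override
        public void process(WatchedEvent event) {
            if (event.getType() == Event.EventType.NodeChildrenChanged) {
                try {
                    refresh();
                } catch (Exception e) {
                    // a real implementation would log and retry here
                }
            }
        }

        public boolean isMarkedForDeletion(String topic) {
            return markedForDeletion.contains(topic);
        }
    }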
>> > > > > > I am not sure I understand what ordering issue there could be.
>> > > > > > This is used to report which topics are marked for deletion,
>> > > > > > which today has no dependency on enable.topic.deletion. I agree
>> > > > > > that the delete mechanism in Kafka has a lot of room for
>> > > > > > improvement, but the goal of this change is just to enable
>> > > > > > reporting it to the user, not to fix/improve existing issues.
>> > > > > > If you have an alternate approach that does not require major
>> > > > > > changes to the controller code, I would be open to
>> > > > > > investigating it.
>> > > > > >
>> > > > > > > Could we just leave out markedForDeletion for now? In the
>> > > > > > > common case, if a topic is deleted, it will only be in
>> > > > > > > markedForDeletion state for a few milliseconds anyway.
>> > > > > >
>> > > > > > I don't think we should leave it out. The point of these
>> > > > > > changes is to prevent a user from needing to talk directly to
>> > > > > > zookeeper. We need a way for a user to see whether a topic has
>> > > > > > been marked for deletion. Given the issues with the current
>> > > > > > delete implementation, it's fairly common for a topic to remain
>> > > > > > marked for deletion for quite some time.
>> > > > > >
>> > > > > > > Yes, for those usages, it just seems a bit weird for the
>> > > > > > > client to issue a MetadataRequest to get the controller info
>> > > > > > > since it doesn't need any topic metadata.
>> > > > > >
>> > > > > > Why does this seem weird? The MetadataRequest is the request
>> > > > > > used to discover the cluster and metadata about that cluster,
>> > > > > > regardless of the topics you are interested in, if any. In
>> > > > > > fact, a big motivation for the change to allow requesting "no
>> > > > > > topics" is that the existing producer and consumer often want
>> > > > > > to learn about the cluster without asking for topic metadata,
>> > > > > > and today that means they request all topics.
>> > > > > >
>> > > > > > > 5. The issue is that for a client, when handling a metadata
>> > > > > > > response, the natural logic is: if there is any error in the
>> > > > > > > response, go to the error-handling path (e.g., back off and
>> > > > > > > refresh metadata); otherwise, get the leader info and
>> > > > > > > initiate a request to the leader if the leader is available.
>> > > > > > > If you look at the current logic in
>> > > > > > > MetadataCache.getPartitionMetadata(), if an assigned replica
>> > > > > > > is not alive, we will send a REPLICA_NOT_AVAILABLE error code
>> > > > > > > in the response. If the client follows the above logic, it
>> > > > > > > will keep doing the error handling even though there is
>> > > > > > > nothing wrong with the leader. A better behavior is to simply
>> > > > > > > return the list of replica ids with no error code in this
>> > > > > > > case.
>> > > > > >
>> > > > > > To be sure I understand this correctly: instead of returning
>> > > > > > the complete list of replicas, including the ones that errored
>> > > > > > as unavailable, you are suggesting to drop the unavailable
>> > > > > > ones, return just the replicas with no errors, and return no
>> > > > > > error code on the partition. Is that correct?
>> > > > > >
>> > > > > > Under what scenario does the MetadataCache have a replica that
>> > > > > > is not available?
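(Jun's "natural logic" above, sketched as self-contained, illustrative Java — not the actual client code; the types are minimal stand-ins for the response fields:)

    public class MetadataHandlingSketch {
        enum Error { NONE, REPLICA_NOT_AVAILABLE }

        // Minimal stand-in for partition metadata in a response.
        static class PartitionMetadata {
            Error error = Error.REPLICA_NOT_AVAILABLE;
            Integer leader = 4; // the leader is actually fine
        }

        public static void main(String[] args) {
            PartitionMetadata pm = new PartitionMetadata();
            if (pm.error != Error.NONE) {
                // Taken today for REPLICA_NOT_AVAILABLE: the client backs off
                // and refreshes metadata even though the leader is healthy.
                System.out.println("error path: back off and refresh metadata");
            } else if (pm.leader != null) {
                System.out.println("send request to leader " + pm.leader);
            }
        }
    }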
>> > > > > > Thanks,
>> > > > > > Grant
>> > > > > >
>> > > > > > On Mon, Apr 4, 2016 at 12:25 AM, Jun Rao <j...@confluent.io>
>> > > > > > wrote:
>> > > > > >
>> > > > > > > Grant,
>> > > > > > >
>> > > > > > > 2. Hmm, I am not sure about the listener approach. It ignores
>> > > > > > > configs like enable.topic.deletion and also opens the door
>> > > > > > > for potential ordering issues, since now there are two
>> > > > > > > separate paths for propagating the metadata to the brokers.
>> > > > > > > Could we just leave out markedForDeletion for now? In the
>> > > > > > > common case, if a topic is deleted, it will only be in
>> > > > > > > markedForDeletion state for a few milliseconds anyway.
>> > > > > > >
>> > > > > > > 3. Yes, for those usages, it just seems a bit weird for the
>> > > > > > > client to issue a MetadataRequest to get the controller info
>> > > > > > > since it doesn't need any topic metadata.
>> > > > > > >
>> > > > > > > 5. The issue is that for a client, when handling a metadata
>> > > > > > > response, the natural logic is: if there is any error in the
>> > > > > > > response, go to the error-handling path (e.g., back off and
>> > > > > > > refresh metadata); otherwise, get the leader info and
>> > > > > > > initiate a request to the leader if the leader is available.
>> > > > > > > If you look at the current logic in
>> > > > > > > MetadataCache.getPartitionMetadata(), if an assigned replica
>> > > > > > > is not alive, we will send a REPLICA_NOT_AVAILABLE error code
>> > > > > > > in the response. If the client follows the above logic, it
>> > > > > > > will keep doing the error handling even though there is
>> > > > > > > nothing wrong with the leader. A better behavior is to simply
>> > > > > > > return the list of replica ids with no error code in this
>> > > > > > > case.
>> > > > > > >
>> > > > > > > Thanks,
>> > > > > > >
>> > > > > > > Jun
>> > > > > > >
>> > > > > > > On Sun, Apr 3, 2016 at 6:29 PM, Grant Henke
>> > > > > > > <ghe...@cloudera.com> wrote:
>> > > > > > >
>> > > > > > > > Responding to a few of the other comments:
>> > > > > > > >
>> > > > > > > > > it seems that you propagated the topic deletion marker by
>> > > > > > > > > having the replicaManager read from ZK directly. It seems
>> > > > > > > > > that it would be simpler/more consistent if the
>> > > > > > > > > controller propagated that information directly through
>> > > > > > > > > UpdateMetadataRequest.
>> > > > > > > >
>> > > > > > > > I was told that I should not try to modify controller logic
>> > > > > > > > with KIP-4 changes. It was indicated that a larger
>> > > > > > > > controller rewrite and testing effort was planned, and
>> > > > > > > > those changes should be considered then.
>> > > > > > > > Since marking a topic for deletion doesn't flow through
>> > > > > > > > the controller, and therefore the UpdateMetadataRequest, it
>> > > > > > > > would take quite a bit of change. We would need to trigger
>> > > > > > > > a new UpdateMetadataRequest every time a new topic is
>> > > > > > > > marked for deletion.
>> > > > > > > >
>> > > > > > > > Instead I added a listener to maintain a cache of the topic
>> > > > > > > > deletion znodes in the ReplicaManager, where the existing
>> > > > > > > > UpdateMetadataRequests are handled. This makes it easy to
>> > > > > > > > swap out later, once the data is part of that request, and
>> > > > > > > > has minimal impact in the meantime.
>> > > > > > > >
>> > > > > > > > > Could you add a description of how the controller id will
>> > > > > > > > > be used in the client?
>> > > > > > > >
>> > > > > > > > I will add it to the wiki. Today metrics are the only way
>> > > > > > > > to access this piece of data. It is useful information
>> > > > > > > > about the cluster for many reasons. Having programmatic
>> > > > > > > > access to identify the controller is helpful for
>> > > > > > > > automation. For example, it can be used during
>> > > > > > > > rolling-restart logic to shut down the controller last, to
>> > > > > > > > prevent multiple failovers. Beyond automation, it can be
>> > > > > > > > leveraged in KIP-4 to route admin requests to the
>> > > > > > > > controller broker.
>> > > > > > > >
>> > > > > > > > > We had a weird semantic in version 0 of MetadataRequest.
>> > > > > > > > > If a replica is not live, but the leader is live, we
>> > > > > > > > > return a ReplicaNotAvailableException error in the
>> > > > > > > > > partition metadata. This makes it a bit confusing for the
>> > > > > > > > > client to parse, since it has to first check whether the
>> > > > > > > > > leader is available before checking the error code. We
>> > > > > > > > > were thinking of changing that behavior the next time we
>> > > > > > > > > bump the version of MetadataRequest. Now that time has
>> > > > > > > > > come; could you include that in the proposal?
>> > > > > > > >
>> > > > > > > > I am not sure I completely follow the issue and the
>> > > > > > > > requested change. Could you point me to the discussion?
>> > > > > > > >
>> > > > > > > > Thank you,
>> > > > > > > > Grant
>> > > > > > > >
>> > > > > > > > On Sun, Apr 3, 2016 at 8:09 PM, Grant Henke
>> > > > > > > > <ghe...@cloudera.com> wrote:
>> > > > > > > >
>> > > > > > > > > Hi Jun and Ismael,
>> > > > > > > > >
>> > > > > > > > > Initially I had 2 booleans used to indicate whether a
>> > > > > > > > > topic was internal and whether a topic was marked for
>> > > > > > > > > deletion. To save space on large deployments, Ismael
>> > > > > > > > > suggested I break out the internal topics and deleted
>> > > > > > > > > topics into their own lists. The idea was that instead of
>> > > > > > > > > 2 bytes added per topic, in the general case the lists
>> > > > > > > > > would be empty.
>> > > > > > > > > Even in those lists I still only return topics that were
>> > > > > > > > > requested. In fact, on the client side they are just used
>> > > > > > > > > to translate back to booleans. I do prefer the booleans
>> > > > > > > > > from an expressiveness standpoint, but I am not strongly
>> > > > > > > > > opinionated on the structure.
>> > > > > > > > >
>> > > > > > > > > Thank you,
>> > > > > > > > > Grant
>> > > > > > > > >
>> > > > > > > > > On Sun, Apr 3, 2016 at 8:00 PM, Ismael Juma
>> > > > > > > > > <ism...@juma.me.uk> wrote:
>> > > > > > > > >
>> > > > > > > > >> Hi Jun,
>> > > > > > > > >>
>> > > > > > > > >> A couple of comments inline.
>> > > > > > > > >>
>> > > > > > > > >> On Sun, Apr 3, 2016 at 5:55 PM, Jun Rao
>> > > > > > > > >> <j...@confluent.io> wrote:
>> > > > > > > > >>
>> > > > > > > > >> > 1. It seems a bit weird to return just a list of
>> > > > > > > > >> > internal topics w/o the corresponding metadata. It
>> > > > > > > > >> > also seems a bit weird to return the internal topics
>> > > > > > > > >> > even if the client doesn't ask for them.
>> > > > > > > > >>
>> > > > > > > > >> Good point.
>> > > > > > > > >>
>> > > > > > > > >> > Would it be better to just add a flag in
>> > > > > > > > >> > topic_metadata to indicate whether it's an internal
>> > > > > > > > >> > topic or not, and only include the internal topics
>> > > > > > > > >> > when they are asked for (or when all topics are
>> > > > > > > > >> > requested)?
>> > > > > > > > >>
>> > > > > > > > >> The disadvantage of this is that we are adding one byte
>> > > > > > > > >> per topic even though we have a very small number of
>> > > > > > > > >> internal topics (currently a single internal topic). It
>> > > > > > > > >> seems a bit wasteful, particularly so when using regex
>> > > > > > > > >> subscriptions (since we have to retrieve all topics in
>> > > > > > > > >> that case).
>> > > > > > > > >>
>> > > > > > > > >> > 2. A similar comment on topics_marked_for_deletion.
>> > > > > > > > >> > Would it be better to only return them when asked
>> > > > > > > > >> > for, and just return a new TopicDeleted error code in
>> > > > > > > > >> > topic_metadata?
>> > > > > > > > >>
>> > > > > > > > >> I agree that this seems better.
>> > > > > > > > >>
>> > > > > > > > >> Ismael
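(A closing illustration of the client-side translation Grant describes — compact lists on the wire, booleans at the API. The names are illustrative, not the final protocol:)

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    public class TopicFlagsSketch {
        public static void main(String[] args) {
            // Stand-ins for the two lists carried in the metadata response;
            // in the common case both are empty, costing almost nothing.
            Set<String> internalTopics =
                    new HashSet<>(Arrays.asList("__consumer_offsets"));
            Set<String> topicsMarkedForDeletion =
                    new HashSet<>(Arrays.asList("old-topic"));

            // On the client side the lists translate back into per-topic booleans.
            String topic = "old-topic";
            boolean internal = internalTopics.contains(topic);
            boolean markedForDeletion = topicsMarkedForDeletion.contains(topic);
            System.out.println(topic + " internal=" + internal
                    + " markedForDeletion=" + markedForDeletion);
        }
    }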