Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-29 Thread Viktor Somogyi-Vass
Hi folks,

Perhaps one option is to only rename stray partitions to
"whatever-topic-x.stray" when processing the LeaderAndIsrRequest (LAIR) and
delete them with a periodic task (so not after a fixed delay, but with a
thread that scans for and deletes them periodically). I think this has an
advantage: it is similar to the approach used for deletion and compaction,
and it won't cause immediate mass deletion.

Viktor
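
A minimal sketch of the rename-then-sweep idea in the message above, not
Kafka code. The function names, the ".stray" suffix handling, and the
max_per_sweep throttle are all assumptions made for illustration; the
throttle is just one way to avoid mass deletion in a single pass.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List

STRAY_SUFFIX = ".stray"  # assumed suffix, modelled on "whatever-topic-x.stray"


def mark_stray(partition_dir: Path) -> Path:
    """Rename a partition directory so a later sweep can find it."""
    target = partition_dir.with_name(partition_dir.name + STRAY_SUFFIX)
    partition_dir.rename(target)
    return target


def sweep_strays(log_dir: Path, max_per_sweep: int = 1) -> List[str]:
    """One pass of the periodic task: delete at most max_per_sweep stray dirs."""
    strays = sorted(
        p for p in log_dir.iterdir()
        if p.is_dir() and p.name.endswith(STRAY_SUFFIX)
    )
    deleted = []
    for path in strays[:max_per_sweep]:
        shutil.rmtree(path)
        deleted.append(path.name)
    return deleted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        (log_dir / "whatever-topic-0").mkdir()
        mark_stray(log_dir / "whatever-topic-0")
        print(sweep_strays(log_dir))
```

A real broker would call something like sweep_strays from a scheduler
thread; the interval and per-pass limit would presumably be configurable.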


Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-16 Thread Colin McCabe
On Thu, Jan 16, 2020, at 10:29, Dhruvil Shah wrote:
> Hi Colin,
> 
> That’s fair, though I am unsure if a delay + metric + log message would
> really serve our purpose. There would be no action required from the
> operator in almost all cases. A signal that is not actionable in 99% of
> cases may not be very useful, in my opinion.

As I understand it, the case we're trying to solve is where a broker has gone 
away for a while and then comes back, but some of its partitions have been 
moved to a different broker.  Because this case is already relatively rare, I 
don't think we need to worry too much about adding non-actionable signals.

Maybe more importantly, broker downtime will also independently trigger alerts 
in a well-managed cluster.  So what we are adding is a metric that indicates 
that "something bad is happening" that is highly correlated with other 
"something bad is happening" metrics.  This is similar to URPs, or even 
under-min-isr partitions, which are all worth monitoring and possibly alerting 
on, and which will all tend to show activity at the same time.

> 
> Additionally, if we add in a delay, we would need to reason about the
> behavior when the same topic is recreated while a stray partition has been
> queued for deletion.
> 

This is a good question, but I think the current code already handles a very 
similar case.  The broker currently handles topic deletions in a two-step 
process.  The first step is renaming the topic directory.  The directory's new 
name will contain a UUID and end with .deleted.  The second step is actually 
deleting the directory.  (It was done in this way to allow deletion to be done 
asynchronously.)  I would expect the proposed delay mechanism to do something 
like this, such that a new topic created with the same name would not have a 
name collision.
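
The two-step process described above can be sketched as follows. This is
an illustration only: the function names are hypothetical, and the
"<name>.<uuid>.deleted" layout follows the description in this message
rather than the broker's actual on-disk naming, which may differ.

```python
import shutil
import tempfile
import uuid
from pathlib import Path

DELETED_SUFFIX = ".deleted"  # suffix as described in the message above


def rename_for_deletion(partition_dir: Path) -> Path:
    """Step 1: rename the directory to a unique name ending in .deleted."""
    unique = uuid.uuid4().hex
    target = partition_dir.with_name(
        f"{partition_dir.name}.{unique}{DELETED_SUFFIX}"
    )
    partition_dir.rename(target)
    return target


def purge_renamed(log_dir: Path) -> int:
    """Step 2: remove every renamed directory; returns how many were removed."""
    count = 0
    for path in list(log_dir.iterdir()):
        if path.is_dir() and path.name.endswith(DELETED_SUFFIX):
            shutil.rmtree(path)
            count += 1
    return count


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log_dir = Path(tmp)
        (log_dir / "orders-0").mkdir()
        rename_for_deletion(log_dir / "orders-0")
        # The original name is free again, so a recreated topic does not collide.
        (log_dir / "orders-0").mkdir()
        print(purge_renamed(log_dir))
```

Because the UUID makes each renamed directory unique, the same topic can be
deleted and recreated several times before the second step runs.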

> I would be in support of adding a configuration to disable stray partition
> deletion. This way, if users find abnormal behavior when testing /
> upgrading development environments, they could choose to disable the
> feature altogether.
> 
> Let me know what you think. It would be good to hear what others think as
> well.

I feel strongly that this should come with a delay period and advance warning.  
We just had too much pain with lost data as a result of bugs in HDFS leading to 
rapid deletion.  These bugs didn't manifest in testing or routine upgrades.

best,
Colin



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-16 Thread Dhruvil Shah
Hi Colin,

That’s fair, though I am unsure if a delay + metric + log message would
really serve our purpose. There would be no action required from the
operator in almost all cases. A signal that is not actionable in 99% of
cases may not be very useful, in my opinion.

Additionally, if we add in a delay, we would need to reason about the
behavior when the same topic is recreated while a stray partition has been
queued for deletion.

I would be in support of adding a configuration to disable stray partition
deletion. This way, if users find abnormal behavior when testing /
upgrading development environments, they could choose to disable the
feature altogether.

Let me know what you think. It would be good to hear what others think as
well.

Thanks,
Dhruvil



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-15 Thread Colin McCabe
On Wed, Jan 15, 2020, at 03:54, Dhruvil Shah wrote:
> Hi Colin,
> 
> We could add a configuration to disable stray partition deletion if needed,
> but I wasn't sure if an operator would really want to disable it. Perhaps
> if the implementation were buggy, the configuration could be used to
> disable the feature until a bug fix is made. Is that the kind of use case
> you were thinking of?
> 
> I was thinking that there would not be any delay between detection and
> deletion of stray logs. We would schedule an async task to do the actual
> deletion though.

Based on my experience in HDFS, immediately deleting data that looks out of 
place can cause severe issues when a bug occurs.  See 
https://issues.apache.org/jira/browse/HDFS-6186 for details.  So I really do 
think there should be a delay, and a metric + log message in the meantime to 
alert the operators to what is about to happen.
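
One way to picture the "delay, plus a metric and log message in the
meantime" suggestion is the sketch below. It is not taken from the KIP:
the class name, the pending_count gauge, and the explicit now argument
(used instead of a wall clock so the behaviour is easy to follow) are all
assumptions.

```python
import logging
from typing import Dict, List

log = logging.getLogger("stray-partitions")  # hypothetical logger name


class PendingStrayDeletions:
    """Holds stray partitions for a grace period before they may be deleted."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay_seconds = delay_seconds
        self._queued_at: Dict[str, float] = {}

    def enqueue(self, partition: str, now: float) -> None:
        """Record a stray partition and warn the operator ahead of deletion."""
        if partition not in self._queued_at:
            self._queued_at[partition] = now
            log.warning(
                "Stray partition %s will be deleted in %.0f seconds",
                partition, self.delay_seconds,
            )

    @property
    def pending_count(self) -> int:
        """Value a metrics reporter could expose as a gauge."""
        return len(self._queued_at)

    def pop_due(self, now: float) -> List[str]:
        """Return, and stop tracking, partitions whose grace period has passed."""
        due = sorted(
            p for p, queued in self._queued_at.items()
            if now - queued >= self.delay_seconds
        )
        for partition in due:
            del self._queued_at[partition]
        return due
```

A non-zero pending_count during the grace period is the advance warning;
only partitions returned by pop_due would be handed to the deletion step.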

best,
Colin



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-15 Thread Dhruvil Shah
Hi Colin,

We could add a configuration to disable stray partition deletion if needed,
but I wasn't sure if an operator would really want to disable it. Perhaps
if the implementation were buggy, the configuration could be used to
disable the feature until a bug fix is made. Is that the kind of use case
you were thinking of?

I was thinking that there would not be any delay between detection and
deletion of stray logs. We would schedule an async task to do the actual
deletion though.

Thanks,
Dhruvil



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-14 Thread Colin McCabe
Hi Dhruvil,

Thanks for the KIP.  I think there should be some way to turn this off, in case 
that becomes necessary.  I'm also curious how long we intend to wait between 
detecting the duplication and deleting the extra logs.  The KIP says 
"scheduled for deletion" but doesn't give a time frame -- is it assumed to be 
immediate?

best,
Colin




Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-14 Thread Dhruvil Shah
If there are no more questions or concerns, I will start a vote thread
tomorrow.

Thanks,
Dhruvil



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-13 Thread Dhruvil Shah
Hi Nikhil,

Thanks for looking at the KIP. The kind of race condition you mention is
not possible as stray partition detection is done synchronously while
handling the LeaderAndIsrRequest. In other words, we atomically evaluate
the partitions the broker must host and the extra partitions it is hosting
and schedule deletions based on that.
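
The atomic evaluation described above can be pictured as a set difference
taken under a lock. The sketch below is illustrative only: the class and
method names are hypothetical, partitions are modelled as (topic, index)
tuples, and it assumes the request carries the complete set of partitions
the broker should host.

```python
import threading
from typing import Iterable, List, Set, Tuple

Partition = Tuple[str, int]  # (topic name, partition index)


class ReplicaState:
    """Tracks which partitions a broker hosts and which are stray."""

    def __init__(self, hosted: Iterable[Partition]) -> None:
        self._lock = threading.Lock()
        self._hosted: Set[Partition] = set(hosted)
        self.scheduled_for_deletion: List[Partition] = []

    def handle_leader_and_isr(self, assigned: Iterable[Partition]) -> List[Partition]:
        """Compute strays and schedule them while holding the lock, so no
        other request can change the hosted set in between."""
        assigned_set = set(assigned)
        with self._lock:
            strays = sorted(self._hosted - assigned_set)
            self.scheduled_for_deletion.extend(strays)
            self._hosted = assigned_set
            return strays

    def hosted(self) -> Set[Partition]:
        with self._lock:
            return set(self._hosted)
```

For example, a broker hosting ("a", 0), ("a", 1) and ("b", 0) that is told
to host only ("a", 0) and ("c", 0) would schedule ("a", 1) and ("b", 0)
for deletion.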

One possible shortcoming of the KIP is that we do not have the ability to
detect a stray partition if the topic has been recreated since. We will
have the ability to disambiguate between different generations of a
partition with KIP-516.

Thanks,
Dhruvil



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-10 Thread Nikhil Bhatia
Thanks Dhruvil, the proposal looks reasonable to me.

Is there potential for a race when a new topic is assigned to the same
node while it is still cleaning up the stray partition? Topic IDs will
definitely solve this issue.

Thanks
Nikhil



Re: [DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-05 Thread Dhruvil Shah
Here is the link to the KIP:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-550%3A+Mechanism+to+Delete+Stray+Partitions+on+Broker



[DISCUSS] KIP-550: Mechanism to Delete Stray Partitions on Broker

2020-01-05 Thread Dhruvil Shah
Hi all, I would like to kick off discussion for KIP-550, which proposes a
mechanism to detect and delete stray partitions on a broker. Suggestions
and feedback are welcome.

- Dhruvil