> Couples of notes:
> 
> 1.
> 
> > In the LedgerDeletionService  start, it will create  a producer to send
> > pending delete ledger.
> > When delete a ledger,  a PendingDeleteLedgerInfo msg with 1 min delay (the
> > delay is for consumer side, if send it immediately, maybe the metadata
> > din't change when consumer receive it). After the send operation succeeds,
> >  then to operate metadata. If send msg failed, we think this deletion
> > operation failed, and didn't operate metadata.
> 
> 
> This section is completely unclear to me. Can you please provide step by
> step exactly what will happen in the workflow?
> Who does what and where (node)?

In PulsarService, it defines ledgerDeletionService. The ledgerDeletionService 
will create a producer and a consumer, which looks like topicPoliciesService. 
The PulsarService passes it to ManagedLedger. 
When pulsar wants to delete a ledger, ManagedLedger uses ledgerDeletionService 
to send a message, the message content contains the waiting delete ledger info. 
After sending success, delete the ledger id from the metadata store. 
The consumer receives the message, it will use PulsarClient to send a delete 
command to the corresponding broker, the broker receives delete command, and do 
the actual delete operation.

In https://github.com/apache/pulsar/issues/16569, these are some pictures for 
the workflow.


> 
> 2.
> 
> > /**
> >      * The ledger component . managed-ledger, managed-cursor and
> > schema-storage.
> >      */
> >     private LedgerComponent ledgerComponent;
> 
> 
> Why is this needed?
> What do you mean by a component of a ledger? Is the ledger divided into
> components?

It's the ledger source, (MANAGED_LEDGER,MANAGED_CURSOR,SCHEMA_STORAGE)
When the broker wants to delete a ledger, we will check if the bookkeeper 
metadata matches or not. In the pulsar, it will mark the ledger source when 
creating a new ledger. See LedgerMetadataUtils.

> 
> 3.
> 
> >   /**
> >      * The ledger type. ledger or offload-ledger.
> >      */
> >     private LedgerType ledgerType;
> 
> 
> I don't understand why you need this type.
It marks the ledger as a normal ledger or an offload ledger, broker need it to 
determine whether to delete bookkeeper data or offload data.

> 
> 4.
> 
> > private MLDataFormats.ManagedLedgerInfo.LedgerInfo context;
> 
> 
> Context is a very generic word, but your type is so specific. Can you
> please explain why you need this for?
> 
> Are you sure you want to tie one data structure into the other - just
> validating.
It's for the offloaded ledger, when we want to delete the offload ledger, we 
need offloadedContextUuid, here we can simplify it to offloadContextUuid.

> 
> 5.
> 
> >   /**
> >      * Extent properties.
> >      */
> >     private Map<String, String> properties = new HashMap<>();
> 
> 
> Why is this needed?
It's for extended.

> 
> 
> 6.
> 
> > When receiving a pending delete ledger msg, we will check if the topic
> > still exists. If the topic exists, send a delete command
> > (PendingDelteLedger) to the broker which owns the topic. In the broker, it
> > will check if the ledger is still in the metadata store, if the ledger in
> > the metadata store means the ledger is still in use, give up to delete this
> > ledge
> 
> 
> I don't understand this workflow. You say you check if it's in the metadata
> store, and if it is , then it is used - what will make it unused?
> Can you explain the starting point? How does deletion work in general?
> When? What happens? ... I understand there are time based triggers, and
> sometimes used based triggers. They are somehow marked in metadata.
In https://github.com/apache/pulsar/issues/16569. The first step section and 
second step section process flow picture are detailed.

> 
> 7.
> 
> If we delete successfully, the consumer will ack this msg. If delete fails,
> > reconsume this msg 10 min later.
> 
> 
> Where did you define 10min?
> Why 10 min?
If delete fails, that means the storage system occur some problems. I guess the 
storage system will recovery in 10 mins.  

In https://github.com/apache/pulsar/issues/16569, we define 
reconsumeLaterOfTopicTwoPhaseDeletionInSeconds in the ServiceConfiguration, 
it's configurable.
private int reconsumeLaterOfTopicTwoPhaseDeletionInSeconds = 600;

> 
> You mentioned you are working with a client which has retries configured.
> Retry is client side based, ack one message while producing another,
> transaction free. Are you prepared to handle a case where you acked but
> failed to produce the message, hence you lost it completely?
> 
The pulsarClient only sends a new message that succeeds, then ack the origin 
message, so didn't care in this case.

> 8.
> 
> > If we want to delete a topic, we should send the delete ledger msg to
> > system topic and remove ledger id from metadata one by one, after all the
> > ledger has been deleted, then delete the topic. If any ledger operate
> > failed, we think this delete topic operation failed and return the left
> > ledger id in the response.
> 
> I couldn't understand. Can you please clarify this section. How exactly
> topic deletion is modified to use the features described in this pip?
> 
We need to ensure that all ledgers are deleted before the topic is deleted, 
otherwise, there will be orphan ledgers.

> 9.> 
> Regarding configuration. I suggest we prefix all config keys with the
> feature name so we can easily separate them.
That's fine.

> 
> 10.
> Backward compatibility - I would make sure to document in the docs exactly
> what to do in case we need to rollback (cook book).
Well.

> 
> 11.
> General comment - You're basically implementing a bespoke workflow using a
> topic to save the state of where you are in the topic.
> Is this the only place in Pulsar (delete ledger) that an action is composed
> of several steps ?
> If the answer is no to this, wouldn't it be better to have a small utility
> which is in charge of moving through the workflow steps? It can even be a
> simple state enum, where you move your state from a to b to c to d and it
> is persisted.
We need to persist in the middle steps, and we didn't want to operate the 
metadata store continually, so used pulsar to persist it.

 
> 12. Monitoring
> Some actions here can take a long time. We're basically relying on logs to
> monitor where we are in the flow?
yes, we didn't trace the ledger deletion steps. we only use stats to record 
whether the delete operation succeeds or not.

Reply via email to