Re: [DISCUSS]PIP-295: Fixing Chunk Message Duplication Issue

Xiangying Meng Wed, 23 Aug 2023 02:16:17 -0700

Hi Zike
Thank you for your attention.
>For the existing behavior, the consumer assembles messages 3,4,5 into the 
>original large message. But the changes brought about by this PIP will cause 
>the consumer to use messages 1,2,5 for assembly. There is no guarantee that 
>the The producer will split the message in the same way twice before and 
>after. For example, the producer's maxMessageSize may be different. This may 
>cause the consumer to receive a corrupt message.


For the previous behavior, if duplication is not enabled, messages 1,
2, 3, 4, and 5 will all be persisted. When the consumer receives
message 3  (sequenceID: 0, ChunkID: 0), it will find that the message
is out of order and rewind the cursor. Loop this operation, and
discard this message after it expires instead of assembling 3, 4, 5
into a message.
If duplication is enabled, the chunk messages 2, 3, 4, and 5 will be
filtered. The message also will be discarded.

>I think a simple better approach is to only check the deduplication for the 
>last chunk of the large message.

This solution also cannot solve the out-of-order messages inside the
chunks. For example, the above five messages will still be persisted.

On Wed, Aug 23, 2023 at 12:34 PM Zike Yang <[email protected]> wrote:
>
> Hi, xiangying,
>
> Thanks for your PIP.
>
> IIUC, this may change the existing behavior and may introduce inconsistencies.
> Suppose that we have a large message with 3 chunks. But the producer
> crashes and resends the message after sending the chunk-1. It will
> send a total of 5 messages to the Pulsar topic:
>
> 1. SequenceID: 0, ChunkID: 0
> 2. SequenceID: 0, ChunkID: 1
> 3. SequenceID: 0, ChunkID: 0   -> This message will be dropped
> 4. SequenceID: 0, ChunkID: 1    -> Will also be dropped
> 5. SequenceID: 0, ChunkID: 2    -> The last chunk of the message
>
> For the existing behavior, the consumer assembles messages 3,4,5 into
> the original large message. But the changes brought about by this PIP
> will cause the consumer to use messages 1,2,5 for assembly. There is
> no guarantee that the producer will split the message in the same way
> twice before and after. For example, the producer's maxMessageSize may
> be different. This may cause the consumer to receive a corrupt
> message.
>
> Also, this PIP increases the complexity of handling chunks on the
> broker side. Brokers should, in general, treat the chunk as a normal
> message.
>
> I think a simple better approach is to only check the deduplication
> for the last chunk of the large message. The consumer only gets the
> whole message after receiving the last chunk. We don't need to check
> the deduplication for all previous chunks. Also by doing this we only
> need bug fixes, we don't need to introduce a new PIP.
>
> BR,
> Zike Yang
>
> On Fri, Aug 18, 2023 at 7:54 PM Xiangying Meng <[email protected]> wrote:
> >
> > Dear Community,
> >
> > I hope this email finds you well. I'd like to address an important
> > issue related to Apache Pulsar and discuss a solution I've proposed on
> > GitHub. The problem pertains to the handling of Chunk Messages after
> > enabling deduplication.
> >
> > In the current version of Apache Pulsar, all chunks of a Chunk Message
> > share the same sequence ID. However, enabling the depublication
> > feature results in an inability to send Chunk Messages. To tackle this
> > problem, I've proposed a solution [1] that ensures messages are not
> > duplicated throughout end-to-end delivery. While this fix addresses
> > the duplication issue for end-to-end messages, there remains a
> > possibility of duplicate chunks within topics.
> >
> > To address this concern, I believe we should introduce a "Chunk ID
> > map" at the Broker level, similar to the existing "sequence ID map",
> > to facilitate effective filtering. However, implementing this has led
> > to a challenge: a producer requires storage for two Long values
> > simultaneously (sequence ID and chunk ID). Because the snapshot of the
> > sequence ID map is stored through the properties of the cursor
> > (Map<String, Long>), so in order to satisfy the storage of two Longs
> > (sequence ID, chunk ID) corresponding to one producer, we hope to add
> > a mark DeleteProperties (Map<String, Long>) String, String>) to
> > replace the properties (Map<String, Long>) field. To resolve this,
> > I've proposed an alternative proposal [2] involving the introduction
> > of a "mark DeleteProperties" (Map<String, String>) to replace the
> > current properties (Map<String, Long>) field.
> >
> > I'd appreciate it if you carefully review both PRs and share your
> > valuable feedback and insights. Thank you immensely for your time and
> > attention. I eagerly anticipate your valuable opinions and
> > recommendations.
> >
> > Warm regards,
> > Xiangying
> >
> > [1] https://github.com/apache/pulsar/pull/20948
> > [2] https://github.com/apache/pulsar/pull/21027

Re: [DISCUSS]PIP-295: Fixing Chunk Message Duplication Issue

Reply via email to