GitHub user artur-jablonski created a discussion: Effectively-once delivery 
guarantees when publishing to Pulsar topic from external source.

Hi, 

I am trying to understand under what conditions it is possible to get 
**EFFECTIVELY ONCE** delivery guarantees towards a Pulsar topic from some 
**external data source.** 

Based on these reads:

https://github.com/apache/pulsar/wiki/PIP-6:-Guaranteed-Message-Deduplication
https://streamnative.io/blog/exactly-once-semantics-transactions-pulsar

My understanding is that there are two options:

### Option 1:

Each discrete data item that is to be published from the external source to the 
Pulsar topic needs to be associated with a monotonically increasing sequence 
number, and it must be possible for the consumer of the data source to resume 
consuming from a given data item by its sequence number. 

In this scenario, server-side topic deduplication must be enabled on the 
target topic, and the Pulsar Producer must use the data items' sequence numbers 
when publishing to the Pulsar topic. When the Pulsar Producer connects to the 
topic, it asks the Pulsar Broker for the latest sequence number it has 
published on the topic. It must use that number to position the data-source 
reader at the next data item to be published and resume from there. 
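
The mechanism can be sketched with a plain-Python simulation (this is not the real Pulsar client; `FakeBroker` and its method names are mine, standing in for broker-side deduplication as described in PIP-6):

```python
# Simulation of Pulsar broker-side deduplication: the broker tracks the
# highest sequence id accepted per producer name and drops anything at
# or below it.

class FakeBroker:
    def __init__(self):
        self.last_seq = {}   # producer name -> highest sequence id accepted
        self.topic = []      # messages persisted on the topic

    def publish(self, producer_name, seq_id, payload):
        if seq_id <= self.last_seq.get(producer_name, -1):
            return False     # duplicate -> silently deduplicated
        self.last_seq[producer_name] = seq_id
        self.topic.append(payload)
        return True

    def get_last_sequence_id(self, producer_name):
        return self.last_seq.get(producer_name, -1)

broker = FakeBroker()
broker.publish("producer-1", 0, "a")
broker.publish("producer-1", 1, "b")
# A retry of seq 1 (e.g. after an ACK timeout) is deduplicated:
assert broker.publish("producer-1", 1, "b") is False
# After a producer restart, resume reading the source from last_seq + 1:
resume_from = broker.get_last_sequence_id("producer-1") + 1
```

The key point is that deduplication is keyed by producer name, which is why parallelism beyond one producer requires disjoint source splits with unique producer names.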

_Limitations:_

- Generally can't go over parallelism=1 on the external source, unless there's a 
scheme where the data source can be split into disjoint datasets, each consumed by 
a separate Pulsar Producer with a unique name, since Pulsar Broker-side 
deduplication works per producer name. 

- Pulsar Broker-side deduplication is per partition, so this needs to be taken 
into consideration if the target topic is partitioned.

_Notes:_

Pulsar Broker-side deduplication also works without associating sequence 
numbers with data items. In this case the Pulsar Producer uses an internal 
sequence number that is incremented with each published data item. When the 
Pulsar Producer times out waiting for the Broker's ACK for a given data item, 
it can attempt to republish the data using the same sequence number. This 
allows the Broker to deduplicate in case the data item was already successfully 
published. However, when the Pulsar Producer itself suffers a failure and 
restarts, then, since the sequence number is not associated in any way with the 
data items, the same data item may be published on the topic again and 
EFFECTIVELY ONCE guarantees are broken.
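
This failure mode can be demonstrated with the same kind of simulation (illustrative only; the dictionaries stand in for broker-side state, and the crash/restart story is a hypothetical scenario):

```python
# With producer-internal sequence numbers, the seq id is not tied to the
# data item's identity, so a replayed item gets a fresh id and is NOT
# deduplicated.

last_seq = {}   # producer name -> highest sequence id accepted
topic = []

def publish(name, seq_id, payload):
    if seq_id <= last_seq.get(name, -1):
        return False
    last_seq[name] = seq_id
    topic.append(payload)
    return True

source = ["a", "b"]
# First run: the producer publishes "a" (seq 0) and "b" (seq 1),
# but crashes before recording that "b" was handled.
publish("p1", 0, source[0])
publish("p1", 1, source[1])

# Restart: the producer resumes its counter from the broker's last seq id,
# but re-reads "b" from the source -> "b" is published again under seq 2.
next_seq = last_seq["p1"] + 1
publish("p1", next_seq, source[1])
assert topic == ["a", "b", "b"]   # duplicate: effectively-once is broken
```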

_Examples of data sources that can be used in this strategy:_

- flat text file where each line is considered a data item. The line number is 
used as the sequence number; after a restart, having fetched the last sequence 
number from the Pulsar Broker, it's possible to position the consumer at the 
appropriate line number. This assumes the file doesn't change in between. 

- Kafka topic. The Pulsar Producer can use the Kafka topic offset as the 
sequence number. After a restart, having fetched the last sequence number from 
the Pulsar Broker, the Pulsar Producer can position itself at the appropriate 
record in the Kafka topic. Note: Kafka maintains offsets per topic partition, 
so if the data source is a Kafka topic with more than one partition, each 
partition would need to be published by a separate Pulsar Producer with a 
unique name. 
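
The flat-file case can be sketched end to end (again a plain-Python simulation, not the real client; the in-memory `lines` list stands in for the file):

```python
# Option 1 with a flat file: the line number is the sequence id, so after a
# restart the producer can ask the broker for the last published sequence id
# and seek to the next line.

lines = ["line-0", "line-1", "line-2", "line-3"]

last_seq = -1   # broker-side: highest sequence id accepted for this producer
topic = []

def publish(seq_id, payload):
    global last_seq
    if seq_id <= last_seq:
        return False          # broker deduplicates
    last_seq = seq_id
    topic.append(payload)
    return True

# First run publishes lines 0-1, then the producer crashes.
for i in range(2):
    publish(i, lines[i])

# Restart: resume from the broker's last sequence id + 1.
for i in range(last_seq + 1, len(lines)):
    publish(i, lines[i])

assert topic == lines   # each line published exactly once
```

As noted above, this only holds if the file is immutable between runs: a line number is only a valid sequence id if it always denotes the same data item.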


### Option 2:

For this strategy the source needs:
 
- the ability to acknowledge processing of a data item

- the ability to publish a data item to the Pulsar topic and acknowledge 
processing of that data item ATOMICALLY

In this strategy the publishing to Pulsar topic and acknowledging back to the 
source is bound by transactional all-or-nothing semantics. 
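
The all-or-nothing semantics can be sketched as follows (a simulation of the idea, not Pulsar's actual Transaction API; the `Txn` class and its buffering scheme are mine):

```python
# Option 2: publishing to Pulsar and ACKing the source are bound together in
# one transaction. Both effects are buffered and applied only on commit; on
# abort neither happens, so the item can safely be reprocessed.

class Txn:
    def __init__(self, topic, source_acks):
        self.topic, self.source_acks = topic, source_acks
        self.pending = []          # (kind, value) effects buffered until commit

    def publish(self, payload):
        self.pending.append(("publish", payload))

    def ack_source(self, item_id):
        self.pending.append(("ack", item_id))

    def commit(self):
        for kind, value in self.pending:
            (self.topic if kind == "publish" else self.source_acks).append(value)
        self.pending = []

    def abort(self):
        self.pending = []          # neither the publish nor the ack happens

topic, acks = [], []

txn = Txn(topic, acks)
txn.publish("item-1")
txn.ack_source("id-1")
txn.commit()                       # both take effect together

txn = Txn(topic, acks)
txn.publish("item-2")
txn.ack_source("id-2")
txn.abort()                        # neither takes effect; safe to retry
```

Because the publish and the ack either both happen or neither does, a crash between them cannot leave an item that is published-but-unacked (replayed as a duplicate) or acked-but-unpublished (lost).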

_An example of a setup that can be used in this strategy:_

Apache Flink's Pulsar Sink uses Flink's two-phase-commit capabilities together 
with Pulsar's Transaction API, which becomes a participant in the 2PC, to make 
publishing to Pulsar and the ACK back to Flink atomic, and therefore achieves 
EFFECTIVELY ONCE semantics. 

--------------------------

_An example of a source that cannot have effectively-once semantics:_

If we take a JMS queue as the external data source we can say that:

- there is no sequence number that can be assigned to data items published on 
the JMS queue, and, more importantly, the JMS client cannot position itself at 
a specific data item to consume. 

- there is no way to tie acknowledging the consumption of a JMS queue message 
to publishing it to the Pulsar topic in an atomic transaction.

Therefore **no strategy exists** to achieve EFFECTIVELY ONCE delivery 
semantics from a JMS queue to a Pulsar topic.

-------------------------

Please correct me if I got this wrong. 

Thanks! 




GitHub link: https://github.com/apache/pulsar/discussions/24605
