2020-01-26 10:07:04 UTC - Gaetan SNL: @Sijie Guo thank you! I thought it was already implemented. ---- 2020-01-26 10:50:56 UTC - Eugen: Suppose I have 2 (redundant) producers that under normal circumstances produce the exact same messages with the same SequenceIds to the same topic, but sometimes there may be gaps, so e.g. producer 1 produces messages 1, 2, 3, whereas producer 2 produces 1, 3 - so it does not produce message 2 (but the SequenceIds of producer 2 match those of producer 1 for all messages), and sometimes, one producer may fail for an extended period of time, but when it does come back its messages and sequence ids are in sync with the other producer, although there will be a large gap for the recovered producer. Can Pulsar take care of deduplication in this case as well? The docs for `ProducerBuilder.producerName()` seem to indicate otherwise: ```Specify a name for the producer. ... Brokers will enforce that only a single producer a given name can be publishing on a topic.``` So it seems I would either a) have to put a process in-between the 2 redundant processes and the pulsar producer process in order to take care of the deduplication, hence creating a SPOF, or b) produce into 2 different topics and do deduplication using another process (a Pulsar Function perhaps?) that reads off of those 2 topics, incurring a lot of overhead and latency ---- 2020-01-26 12:23:08 UTC - Fernando: So I managed to add mTLS to the Elasticsearch sink. It connects and I see a new index in elasticsearch created by the sink. However, no documents seem to be pushed to the index. 
The stats suggest messages are going through ```INFO org.apache.pulsar.client.impl.ConsumerStatsRecorderImpl - [<persistent://packhunt/billing/aggregated-billing-records>] [packhunt/billing/aggregate-records-es-sink] [78d97] Prefetched messages: 0 --- Consume throughput received: 1.23 msgs/s --- 0.00 Mbit/s --- Ack sent rate: 0.00 ack/s --- Failed messages: 0 --- batch messages: 0 ---Failed acks: 0``` Could it be a compatibility issue with the version of elasticsearch? ---- 2020-01-26 14:52:36 UTC - Dan: @Dan has joined the channel ---- 2020-01-26 16:50:33 UTC - Sijie Guo: > Can Pulsar take care of deduplication in this case as well? It can, as long as your producer 2 produces the messages with the same sequence ids and with increasing sequence ids.
However, producer 2 cannot connect to the broker while producer 1 is connected; only one producer with a given name can connect to the brokers at a time. ---- 2020-01-26 16:51:11 UTC - Sijie Guo: I would suggest first trying out the connector without mTLS and trying the same elastic version. ---- 2020-01-26 20:13:31 UTC - Eugen: So this does not work with both producers active at the same time. The use case is this: We have a UDP multicast data source that we receive twice, completely redundantly (different switches etc.), and - except for the multicast group numbers - both sources are completely identical. What I would like to have is two producers, each receiving one of those redundant streams, and both of them, active/active, pushing into Pulsar. Pulsar would throw away almost half of the data, but in cases where one of the producers is missing a UDP packet, it would use the UDP packet of the other producer. But I now see this won't work for two reasons: 1) as you said, only one can be connected at the same time 2) Pulsar does not care if there is a gap in sequence numbers, so even if it were able to deduplicate 2 (normally) identical streams, in cases where there are gaps, the order may not be preserved if producer p1 produces x, x+2 (x+1 being a UDP packet that did not make it to p1) and then p2 produces x+1 ---- 2020-01-26 20:20:38 UTC - Dan: hey, has anyone been able to get a Pulsar SQL cluster up and running out of the box with 2.5.0? it fails for me. added this issue: <https://github.com/apache/pulsar/issues/6146> ---- 2020-01-26 20:53:22 UTC - Eugen: So would creating (and implementing) a PIP for this "active/active deduplication" use case with these 2 (orthogonal) features be realistic? 1. add a "deduplicate across producers" option to topics, which would make sure that the last seen SeqId is stored and handled per topic, not per producer 2. 
add an option for deduplication which orders messages with out-of-order SeqIds. For this exact use case, I have had to implement this feature twice already, so I know how to do it. With perhaps a difference in that this ordering and deduplication was performed in memory, whereas in the case of Pulsar we would perhaps want to persist messages as they come in (unordered) first, as Pulsar makes guarantees about messages being persisted once acked to the producer. For 2., it would mean that whenever there is a gap in the incoming SeqId, the broker would hold the message in a buffer and wait for a configurable number of milliseconds in which the gap may get filled, or if not, the broker will give up and make the gap permanent (if the missing message came in afterwards, it would get discarded). ---- 2020-01-26 21:15:36 UTC - Eugen: So for 2. there would be at least 1 configurable parameter: *missing-seqid-message-wait-timeout*. ---- 2020-01-26 21:17:01 UTC - Eugen: And for our use-case, we'd need another, also orthogonal, independently implementable feature, as the SeqIds are reset to 0 (or 1, don't remember) at the start of the day, so I'd like to add 3. add admin (?) function to reset HighestSequenceId ---- 2020-01-26 21:26:02 UTC - Eugen: This btw is a use-case coming out of the stock exchange market data feed world, so it is somewhat different from most web based event stream use cases, where you never have 2 (almost) exactly redundant streams - although I'd imagine Yahoo Finance, which is said to be an early (and ongoing?) 
user of Pulsar, may be able to make use of this as well ---- 2020-01-26 22:04:31 UTC - Eugen: Anyways, this is a topic for the *dev* channel, so I'll restate my case over there ---- 2020-01-27 02:49:54 UTC - Antti Kaikkonen: @Antti Kaikkonen has joined the channel ---- 2020-01-27 06:18:28 UTC - Fernando: First thing I'd try is setting the IP of the proxy instead of the individual brokers ---- 2020-01-27 08:44:12 UTC - Kevin Huber: @Kevin Huber has joined the channel ----
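The three features Eugen proposes in this thread (a highest SeqId tracked per topic rather than per producer, a bounded wait for gaps, and a reset of the highest SeqId) can be sketched in memory roughly as follows. This is only an illustration in Python, not Pulsar code and not an existing Pulsar API: the class `GapFillingDeduplicator`, its methods, and the `wait_timeout_ms` parameter (standing in for the proposed *missing-seqid-message-wait-timeout*) are all hypothetical names, time is passed in explicitly instead of using a broker timer, and the persistence question raised above is ignored.

```python
from typing import Dict, List, Tuple


class GapFillingDeduplicator:
    """Hypothetical per-topic deduplicator shared by all producers."""

    def __init__(self, wait_timeout_ms: int, first_seq_id: int = 1) -> None:
        self.wait_timeout_ms = wait_timeout_ms
        # Highest sequence id already emitted for the whole topic.
        self.last_emitted = first_seq_id - 1
        # Out-of-order messages held back: seq_id -> (payload, arrival time).
        self._pending: Dict[int, Tuple[bytes, int]] = {}

    def offer(self, seq_id: int, payload: bytes, now_ms: int) -> List[Tuple[int, bytes]]:
        """Accept a message from any producer; return messages now in order."""
        if seq_id <= self.last_emitted or seq_id in self._pending:
            return []  # duplicate, or it arrived after its gap was given up
        self._pending[seq_id] = (payload, now_ms)
        return self._drain()

    def expire(self, now_ms: int) -> List[Tuple[int, bytes]]:
        """Give up on gaps whose held-back messages waited at least the timeout."""
        out: List[Tuple[int, bytes]] = []
        while self._pending:
            oldest = min(arrival for _, arrival in self._pending.values())
            if now_ms - oldest < self.wait_timeout_ms:
                break
            # Make the gap permanent by skipping to the lowest buffered id.
            self.last_emitted = min(self._pending) - 1
            out.extend(self._drain())
        return out

    def reset(self, first_seq_id: int = 1) -> None:
        """Start-of-day reset of the highest sequence id (proposal 3)."""
        self.last_emitted = first_seq_id - 1
        self._pending.clear()

    def _drain(self) -> List[Tuple[int, bytes]]:
        # Emit buffered messages for as long as the next expected id is present.
        out: List[Tuple[int, bytes]] = []
        while self.last_emitted + 1 in self._pending:
            self.last_emitted += 1
            payload, _ = self._pending.pop(self.last_emitted)
            out.append((self.last_emitted, payload))
        return out
```

With this sketch, if p1 delivers 1 and 3 while p2 delivers 1, 2 and 3, message 3 is held until 2 arrives from p2, and the second copies of 1 and 3 are dropped. A real broker-side version would also have to decide what it acks to producers while a message sits in the buffer.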
