2018-11-06 13:34:48 UTC - Aaron Claypoole: @Aaron Claypoole has joined the 
channel
----
2018-11-06 21:49:08 UTC - Aniket: In the case of active-standby replication, when 
the active DC goes down and the producer/consumer are moved to the fallback DC, 
is there any information on how offsets are determined? Is there a possibility 
of message loss if the offset is moved ahead during this time, since there can 
be a few messages in the source DC that have not yet been replicated to the 
fallback DC?
----
2018-11-06 21:49:10 UTC - Aniket: 
<https://streaml.io/blog/geo-replication-patterns-practices>
----
2018-11-06 22:02:52 UTC - Matteo Merli: @Aniket the subscriptions (which 
maintain offsets) are local to one particular region, so there’s no information 
being passed on at this point.
----
2018-11-06 22:30:19 UTC - Aniket: Ok, then in that case when the DC goes down 
and the producer and consumer are moved to the fallback DC, the consumer starts 
from the first message?
----
2018-11-06 22:30:35 UTC - Aniket: that can lead to a lot of extra work
----
2018-11-06 22:30:47 UTC - Matteo Merli: Depends on how you configure the consumers.
----
2018-11-06 22:30:58 UTC - Aniket: Ok
----
2018-11-06 22:31:15 UTC - Matteo Merli: One option is to manually reset (by 
time) the subscription in the fallback DC
----
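The manual reset by time maps to the Pulsar CLI's cursor-reset command. A 
minimal sketch, where the topic and subscription names are placeholders:

```shell
# Rewind the subscription in the fallback DC to 10 minutes before the failover,
# so any messages replicated in that window are re-delivered instead of skipped.
# Topic and subscription names below are placeholders.
pulsar-admin topics reset-cursor \
  persistent://public/default/my-topic \
  --subscription my-sub \
  --time 10m
```
----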
2018-11-06 22:32:08 UTC - Aniket: yes, I was thinking of that option but there 
is a limitation with that as well (unless we keep some extra state)
----
2018-11-06 22:32:22 UTC - Aniket: DC1 goes down, so we move the consumer to DC2
----
2018-11-06 22:32:30 UTC - Aniket: and also the producer
----
2018-11-06 22:32:57 UTC - Aniket: the consumer will consume the existing 
replicated events and also start consuming new events from the producer in DC2
----
2018-11-06 22:33:31 UTC - Matteo Merli: In case of a DC failover, it’s 
typically better to initiate it manually
----
2018-11-06 22:34:00 UTC - Aniket: When the source DC comes back online, we will 
have to:
1. move the consumer and producer back to the source DC
2. replicate the events produced in the fallback DC to the source DC
3. process the events from before the failover first
----
2018-11-06 22:34:09 UTC - Matteo Merli: so, before doing that, you can reset 
the subscriptions to ~10 mins earlier and then fail over
----
2018-11-06 22:34:18 UTC - Aniket: > In case of a DC failover, it’s typically 
better to initiate it manually
----
2018-11-06 22:34:19 UTC - Aniket: ok
----
2018-11-06 22:34:45 UTC - Matteo Merli: The other option is to always have 
consumers in both DCs, and just move the producers
----
2018-11-06 22:35:29 UTC - Aniket: yeah, but the requirement might be to keep 
the redundant work as low as possible
----
2018-11-06 22:35:53 UTC - Matteo Merli: sure
----
2018-11-06 22:36:19 UTC - Aniket: I am also worried about the message loss when 
disaster recovery happens
----
2018-11-06 22:36:35 UTC - Aniket: if I try to avoid redundant work during the 
time of failover
----
2018-11-06 22:37:46 UTC - Matteo Merli: with async replication, the messages in 
flight between DC1 and DC2 might either come out of order (when DC1 comes back) 
or be lost, if it doesn’t come back
----
2018-11-06 22:37:59 UTC - Aniket: ok
----
2018-11-06 22:38:15 UTC - Matteo Merli: if you want to avoid that, then you 
should consider using sync-replication
----
2018-11-06 22:38:27 UTC - Aniket: But it will be committed to DC1 first and then 
replicated to DC2. So, technically my understanding is that it will still be 
there in DC1
----
2018-11-06 22:38:38 UTC - Aniket: yes, I am talking about sync-replication
----
2018-11-06 22:39:17 UTC - Matteo Merli: with sync replication, each DC is seen 
as a logical “rack” of machines
----
2018-11-06 22:39:33 UTC - Aniket: right
----
2018-11-06 22:39:48 UTC - Matteo Merli: when you configure 3 replicas, the 3 
replicas will be placed on nodes in different DCs
----
2018-11-06 22:39:58 UTC - Matteo Merli: there’s no failover from client 
perspective
----
2018-11-06 22:40:26 UTC - Matteo Merli: the service will be up, even if one 
region is unavailable
----
2018-11-06 22:41:33 UTC - Aniket: Yes, I understand
----
2018-11-06 22:41:46 UTC - Aniket: my concern is specifically with offsets
----
2018-11-06 22:42:13 UTC - Aniket: because 1. it can cause redundant processing 
of messages, and 2. it can cause loss of messages
----
2018-11-06 22:42:39 UTC - Matteo Merli: with sync replication, from Pulsar’s 
perspective it is a single “cluster”, spanning multiple DCs
----
2018-11-06 22:42:56 UTC - Matteo Merli: Subscription (and offset) is then 
consistent in this case
----
2018-11-06 22:43:20 UTC - Matteo Merli: consumer will be automatically 
redirected to an available broker
----
2018-11-06 22:43:35 UTC - Matteo Merli: restarting from next unacked message
----
2018-11-06 22:44:33 UTC - Aniket: Say messages 1 to 2M are created in the Main 
DC, and 1.5M are replicated to the Failover DC. The consumer consumes 500K 
messages in the Main DC. Now there is an outage; at this moment the offset for 
the topic in the Main DC is 500K. So when the consumer + producer fail over to 
the Failover DC, if the offset is not present the consumer will start from 
message 1; if the offset is replicated as well it will start at 500,001. But if 
the producer is producing events in the Failover DC for the same topic, it can 
effectively overwrite the 500K messages that were not yet replicated from the 
Main DC, and the offset can move forward. Maybe I am confused or have some 
misunderstanding
----
2018-11-06 22:45:37 UTC - Matteo Merli: If it’s “sync” replication, there’s no 
failover so to speak, just that some machines in the cluster are not available
----
2018-11-06 22:45:46 UTC - Aniket: i see, ok
----
2018-11-06 22:45:51 UTC - Aniket: I understand
----
2018-11-06 22:46:14 UTC - Aniket: Can you point me to some documentation that 
explains this more - so that I don’t disturb you more with questions?
----
2018-11-06 22:46:56 UTC - Aniket: 
<https://streaml.io/blog/apache-pulsar-geo-replication>
----
2018-11-06 22:47:03 UTC - Aniket: I came across this
----
2018-11-06 22:58:45 UTC - Aniket: Thanks for your feedback and answers to my 
questions :slightly_smiling_face:
----
2018-11-06 22:58:58 UTC - Aniket: appreciate your help
----
2018-11-06 23:27:44 UTC - Matteo Merli: I think we don’t have a ready-made 
tutorial I’m afraid (though we should).
The idea is to deploy the cluster spanning 3 DCs and then use the rack-aware 
placement policy for bookies, to make sure data is stored in all the DCs. Take 
a look at the `pulsar-admin bookies` command to set that up
----
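The `pulsar-admin bookies` setup can be sketched as follows; the bookie 
addresses, group, and rack names are all made up for illustration, and the 
brokers also need rack-aware placement enabled 
(`bookkeeperClientRackawarePolicyEnabled=true` in `broker.conf`):

```shell
# Tag each bookie with a "rack" named after its DC, so BookKeeper's rack-aware
# placement policy spreads the 3 replicas across the 3 DCs.
# All addresses, the group, and the rack names below are placeholders.
pulsar-admin bookies set-bookie-rack \
  --bookie bookie1.dc1.example.com:3181 --group default --rack dc1
pulsar-admin bookies set-bookie-rack \
  --bookie bookie2.dc2.example.com:3181 --group default --rack dc2
pulsar-admin bookies set-bookie-rack \
  --bookie bookie3.dc3.example.com:3181 --group default --rack dc3
```
----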
2018-11-06 23:41:03 UTC - Aniket: Ok, I will check it out, thank you
----
2018-11-07 08:41:40 UTC - David Tinker: I am busy testing Pulsar 2.2.0 using 2 
consumers on the same subscription configured for failover. It seems that 
sometimes the "inactive" consumer doesn't get activated when the "active" 
consumer shuts down. Mostly it works but sometimes it doesn't. I can see from 
/admin/v2/.../topic/stats that the live consumer is connected but I don't get 
any 'becameActive' notification or messages and the backlog builds up. I am 
using the Java client. Any ideas? I suspect I can work around this by 
periodically re-starting my "inactive" consumers.
----
2018-11-07 08:44:22 UTC - David Tinker: {
  "msgRateIn" : 0.26666650042232587,
  ...
  "publishers" : [ { ... } ],
  "subscriptions" : {
    "gammon" : {
      "msgRateOut" : 0.0, "msgThroughputOut" : 0.0, "msgRateRedeliver" : 0.0,
      "msgBacklog" : 419,
      "blockedSubscriptionOnUnackedMsgs" : false, "unackedMessages" : 0,
      "type" : "Failover",
      "msgRateExpired" : 0.0,
      "consumers" : [ {
        "msgRateOut" : 0.0, "msgThroughputOut" : 0.0, "msgRateRedeliver" : 0.0,
        "consumerName" : "gammon-e1d",
        "availablePermits" : 1000,
        "unackedMessages" : 0,
        "blockedConsumerOnUnackedMsgs" : false,
        "metadata" : { },
        "address" : "...",
        "clientVersion" : "2.2.0",
        "connectedSince" : "2018-11-07T08:18:28.751Z"
      } ]
    }
  },
----
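A rough sketch of the periodic-restart workaround mentioned above, polling the 
same stats shown in the previous message; the topic name, `jq` usage, and 
service name are all assumptions:

```shell
# Workaround sketch: if the failover subscription shows a backlog but a zero
# delivery rate, restart the standby consumer so it re-registers with the broker.
# Topic, subscription, and service names below are placeholders.
TOPIC="persistent://public/default/my-topic"
SUB="gammon"
if pulsar-admin topics stats "$TOPIC" \
    | jq -e --arg s "$SUB" \
        '.subscriptions[$s].msgBacklog > 0 and .subscriptions[$s].msgRateOut == 0' \
        > /dev/null; then
  systemctl restart my-consumer.service
fi
```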
